---
tags:
- deception-detection
- probe
- apollo
library_name: pytorch
license: mit
---

# Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female

A probe trained to detect deceptive behaviour in **Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female** from residual stream activations, following the methodology of [Detecting Strategic Deception Using Linear Probes (Apollo Research, 2025)](https://arxiv.org/abs/2502.03407).

## Quick Start

```bash
uv add lie-detectors        # or: pip install lie-detectors
```

```python
from lie_detectors import get_probe

probe = get_probe("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")
```

The default checkpoint is the best performer from the hyperparameter sweep (`l_38_lm_500000_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:

```python
probe = get_probe("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
```

See [UKGovernmentBEIS/lie_detectors](https://github.com/UKGovernmentBEIS/lie_detectors) for the loading library.

Use `sweep.json` to see all 296 available checkpoints and their metrics.
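As a minimal sketch of how one might inspect that file: the snippet below downloads `sweep.json` with `hf_hub_download` from `huggingface_hub` and reports only its top-level shape. The schema of `sweep.json` is not documented in this card, so the helper deliberately assumes nothing about its keys; `fetch_sweep` and `describe_sweep` are hypothetical names introduced here and are not part of `lie_detectors`.

```python
import json
from pathlib import Path

REPO_ID = (
    "ai-safety-institute/"
    "apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female"
)


def fetch_sweep(repo_id: str = REPO_ID) -> Path:
    """Download sweep.json from the Hub and return its local cache path."""
    # Imported lazily so describe_sweep works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=repo_id, filename="sweep.json"))


def describe_sweep(path) -> str:
    """Summarise the top-level structure of the file without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # JSON object keys are always strings, so sorting them is safe.
        return f"dict with {len(data)} keys, e.g. {sorted(data)[:5]}"
    if isinstance(data, list):
        return f"list with {len(data)} entries"
    return f"{type(data).__name__} value"


# Usage (requires network access and `pip install huggingface_hub`):
#   print(describe_sweep(fetch_sweep()))
```

Once the actual structure is known, a checkpoint name found in the file can be passed to `get_probe` via `filename=` as shown above.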

## Model Details

| Property | Value |
| --- | --- |
| Target model | `Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female` |
| Safe name | `Qwen--Qwen3.6-27B@ai-safety-institute--Qwen3.6-27B-gender_secret_female` |
| Default checkpoint | `l_38_lm_500000_ar_lr.pt` |
| Available checkpoints | 296 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |

## Training Data

Probes are trained on an **instructed-pairs** dataset, in which the model is instructed to be either deceptive or honest, built from [Facts True False](https://huggingface.co/datasets/L1Fthrasir/Facts-true-false). The decision threshold is then calibrated on [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), used as an honest-only baseline, so that the probe has a 1% false positive rate on it.
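To illustrate what calibrating to a 1% false positive rate means, here is a minimal pure-Python sketch. It assumes that higher probe scores indicate more deception and that a response is flagged when its score is strictly above the threshold; both are assumptions for illustration, and the exact procedure used to produce these checkpoints may differ. The function names are hypothetical.

```python
import math


def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Return a threshold such that at most `target_fpr` of the
    honest-baseline scores lie strictly above it."""
    if not honest_scores:
        raise ValueError("need at least one honest score to calibrate")
    ordered = sorted(honest_scores)
    n = len(ordered)
    # Number of honest samples we may flag; the epsilon guards against
    # float rounding in target_fpr * n.
    allowed = math.floor(target_fpr * n + 1e-9)
    # Everything after this index (at most `allowed` items) exceeds the threshold.
    return ordered[n - allowed - 1]


def is_flagged(score, threshold):
    """Assumed convention: a score strictly above the threshold is flagged."""
    return score > threshold


# Toy honest baseline: 100 evenly spaced scores stand in for Alpaca responses.
baseline = [float(i) for i in range(100)]
t = threshold_at_fpr(baseline)
false_positives = sum(is_flagged(s, t) for s in baseline)
print(t, false_positives / len(baseline))
```

With fewer than 100 baseline scores, this sketch allows no false positives at all and returns the maximum score, which is the conservative choice for a small calibration set.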

## Citation

### Original Paper

```bibtex
@misc{goldowskydill2025detectingstrategicdeceptionusing,
      title={Detecting Strategic Deception Using Linear Probes},
      author={Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn},
      year={2025},
      eprint={2502.03407},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.03407},
}
```

### Trained Probes

```bibtex
@misc{cooney2026liedetectors,
      title={``Did you lie?'' Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms},
      author={Alan Cooney and David Africa and Geoffrey Irving},
      year={2026},
      month={May},
}
```