---
tags:
- deception-detection
- probe
- apollo
library_name: pytorch
license: mit
---
# Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female
A probe trained to detect deceptive behaviour in **Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female** using residual stream activations, following the methodology from [Detecting Strategic Deception Using Linear Probes (Goldowsky-Dill et al., Apollo Research, 2025)](https://arxiv.org/abs/2502.03407). We emphasise that we have **not found that these probes reliably classify deception**, so they may be best suited to serving as a baseline for other work.
## Quick Start
```python
from deception import get_probe_hf  # the `deception` package is not yet public; this import will work once it is open sourced
probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")
```
The default checkpoint is the best performer from the hyperparameter sweep (`l_38_lm_500000_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:
```python
probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
```
Use `sweep.json` to see all 296 available checkpoints and their metrics.
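The schema of `sweep.json` is not documented here, so the following is only a minimal sketch for inspecting the file once you have a local copy (for example, downloaded from the repository's file listing). The helper name `summarise_sweep` is hypothetical, and the sketch deliberately makes no assumption about which keys or metrics the file contains.

```python
import json
from pathlib import Path


def summarise_sweep(path):
    """Return (container type, entry count, up to five sample keys) for a JSON file.

    Hypothetical helper: it assumes nothing about the sweep schema and only
    reports the top-level structure so you can decide how to query it.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Top-level mapping, e.g. one entry per checkpoint (an assumption).
        sample_keys = list(data)[:5]
    elif isinstance(data, list) and data and isinstance(data[0], dict):
        # Top-level list of records: show the fields of the first record.
        sample_keys = list(data[0])[:5]
    else:
        sample_keys = []
    count = len(data) if isinstance(data, (dict, list)) else 1
    return type(data).__name__, count, sample_keys


if __name__ == "__main__":
    local_copy = Path("sweep.json")  # assumed local path
    if local_copy.exists():
        print(summarise_sweep(local_copy))
```

Once the structure is known, the checkpoint filename you settle on can be passed as `filename=` to `get_probe_hf`.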
## Model Details
| Property | Value |
| --- | --- |
| Target model | `Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female` |
| Safe name | `Qwen--Qwen3.6-27B@ai-safety-institute--Qwen3.6-27B-gender_secret_female` |
| Default checkpoint | `l_38_lm_500000_ar_lr.pt` |
| Available checkpoints | 296 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |
## Training Data
Probes are trained on an **instructed pairs** dataset (model instructed to be deceptive vs. honest) based on [Facts True False](https://huggingface.co/datasets/L1Fthrasir/Facts-true-false) and calibrated on [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) (honest-only baseline) to achieve a 1% false positive rate.
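To illustrate the calibration step, here is a minimal sketch of choosing a threshold that yields at most a 1% false positive rate on honest-only scores. It is not the code used to produce these probes: the function name `threshold_at_fpr` is hypothetical, the scores are synthetic stand-ins for probe outputs on Alpaca, and it assumes that a higher score means "more deceptive" and that a response is flagged when its score is strictly above the threshold.

```python
import numpy as np


def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Pick a threshold so that at most `target_fpr` of honest scores exceed it.

    Assumes higher score = more deceptive, and that a sample is flagged
    when score > threshold (both are assumptions of this sketch).
    """
    scores = np.sort(np.asarray(honest_scores, dtype=float))
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one honest score to calibrate")
    # Index of the smallest score with at least (1 - target_fpr) of the data at or below it.
    k = int(np.ceil((1.0 - target_fpr) * n)) - 1
    k = min(max(k, 0), n - 1)
    return float(scores[k])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic placeholder for probe scores on honest (Alpaca-style) responses.
    honest = rng.normal(loc=0.0, scale=1.0, size=10_000)
    threshold = threshold_at_fpr(honest, target_fpr=0.01)
    empirical_fpr = float(np.mean(honest > threshold))
    print(f"threshold={threshold:.3f}, empirical FPR={empirical_fpr:.4f}")
```

A threshold calibrated this way only controls false positives on the calibration distribution; the rate on other honest data may differ.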
## Citation
### Original Paper
```bibtex
@misc{goldowskydill2025detectingstrategicdeceptionusing,
title={Detecting Strategic Deception Using Linear Probes},
author={Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn},
year={2025},
eprint={2502.03407},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.03407},
}
```
### Trained Probes
```bibtex
@misc{cooney2025deceptionprobes,
title={Apollo-Style Deception Probes},
author={Alan Cooney},
year={2025},
url={https://huggingface.co/collections/ai-safety-institute/apollo-style-deception-probes},
}
```