---
tags:
- deception-detection
- probe
- apollo
library_name: pytorch
license: mit
---
# Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female

A probe trained to detect deceptive behaviour in **Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female** from residual stream activations, following the methodology of [Detecting Strategic Deception Using Linear Probes (Goldowsky-Dill et al., 2025)](https://arxiv.org/abs/2502.03407). We emphasise that we have **not found that these probes reliably classify deception**, so they may be best suited as a baseline for other work.
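As a rough illustration of what a linear probe on residual stream activations computes, here is a minimal sketch. It assumes a single learned direction plus a bias, applied per token and averaged over the response; the function names, the toy dimensions, and the mean aggregation are illustrative assumptions, not the code used to train or run these checkpoints.

```python
import numpy as np


def probe_scores(activations, direction, bias=0.0):
    """Per-token probe logits for activations of shape (n_tokens, d_model)."""
    return activations @ direction + bias


def response_score(activations, direction, bias=0.0):
    """Collapse per-token logits into one score per response (assumed: mean)."""
    return float(probe_scores(activations, direction, bias).mean())


# Toy data standing in for real residual stream activations (hypothetical sizes).
rng = np.random.default_rng(0)
d_model = 16
direction = rng.normal(size=d_model)   # stands in for learned probe weights
acts = rng.normal(size=(5, d_model))   # 5 tokens from one response

score = response_score(acts, direction)
```

A higher score would indicate that the response looks more like the deceptive class the probe was trained on; whether it counts as flagged depends on the calibrated threshold.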
## Quick Start

```python
from deception import get_probe_hf  # Will work when open sourced

probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")
```

The default checkpoint is the best performer from the hyperparameter sweep (`l_38_lm_500000_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:

```python
probe = get_probe_hf(
    "ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female",
    filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt",
)
```

Use `sweep.json` to see all 296 available checkpoints and their metrics.
## Model Details

| Property | Value |
| --- | --- |
| Target model | `Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female` |
| Safe name | `Qwen--Qwen3.6-27B@ai-safety-institute--Qwen3.6-27B-gender_secret_female` |
| Default checkpoint | `l_38_lm_500000_ar_lr.pt` |
| Available checkpoints | 296 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |
## Training Data

Probes are trained on an **instructed pairs** dataset (the model is instructed to be either deceptive or honest) built from [Facts True False](https://huggingface.co/datasets/L1Fthrasir/Facts-true-false). The decision threshold is then calibrated on [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), used as an honest-only baseline, to give a 1% false positive rate.
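The calibration step can be sketched as follows, assuming each response gets a scalar probe score where higher means more likely deceptive. The function names and the synthetic scores are hypothetical; this is one way to pick a threshold at a target false positive rate, not necessarily how the published thresholds were computed.

```python
import numpy as np


def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Pick a threshold so that at most target_fpr of honest scores exceed it."""
    scores = np.asarray(honest_scores, dtype=float)
    # method="higher" returns an actual data point at or above the quantile,
    # so the empirical false positive rate never exceeds the target.
    return float(np.quantile(scores, 1.0 - target_fpr, method="higher"))


def flag_deceptive(scores, threshold):
    """Flag responses whose probe score is strictly above the threshold."""
    return np.asarray(scores, dtype=float) > threshold


# Synthetic stand-in for probe scores on an honest-only calibration set.
rng = np.random.default_rng(0)
honest = rng.normal(loc=0.0, scale=1.0, size=10_000)

threshold = threshold_at_fpr(honest, target_fpr=0.01)
empirical_fpr = flag_deceptive(honest, threshold).mean()
```

Because the threshold is fixed using honest data only, it bounds the false positive rate on that baseline but says nothing by itself about how many deceptive responses are caught.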
## Citation

### Original Paper

```bibtex
@misc{goldowskydill2025detectingstrategicdeceptionusing,
  title={Detecting Strategic Deception Using Linear Probes},
  author={Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn},
  year={2025},
  eprint={2502.03407},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.03407},
}
```

### Trained Probes

```bibtex
@misc{cooney2025deceptionprobes,
  title={Apollo-Style Deception Probes},
  author={Alan Cooney},
  year={2025},
  url={https://huggingface.co/collections/ai-safety-institute/apollo-style-deception-probes},
}
```