ai-safety-institute
/

apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female

Upload apollo probes for Qwen--Qwen3.6-27B@ai-safety-institute--Qwen3.6-27B-gender_secret_female

1385607 verified about 1 month ago

2.7 kB

	---
	tags:
	- deception-detection
	- probe
	- apollo
	library_name: pytorch
	license: mit
	---

	# Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female

	A probe trained to detect deceptive behaviour in Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female using residual stream activations, following the methodology from [Detecting Strategic Deception in Language Models (Apollo Research, 2024)](https://arxiv.org/abs/2405.09758). We emphasise that we have not found that these probes reliably classify deception, and they may therefore be best suited to baselining other work.

	## Quick Start

	```python
	from deception import get_probe_hf # Will work when open sourced

	probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")
	```

	The default checkpoint is the best performer from the hyperparameter sweep (`l_38_lm_500000_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:

	```python
	probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
	```

	Use `sweep.json` to see all 296 available checkpoints and their metrics.

	## Model Details

	\| Property \| Value \|
	\| --- \| --- \|
	\| Target model \| `Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female` \|
	\| Safe name \| `Qwen--Qwen3.6-27B@ai-safety-institute--Qwen3.6-27B-gender_secret_female` \|
	\| Default checkpoint \| `l_38_lm_500000_ar_lr.pt` \|
	\| Available checkpoints \| 296 \|
	\| Calibration \| Threshold set at 1% FPR on Alpaca (honest baseline) \|

	## Training Data

	Probes are trained on an instructed pairs dataset (model instructed to be deceptive vs. honest) based on [Facts True False](https://huggingface.co/datasets/L1Fthrasir/Facts-true-false) and calibrated on [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) (honest-only baseline) to achieve a 1% false positive rate.

	## Citation

	### Original Paper

	```bibtex
	@misc{goldowskydill2025detectingstrategicdeceptionusing,
	title={Detecting Strategic Deception Using Linear Probes},
	author={Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn},
	year={2025},
	eprint={2502.03407},
	archivePrefix={arXiv},
	primaryClass={cs.LG},
	url={https://arxiv.org/abs/2502.03407},
	}
	```

	### Trained Probes

	```bibtex
	@misc{cooney2025deceptionprobes,
	title={Apollo-Style Deception Probes},
	author={Alan Cooney},
	year={2025},
	url={https://huggingface.co/collections/ai-safety-institute/apollo-style-deception-probes},
	}
	```