tags:
  - deception-detection
  - probe
  - apollo
library_name: pytorch
license: mit

Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female

A probe trained to detect deceptive behaviour in Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female from residual stream activations, following the methodology of Detecting Strategic Deception Using Linear Probes (Goldowsky-Dill et al., Apollo Research, 2025). We emphasise that we have not found these probes to classify deception reliably, so they may be best suited as a baseline for other work.

Quick Start

from deception import get_probe_hf  # The deception package is not yet open sourced; this import will work once it is

probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")

The default checkpoint is the best performer from the hyperparameter sweep (l_44_lm_500000_ar_lr.pt). To load a specific checkpoint instead, pass the filename argument:

probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")

Use sweep.json to see all 296 available checkpoints and their metrics.
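Until the deception package is released, sweep.json can be fetched directly from the Hub. The sketch below is one way to do that with hf_hub_download from huggingface_hub. The schema of sweep.json is not documented here, so the helper only reports the top-level shape of the file rather than assuming any field names; the function names download_sweep, describe_json and inspect_sweep are illustrative and not part of any released API.

```python
import json
from pathlib import Path

REPO_ID = (
    "ai-safety-institute/"
    "apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female"
)


def download_sweep(repo_id: str = REPO_ID) -> Path:
    """Fetch sweep.json from the Hub and return its local cache path."""
    # Imported lazily so describe_json() stays usable without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=repo_id, filename="sweep.json"))


def describe_json(data, max_keys: int = 5) -> str:
    """Summarise the top-level shape of a parsed JSON value (schema-agnostic)."""
    if isinstance(data, dict):
        keys = list(data)[:max_keys]
        return f"dict with {len(data)} keys, e.g. {keys}"
    if isinstance(data, list):
        return f"list with {len(data)} items"
    return type(data).__name__


def inspect_sweep(repo_id: str = REPO_ID) -> str:
    """Download sweep.json and describe its structure (needs network access)."""
    data = json.loads(download_sweep(repo_id).read_text())
    return describe_json(data)
```

Calling print(inspect_sweep()) shows whether the file is a list of checkpoint records or a mapping keyed by checkpoint, which is enough to decide how to sort it by the metric of interest.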

Model Details

Property	Value
Target model	`Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female`
Safe name	`Qwen--Qwen3.6-27B@ai-safety-institute--Qwen3.6-27B-gender_secret_female`
Default checkpoint	`l_44_lm_500000_ar_lr.pt`
Available checkpoints	296
Calibration	Threshold set at 1% FPR on Alpaca (honest baseline)

Training Data

Probes are trained on an instructed-pairs dataset, in which the model is instructed to be either deceptive or honest, built from Facts True False. The decision threshold is then calibrated on Alpaca, an honest-only baseline, to give a 1% false positive rate.
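Calibrating to a 1% false positive rate amounts to choosing a threshold that at most 1% of honest-baseline scores exceed. The following is a minimal sketch of that idea, not the code used to produce these checkpoints. It assumes each response gets a single scalar probe score and that higher scores mean "more deceptive"; the function names and the synthetic Gaussian scores are illustrative only.

```python
import math
import random


def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Return a cut-off such that at most `target_fpr` of honest scores lie strictly above it."""
    if not honest_scores:
        raise ValueError("need at least one honest score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ranked = sorted(honest_scores)
    # Number of honest examples we are allowed to flag as deceptive.
    allowed = math.floor(target_fpr * len(ranked))
    # Scores strictly above this value are flagged.
    return ranked[len(ranked) - allowed - 1]


def false_positive_rate(honest_scores, threshold):
    """Fraction of honest scores that would be flagged at `threshold`."""
    return sum(s > threshold for s in honest_scores) / len(honest_scores)


# Synthetic stand-in for probe scores on an honest-only calibration set.
random.seed(0)
honest = [random.gauss(0.0, 1.0) for _ in range(10_000)]

threshold = threshold_at_fpr(honest, target_fpr=0.01)
fpr = false_positive_rate(honest, threshold)
print(f"threshold={threshold:.3f}, honest FPR={fpr:.4f}")
```

A threshold chosen this way controls false positives only on the calibration distribution; the rate on other honest data may differ.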

Citation

Original Paper

@misc{goldowskydill2025detectingstrategicdeceptionusing,
      title={Detecting Strategic Deception Using Linear Probes},
      author={Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn},
      year={2025},
      eprint={2502.03407},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.03407},
}

Trained Probes

@misc{cooney2025deceptionprobes,
      title={Apollo-Style Deception Probes},
      author={Alan Cooney},
      year={2025},
      url={https://huggingface.co/collections/ai-safety-institute/apollo-style-deception-probes},
}