metadata
tags:
  - deception-detection
  - probe
  - apollo
library_name: pytorch
license: mit

Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female

A probe trained to detect deceptive behaviour in Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female using residual stream activations, following the methodology of Detecting Strategic Deception Using Linear Probes (Goldowsky-Dill et al., Apollo Research, 2025).
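As a rough illustration of what a probe over residual stream activations computes, the sketch below scores each token's activation vector with a logistic-regression readout and averages the per-token scores over a response. This is a minimal sketch, not this repository's implementation: the function names, the toy weights, and the mean aggregation are assumptions made for illustration, and the published checkpoints may use a different architecture or aggregation.

```python
import math

def probe_score(activation, weights, bias):
    """Sigmoid of a linear readout over one activation vector (hypothetical helper)."""
    logit = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def response_score(token_activations, weights, bias):
    """Mean of per-token probe scores (assumed aggregation, for illustration only)."""
    scores = [probe_score(a, weights, bias) for a in token_activations]
    return sum(scores) / len(scores)

# Toy 2-dimensional "activations"; real residual stream vectors are far larger.
weights = [1.0, -1.0]
bias = 0.0
token_activations = [[0.0, 0.0], [2.0, 0.0]]
score = response_score(token_activations, weights, bias)
print(round(score, 3))
```

A higher score means the probe considers the response more likely to be deceptive; a threshold on this score turns it into a yes/no decision.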

Quick Start

uv add lie-detectors        # or: pip install lie-detectors

from lie_detectors import get_probe

probe = get_probe("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")

The default checkpoint is the best performer from the hyperparameter sweep (l_38_lm_500000_ar_lr.pt). To pick a specific checkpoint, pass filename=:

probe = get_probe("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")

See UKGovernmentBEIS/lie_detectors for the loading library.

Use sweep.json to see all 296 available checkpoints and their metrics.

Model Details

| Property | Value |
| --- | --- |
| Target model | Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female |
| Safe name | Qwen--Qwen3.6-27B@ai-safety-institute--Qwen3.6-27B-gender_secret_female |
| Default checkpoint | l_38_lm_500000_ar_lr.pt |
| Available checkpoints | 296 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |
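The safe name and the probe repository id used in Quick Start both appear to be derived from the target model string by simple character substitutions. The sketch below reproduces that mapping for this card only. The helper names are hypothetical, and the rules are inferred from the names shown here rather than taken from the lie_detectors library, so treat them as an assumption.

```python
def safe_name(target: str) -> str:
    """Filesystem-friendly name: '/' becomes '--' and ':' becomes '@' (inferred rule)."""
    return target.replace("/", "--").replace(":", "@")

def probe_repo_id(target: str, org: str = "ai-safety-institute") -> str:
    """Probe repo id: lowercased target, '/' becomes '-', ':' becomes '__' (inferred rule)."""
    slug = target.lower().replace("/", "-").replace(":", "__")
    return f"{org}/apollo-{slug}"

target = "Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female"
print(safe_name(target))
print(probe_repo_id(target))
```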

Training Data

Probes are trained on an instructed-pairs dataset (the model is instructed to be either deceptive or honest) built from Facts True False. The decision threshold is then calibrated on Alpaca, used as an honest-only baseline, so that the probe flags 1% of honest responses (a 1% false positive rate).
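Calibrating to a fixed false positive rate amounts to choosing a score threshold that only a small fraction of honest responses exceed. The sketch below shows one common way to do this from a list of probe scores on honest data. The function name and the synthetic scores are illustrative assumptions, and the exact procedure used for these checkpoints is not stated in this card.

```python
import math

def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Return a threshold t such that at most target_fpr of honest scores are > t."""
    if not honest_scores:
        raise ValueError("need at least one honest score")
    ranked = sorted(honest_scores, reverse=True)
    # Number of honest examples we allow to be flagged as deceptive.
    allowed = math.floor(target_fpr * len(ranked))
    # Flagging only scores strictly above the (allowed + 1)-th highest score
    # keeps the number of false positives at or below `allowed`.
    return ranked[min(allowed, len(ranked) - 1)]

# Synthetic honest-baseline scores standing in for probe outputs on Alpaca.
honest_scores = [i / 1000 for i in range(1000)]
threshold = threshold_at_fpr(honest_scores, target_fpr=0.01)
flagged = sum(s > threshold for s in honest_scores)
print(threshold, flagged / len(honest_scores))
```

At inference time, a response would be flagged as deceptive when its probe score is strictly greater than the calibrated threshold.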

Citation

Original Paper

@misc{goldowskydill2025detectingstrategicdeceptionusing,
      title={Detecting Strategic Deception Using Linear Probes},
      author={Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn},
      year={2025},
      eprint={2502.03407},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.03407},
}

Trained Probes

@misc{cooney2026liedetectors,
      title={``Did you lie?'' Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms},
      author={Alan Cooney and David Africa and Geoffrey Irving},
      year={2026},
      month={May},
}