---
tags:
- deception-detection
- probe
- apollo
library_name: pytorch
license: mit
---

# Apollo-Style Deception Probe for google/gemma-2-9b-it

A probe trained to detect deceptive behaviour in **google/gemma-2-9b-it** from residual stream activations, following the methodology of [Detecting Strategic Deception Using Linear Probes (Apollo Research, 2025)](https://arxiv.org/abs/2502.03407).

## Quick Start

```bash
uv add lie-detectors  # or: pip install lie-detectors
```

```python
from lie_detectors import get_probe

probe = get_probe("ai-safety-institute/apollo-google-gemma-2-9b-it")
```

The default checkpoint is the best performer from the hyperparameter sweep (`l_20_ar_dim.pt`). To load a specific checkpoint instead, pass `filename=`:

```python
probe = get_probe(
    "ai-safety-institute/apollo-google-gemma-2-9b-it",
    filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt",
)
```

See [UKGovernmentBEIS/lie_detectors](https://github.com/UKGovernmentBEIS/lie_detectors) for the loading library. `sweep.json` lists all 296 available checkpoints and their metrics.

## Model Details

| Property | Value |
| --- | --- |
| Target model | `google/gemma-2-9b-it` |
| Safe name | `google--gemma-2-9b-it` |
| Default checkpoint | `l_20_ar_dim.pt` |
| Available checkpoints | 296 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |

## Training Data

Probes are trained on an **instructed pairs** dataset (the model is instructed to be deceptive vs. honest) based on [Facts True False](https://huggingface.co/datasets/L1Fthrasir/Facts-true-false). They are then calibrated on [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), used as an honest-only baseline, so that the decision threshold gives a 1% false positive rate.
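To make the calibration step concrete, here is a minimal sketch of picking a threshold at a target false positive rate from probe scores on honest-only data. It is an illustration rather than the library's actual calibration code: the helper names `calibrate_threshold` and `false_positive_rate` are hypothetical, the scores are synthetic stand-ins for probe outputs on Alpaca, and it assumes that a higher score means "more likely deceptive".

```python
import random


def calibrate_threshold(honest_scores, target_fpr=0.01):
    """Pick a threshold so that at most `target_fpr` of honest scores exceed it.

    Hypothetical helper: assumes higher score = more likely deceptive.
    """
    if not honest_scores:
        raise ValueError("need at least one honest score")
    ordered = sorted(honest_scores)
    # Number of honest examples we tolerate being flagged as deceptive.
    allowed = int(target_fpr * len(ordered))
    # Use the (allowed + 1)-th largest score as the threshold; only scores
    # strictly above it are flagged.
    return ordered[len(ordered) - allowed - 1]


def false_positive_rate(honest_scores, threshold):
    """Fraction of honest examples whose score is strictly above the threshold."""
    flagged = sum(score > threshold for score in honest_scores)
    return flagged / len(honest_scores)


# Synthetic stand-in for probe scores on an honest-only dataset.
rng = random.Random(0)
honest_scores = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

threshold = calibrate_threshold(honest_scores, target_fpr=0.01)
fpr = false_positive_rate(honest_scores, threshold)
print(f"threshold={threshold:.3f}, empirical FPR={fpr:.4f}")
```

A response would then be flagged as deceptive when its probe score exceeds `threshold`. Because the threshold is fit on honest data only, it controls the false positive rate but says nothing by itself about how many deceptive responses are caught.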
## Citation

### Original Paper

```bibtex
@misc{goldowskydill2025detectingstrategicdeceptionusing,
  title={Detecting Strategic Deception Using Linear Probes},
  author={Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn},
  year={2025},
  eprint={2502.03407},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.03407},
}
```

### Trained Probes

```bibtex
@misc{cooney2026liedetectors,
  title={``Did you lie?'' Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms},
  author={Alan Cooney and David Africa and Geoffrey Irving},
  year={2026},
  month={May},
}
```