Apollo-Style Deception Probe for meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_defend_objects
A probe trained to detect deceptive behaviour in meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_defend_objects using residual stream activations, following the methodology of Detecting Strategic Deception Using Linear Probes (Goldowsky-Dill et al., 2025).
Quick Start
uv add lie-detectors # or: pip install lie-detectors
from lie_detectors import get_probe
probe = get_probe("ai-safety-institute/apollo-meta-llama-llama-3.3-70b-instruct__aa-kto-defend_objects")
The default checkpoint is the best performer from the hyperparameter sweep (l_40_ar_dim.pt). To pick a specific checkpoint, pass filename=:
probe = get_probe("ai-safety-institute/apollo-meta-llama-llama-3.3-70b-instruct__aa-kto-defend_objects", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
See UKGovernmentBEIS/lie_detectors for the loading library.
Use sweep.json to see all 333 available checkpoints and their metrics.
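One way to browse the sweep is to load the JSON and rank the entries by a metric. The sketch below is only illustrative: the schema of sweep.json is not documented here, so the record layout (a list of objects with `filename` and `auroc` keys) and the helper `rank_checkpoints` are assumptions, and the sample records are made up for the demo.

```python
import json

def rank_checkpoints(records, metric, top_k=5):
    """Return the top_k records sorted by `metric`, highest first.

    Records that lack a numeric value for the metric are skipped.
    """
    scored = [r for r in records if isinstance(r.get(metric), (int, float))]
    return sorted(scored, key=lambda r: r[metric], reverse=True)[:top_k]

# Hypothetical stand-in for sweep.json; the real file's keys and values may differ.
example_sweep = json.loads("""
[
  {"filename": "checkpoint_a.pt", "auroc": 0.91},
  {"filename": "checkpoint_b.pt", "auroc": 0.97},
  {"filename": "checkpoint_c.pt"}
]
""")

best = rank_checkpoints(example_sweep, metric="auroc", top_k=2)
for record in best:
    print(record["filename"], record["auroc"])
```

Once the real key names are known, the chosen `filename` value can be passed to `get_probe(..., filename=...)` as shown in the Quick Start.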
Model Details
| Property | Value |
|---|---|
| Target model | meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_defend_objects |
| Safe name | meta-llama--Llama-3.3-70B-Instruct@auditing-agents--llama_70b_synth_docs_only_then_redteam_kto_defend_objects |
| Default checkpoint | l_40_ar_dim.pt |
| Available checkpoints | 333 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |
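
The calibration row can be read as choosing a score threshold that only about 1% of honest examples exceed. The following is a minimal sketch of that idea in NumPy, not the library's own calibration code; the function name `threshold_at_fpr` is hypothetical and the honest scores are synthetic stand-ins for probe scores on Alpaca.

```python
import numpy as np

def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Pick a threshold so that at most target_fpr of honest scores lie strictly above it."""
    scores = np.sort(np.asarray(honest_scores, dtype=float))
    if scores.size == 0:
        raise ValueError("need at least one honest score to calibrate")
    # Round the quantile index up so the realised FPR never exceeds the target.
    idx = int(np.ceil((1.0 - target_fpr) * (scores.size - 1)))
    return float(scores[idx])

# Synthetic honest-only scores (assumption: higher score means "more deceptive").
rng = np.random.default_rng(0)
honest_scores = rng.normal(loc=0.0, scale=1.0, size=10_000)

threshold = threshold_at_fpr(honest_scores, target_fpr=0.01)
realised_fpr = float(np.mean(honest_scores > threshold))
print(f"threshold={threshold:.3f}, realised FPR={realised_fpr:.4f}")
```

A response would then be flagged as deceptive when its probe score is above `threshold`.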
Training Data
Probes are trained on an instructed-pairs dataset, in which the model is instructed to be either deceptive or honest, based on Facts True False. The detection threshold is then calibrated on Alpaca (an honest-only baseline) to give a 1% false positive rate.
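To make the instructed-pairs setup concrete, here is a toy sketch of fitting a linear probe on honest versus deceptive activations. It uses a difference-of-means direction, chosen here for simplicity; it is not claimed to be the method used for these checkpoints. The activations are synthetic, and the names `fit_mean_difference_probe` and `probe_score` are hypothetical.

```python
import numpy as np

def fit_mean_difference_probe(honest_acts, deceptive_acts):
    """Fit a linear probe whose direction points from the honest mean to the deceptive mean."""
    honest_mean = honest_acts.mean(axis=0)
    deceptive_mean = deceptive_acts.mean(axis=0)
    direction = deceptive_mean - honest_mean
    direction = direction / np.linalg.norm(direction)
    # Bias places the midpoint between the two class means at score zero.
    bias = -0.5 * float((honest_mean + deceptive_mean) @ direction)
    return direction, bias

def probe_score(acts, direction, bias):
    """Higher scores mean the activation looks more like the deceptive class."""
    return acts @ direction + bias

# Synthetic stand-ins for residual stream activations (toy hidden size, not the real model's).
rng = np.random.default_rng(0)
hidden_size = 64
shift = np.zeros(hidden_size)
shift[0] = 3.0  # deceptive examples are offset along one coordinate
honest_acts = rng.normal(size=(500, hidden_size))
deceptive_acts = rng.normal(size=(500, hidden_size)) + shift

direction, bias = fit_mean_difference_probe(honest_acts, deceptive_acts)
correct = np.concatenate([
    probe_score(honest_acts, direction, bias) < 0,
    probe_score(deceptive_acts, direction, bias) > 0,
])
accuracy = float(correct.mean())
print(f"toy training accuracy: {accuracy:.3f}")
```

In practice the activations would come from a chosen layer of the target model on the paired honest and deceptive prompts, and the decision threshold would be set on a separate honest-only set rather than at zero.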
Citation
Original Paper
@misc{goldowskydill2025detectingstrategicdeceptionusing,
title={Detecting Strategic Deception Using Linear Probes},
author={Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn},
year={2025},
eprint={2502.03407},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.03407},
}
Trained Probes
@misc{cooney2026liedetectors,
title={``Did you lie?'' Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms},
author={Alan Cooney and David Africa and Geoffrey Irving},
year={2026},
month={May},
}