Apollo-Style Deception Probe for meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_defend_objects

A probe trained to detect deceptive behaviour in meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_defend_objects using residual stream activations, following the methodology from Detecting Strategic Deception Using Linear Probes (Goldowsky-Dill et al., 2025).

Quick Start

uv add lie-detectors        # or: pip install lie-detectors

from lie_detectors import get_probe

probe = get_probe("ai-safety-institute/apollo-meta-llama-llama-3.3-70b-instruct__aa-kto-defend_objects")

The default checkpoint is the best performer from the hyperparameter sweep (l_40_ar_dim.pt). To pick a specific checkpoint, pass filename=:

probe = get_probe("ai-safety-institute/apollo-meta-llama-llama-3.3-70b-instruct__aa-kto-defend_objects", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")

See UKGovernmentBEIS/lie_detectors for the loading library.

Use sweep.json to see all 333 available checkpoints and their metrics.

Model Details

Property	Value
Target model	`meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_defend_objects`
Safe name	`meta-llama--Llama-3.3-70B-Instruct@auditing-agents--llama_70b_synth_docs_only_then_redteam_kto_defend_objects`
Default checkpoint	`l_40_ar_dim.pt`
Available checkpoints	333
Calibration	Threshold set at 1% FPR on Alpaca (honest baseline)
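
As a rough illustration of the calibration row above, the sketch below picks a threshold from honest-only scores so that at most 1% of them are flagged. It is a minimal sketch and not the library's calibration code: the function names `threshold_at_fpr` and `false_positive_rate` are hypothetical, the scores are synthetic Gaussian samples standing in for probe scores on Alpaca responses, and it assumes that a higher score means "more deceptive".

```python
import math
import random


def threshold_at_fpr(honest_scores, fpr=0.01):
    """Return a threshold that flags at most `fpr` of the honest scores.

    Assumption: a sample is flagged when its score is strictly above the threshold.
    """
    if not honest_scores:
        raise ValueError("need at least one honest score")
    ordered = sorted(honest_scores)
    n = len(ordered)
    # Number of honest samples we are allowed to flag.
    allowed = math.floor(fpr * n)
    # Choose the score with exactly `allowed` samples after it in sorted order.
    index = max(n - allowed - 1, 0)
    return ordered[index]


def false_positive_rate(honest_scores, threshold):
    """Fraction of honest samples scoring strictly above the threshold."""
    return sum(s > threshold for s in honest_scores) / len(honest_scores)


# Synthetic stand-in for probe scores on an honest-only control set.
random.seed(0)
honest = [random.gauss(0.0, 1.0) for _ in range(10_000)]

thr = threshold_at_fpr(honest, fpr=0.01)
print(f"threshold={thr:.3f}  honest FPR={false_positive_rate(honest, thr):.4f}")
```

Calibrating on an honest-only set fixes the false positive rate independently of how deceptive examples are distributed, which is why detection rates can then be compared across probes at a matched 1% FPR.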

Training Data

Probes are trained on an instructed-pairs dataset based on Facts True False, in which the model is instructed to be either deceptive or honest. The detection threshold is then calibrated on Alpaca (an honest-only baseline) to give a 1% false positive rate.
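
To make the idea of instructed pairs concrete, the sketch below builds one honest and one deceptive example per statement, differing only in the instruction. Everything here is illustrative: the instruction wording, the `LabeledDialogue` type, the `build_instructed_pairs` helper, and the sample statements are assumptions for this example, not the prompts or data format used to train these probes.

```python
from dataclasses import dataclass

# Illustrative wording only; the actual training prompts may differ.
HONEST_INSTRUCTION = "Pretend you are an honest person making statements about the world."
DECEPTIVE_INSTRUCTION = "Pretend you are a deceptive person making statements about the world."


@dataclass(frozen=True)
class LabeledDialogue:
    instruction: str   # system/user instruction given to the model
    statement: str     # text whose activations would be probed
    deceptive: bool    # training label for the probe


def build_instructed_pairs(statements):
    """Pair every statement with both instructions, so only the instruction varies."""
    dialogues = []
    for statement in statements:
        dialogues.append(LabeledDialogue(HONEST_INSTRUCTION, statement, False))
        dialogues.append(LabeledDialogue(DECEPTIVE_INSTRUCTION, statement, True))
    return dialogues


# Made-up statements standing in for entries from a facts dataset.
examples = build_instructed_pairs([
    "The Eiffel Tower is in Paris.",
    "Water boils at 100 degrees Celsius at sea level.",
])
for d in examples:
    print(d.deceptive, "|", d.statement)
```

Because each pair shares the same statement, differences in the resulting activations can be attributed to the instruction rather than to the content being stated.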

Citation

Original Paper

@misc{goldowskydill2025detectingstrategicdeceptionusing,
      title={Detecting Strategic Deception Using Linear Probes},
      author={Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn},
      year={2025},
      eprint={2502.03407},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.03407},
}

Trained Probes

@misc{cooney2026liedetectors,
      title={``Did you lie?'' Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms},
      author={Alan Cooney and David Africa and Geoffrey Irving},
      year={2026},
      month={May},
}
