ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female

alancooneydsit committed 19 days ago

Commit aaa45de (verified) · 1 parent: 1385607

Upload apollo probes for Qwen--Qwen3.6-27B@ai-safety-institute--Qwen3.6-27B-gender_secret_female

Files changed (1)

README.md +15 -9

@@ -9,22 +9,28 @@ license: mit
 # Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female
-A probe trained to detect deceptive behaviour in **Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female** using residual stream activations, following the methodology from [Detecting Strategic Deception in Language Models (Apollo Research, 2024)](https://arxiv.org/abs/2405.09758). We emphasise that we have **not found that these probes reliably classify deception**, and they may therefore be best suited to baselining other work.
 ## Quick Start
 ```python
-from deception import get_probe_hf  # Will work when open sourced
-probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")
 ```
 The default checkpoint is the best performer from the hyperparameter sweep (`l_38_lm_500000_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:
 ```python
-probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
 ```
 Use `sweep.json` to see all 296 available checkpoints and their metrics.
 ## Model Details
@@ -60,10 +66,10 @@ Probes are trained on an **instructed pairs** dataset (model instructed to be de
 ### Trained Probes
 ```bibtex
-@misc{cooney2025deceptionprobes,
-      title={Apollo-Style Deception Probes},
-      author={Alan Cooney},
-      year={2025},
-      url={https://huggingface.co/collections/ai-safety-institute/apollo-style-deception-probes},
 }
 ```

 # Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female
+A probe trained to detect deceptive behaviour in **Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female** using residual stream activations, following the methodology from [Detecting Strategic Deception in Language Models (Apollo Research, 2024)](https://arxiv.org/abs/2405.09758).
 ## Quick Start
+```bash
+uv add lie-detectors        # or: pip install lie-detectors
+```
 ```python
+from lie_detectors import get_probe
+probe = get_probe("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")
 ```
 The default checkpoint is the best performer from the hyperparameter sweep (`l_38_lm_500000_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:
 ```python
+probe = get_probe("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
 ```
+See [UKGovernmentBEIS/lie_detectors](https://github.com/UKGovernmentBEIS/lie_detectors) for the loading library.
 Use `sweep.json` to see all 296 available checkpoints and their metrics.
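
The README does not show the layout of `sweep.json`, so the following is only a minimal sketch for a first look at the file. It assumes that `sweep.json` sits at the root of this model repository and that `huggingface_hub` is installed; `fetch_sweep` and `describe` are hypothetical helper names, not part of `lie_detectors`. The summary deliberately avoids assuming any key names or metric fields.

```python
import json
from pathlib import Path

REPO_ID = (
    "ai-safety-institute/"
    "apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female"
)


def fetch_sweep(repo_id: str = REPO_ID) -> Path:
    """Download sweep.json from the Hub and return its local path (needs network).

    Assumption: the file is stored at the repository root under this exact name.
    """
    # Imported lazily so describe() below works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=repo_id, filename="sweep.json"))


def describe(data, max_keys: int = 5) -> str:
    """Summarise the top-level shape of parsed JSON without assuming a schema."""
    if isinstance(data, dict):
        keys = list(data)[:max_keys]
        return f"dict with {len(data)} keys, first: {keys}"
    if isinstance(data, list):
        kind = type(data[0]).__name__ if data else "n/a"
        return f"list with {len(data)} items, first item type: {kind}"
    return f"{type(data).__name__} value"


if __name__ == "__main__":
    sweep = json.loads(fetch_sweep().read_text())
    print(describe(sweep))
```

Once the actual structure is known, a checkpoint name taken from the file can be passed as `filename=` to `get_probe` as shown in the Quick Start.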
 ## Model Details
 ### Trained Probes
 ```bibtex
+@misc{cooney2026liedetectors,
+      title={``Did you lie?'' Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms},
+      author={Alan Cooney and David Africa and Geoffrey Irving},
+      year={2026},
+      month={May},
 }
 ```