alancooneydsit committed on
Commit aaa45de · verified · 1 Parent(s): 1385607

Upload apollo probes for Qwen--Qwen3.6-27B@ai-safety-institute--Qwen3.6-27B-gender_secret_female

Files changed (1)
  1. README.md +15 -9
README.md CHANGED
@@ -9,22 +9,28 @@ license: mit
9
 
10
  # Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female
11
 
12
- A probe trained to detect deceptive behaviour in **Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female** using residual stream activations, following the methodology from [Detecting Strategic Deception in Language Models (Apollo Research, 2024)](https://arxiv.org/abs/2405.09758). We emphasise that we have **not found that these probes reliably classify deception**, and they may therefore be best suited to baselining other work.
13
 
14
  ## Quick Start
15
 
 
 
 
 
16
  ```python
17
- from deception import get_probe_hf # Will work when open sourced
18
 
19
- probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")
20
  ```
21
 
22
  The default checkpoint is the best performer from the hyperparameter sweep (`l_38_lm_500000_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:
23
 
24
  ```python
25
- probe = get_probe_hf("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
26
  ```
27
 
 
 
28
  Use `sweep.json` to see all 296 available checkpoints and their metrics.
29
 
30
  ## Model Details
@@ -60,10 +66,10 @@ Probes are trained on an **instructed pairs** dataset (model instructed to be de
60
  ### Trained Probes
61
 
62
  ```bibtex
63
- @misc{cooney2025deceptionprobes,
64
- title={Apollo-Style Deception Probes},
65
- author={Alan Cooney},
66
- year={2025},
67
- url={https://huggingface.co/collections/ai-safety-institute/apollo-style-deception-probes},
68
  }
69
  ```
 
9
 
10
  # Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female
11
 
12
+ A probe trained to detect deceptive behaviour in **Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female** using residual stream activations, following the methodology from [Detecting Strategic Deception in Language Models (Apollo Research, 2024)](https://arxiv.org/abs/2405.09758).
13
 
14
  ## Quick Start
15
 
16
+ ```bash
17
+ uv add lie-detectors # or: pip install lie-detectors
18
+ ```
19
+
20
  ```python
21
+ from lie_detectors import get_probe
22
 
23
+ probe = get_probe("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")
24
  ```
25
 
26
  The default checkpoint is the best performer from the hyperparameter sweep (`l_38_lm_500000_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:
27
 
28
  ```python
29
+ probe = get_probe("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
30
  ```
31
 
32
+ See [UKGovernmentBEIS/lie_detectors](https://github.com/UKGovernmentBEIS/lie_detectors) for the loading library.
33
+
34
  Use `sweep.json` to see all 296 available checkpoints and their metrics.
35
 
36
  ## Model Details
 
66
  ### Trained Probes
67
 
68
  ```bibtex
69
+ @misc{cooney2026liedetectors,
70
+ title={``Did you lie?'' Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms},
71
+ author={Alan Cooney and David Africa and Geoffrey Irving},
72
+ year={2026},
73
+ month={May},
74
  }
75
  ```
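
The checkpoint filenames quoted in the README (`l_38_lm_500000_ar_lr.pt`, `l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt`) appear to encode sweep settings. As a rough sketch for browsing the 296 checkpoints, the snippet below groups filenames by their leading `l_<N>` prefix. It assumes that prefix is the probe's layer index, which the README does not confirm. `group_by_layer` is a hypothetical helper and is not part of `lie_detectors`.

```python
import re
from collections import defaultdict

# Assumption (not confirmed by the README): a leading "l_<N>_" in a
# checkpoint filename is the layer index the probe was trained on.
_LAYER_PREFIX = re.compile(r"^l_(\d+)_")


def group_by_layer(filenames):
    """Group .pt checkpoint filenames by their (assumed) layer prefix.

    Files without a recognisable prefix are collected under the key None.
    """
    groups = defaultdict(list)
    for name in filenames:
        if not name.endswith(".pt"):
            continue  # skip non-checkpoint files such as sweep.json
        match = _LAYER_PREFIX.match(name)
        key = int(match.group(1)) if match else None
        groups[key].append(name)
    return dict(groups)


if __name__ == "__main__":
    # The two checkpoint names quoted in the README, plus a non-checkpoint file.
    names = [
        "l_38_lm_500000_ar_lr.pt",
        "l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt",
        "sweep.json",
    ]
    for layer, files in sorted(group_by_layer(names).items()):
        print(layer, files)
```

A filename chosen this way can be passed as `filename=` to `get_probe`, as in the Quick Start above. `sweep.json` remains the authoritative source for each checkpoint's metrics.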