Upload apollo probes for Qwen--Qwen3.6-27B@ai-safety-institute--Qwen3.6-27B-gender_secret_female
README.md
CHANGED
@@ -9,22 +9,28 @@ license: mit
 
 # Apollo-Style Deception Probe for Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female
 
-A probe trained to detect deceptive behaviour in **Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female** using residual stream activations, following the methodology from [Detecting Strategic Deception in Language Models (Apollo Research, 2024)](https://arxiv.org/abs/2405.09758).
+A probe trained to detect deceptive behaviour in **Qwen/Qwen3.6-27B:ai-safety-institute/Qwen3.6-27B-gender_secret_female** using residual stream activations, following the methodology from [Detecting Strategic Deception in Language Models (Apollo Research, 2024)](https://arxiv.org/abs/2405.09758).
 
 ## Quick Start
 
+```bash
+uv add lie-detectors # or: pip install lie-detectors
+```
+
 ```python
-from
+from lie_detectors import get_probe
 
-probe =
+probe = get_probe("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female")
 ```
 
 The default checkpoint is the best performer from the hyperparameter sweep (`l_38_lm_500000_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:
 
 ```python
-probe =
+probe = get_probe("ai-safety-institute/apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
 ```
 
+See [UKGovernmentBEIS/lie_detectors](https://github.com/UKGovernmentBEIS/lie_detectors) for the loading library.
+
 Use `sweep.json` to see all 296 available checkpoints and their metrics.
 
 ## Model Details
@@ -60,10 +66,10 @@ Probes are trained on an **instructed pairs** dataset (model instructed to be de
 ### Trained Probes
 
 ```bibtex
-@misc{
-title={
-author={Alan Cooney},
-year={
-
+@misc{cooney2026liedetectors,
+title={``Did you lie?'' Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms},
+author={Alan Cooney and David Africa and Geoffrey Irving},
+year={2026},
+month={May},
 }
 ```
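The README in this diff points readers to `sweep.json` for the list of checkpoints and their metrics, but does not show its layout. A minimal sketch for inspecting the file without guessing its schema is given below. It assumes `sweep.json` sits at the root of the probe repository named in the Quick Start snippet and that the repository is a regular model repo. `describe` and `fetch_sweep` are hypothetical helper names, not part of `lie_detectors`.

```python
import json
from pathlib import Path

# Repo id copied from the Quick Start snippet in the README diff.
REPO_ID = (
    "ai-safety-institute/"
    "apollo-qwen-qwen3.6-27b__ai-safety-institute-qwen3.6-27b-gender_secret_female"
)


def describe(obj, max_keys=5):
    """Summarise a parsed JSON value without assuming any schema."""
    if isinstance(obj, dict):
        keys = list(obj)[:max_keys]
        return f"dict with {len(obj)} keys, e.g. {keys}"
    if isinstance(obj, list):
        first = type(obj[0]).__name__ if obj else "n/a"
        return f"list with {len(obj)} items, first item type: {first}"
    return f"{type(obj).__name__}: {obj!r}"


def fetch_sweep(repo_id=REPO_ID):
    """Download and parse sweep.json (needs network access).

    Assumption: the file is at the repo root of a model repo.
    """
    # Imported lazily so describe() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=repo_id, filename="sweep.json")
    return json.loads(Path(path).read_text())
```

Calling `print(describe(fetch_sweep()))` shows whether the file is a list of per-checkpoint records or a mapping keyed by filename, which is worth checking before writing any code that selects a checkpoint by metric.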