---
language: en
license: apache-2.0
tags:
- activation-oracle
- qwen3
- lora
- peft
- interpretability
library_name: peft
pipeline_tag: text-generation
base_model: Qwen/Qwen3-8B
---

# Best v3 + steering 2.0×

Same recipe as [`ceselder/qwen3-8b-ao-v3-best`](https://huggingface.co/ceselder/qwen3-8b-ao-v3-best) (Sonnet conversational + concurrent multi-layer [21..25] + on-policy cot-v5 past_lens + lr=3e-5, 50M tokens), **with one change**: the post-injection residual norm is rescaled to 2.0× the original residual norm rather than the natural ~√2× (≈1.41×) that arises from the default norm-matched injection.
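To make the two scale factors concrete, here is a minimal sketch in plain Python. It assumes the default injection adds the captured activation after rescaling it to the residual's norm, and that the two vectors are roughly orthogonal (typical in high dimensions, and where the √2 comes from). The helper names `norm_matched_inject` and `rescale_to` are illustrative, not the project's API, and the real hook may differ in detail.

```python
import math
import random

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def norm_matched_inject(resid, act):
    """Add `act` to `resid` after rescaling `act` to the norm of `resid`.

    Assumption: this mirrors the "default norm-matched injection".
    """
    scale = norm(resid) / norm(act)
    return [r + scale * a for r, a in zip(resid, act)]

def rescale_to(vec, target_norm):
    """Rescale `vec` so that its norm equals `target_norm`."""
    s = target_norm / norm(vec)
    return [s * x for x in vec]

random.seed(0)
d = 4096  # illustrative dimension
resid = [random.gauss(0, 1) for _ in range(d)]
act = [random.gauss(0, 3) for _ in range(d)]  # different scale on purpose

# Default behaviour: the sum of two equal-norm, near-orthogonal vectors
# has a norm close to sqrt(2) times the original residual norm.
injected = norm_matched_inject(resid, act)
natural_ratio = norm(injected) / norm(resid)

# This checkpoint's change: rescale the post-injection residual to 2.0x.
steered = rescale_to(injected, 2.0 * norm(resid))
steered_ratio = norm(steered) / norm(resid)

print(f"natural ratio: {natural_ratio:.3f}, steered ratio: {steered_ratio:.3f}")
```

The natural ratio depends on the angle between the two vectors, so it only approximates √2; the steered ratio is 2.0 by construction.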

This corresponds to the **`multi5_sonnet_norm2p0`** training tag in the project.

## AObench

- Best v3 (4-seed mean): **+0.414**
- Best v3 + steering 2.0× (single seed at upload time): **+0.437** (Δ = +0.023 ↑)

A 3-seed mean with seeds {original, 7, 13} was in progress when the training box was decommissioned, so the Δ above compares a single seed against a 4-seed mean. This card will be updated if those additional seeds are run.

## What this is

This is a LoRA verbalizer trained as part of the v3 ablation ladder for the Activation Oracle (AO) project.

The AO setup: we run a target Qwen3-8B forward pass, extract residual-stream activations at selected layers and positions, and inject them (norm-matched) into the residual stream of a frozen Qwen3-8B at a fixed hook layer. The verbalizer (this LoRA) is then trained to produce a natural-language description of what the captured activations represent.

## Files

- `adapter_model.safetensors` — LoRA weights (rank/alpha/dropout in `adapter_config.json`)
- `adapter_config.json` — PEFT config (target modules, rank, alpha)
- `ao_config.json` — Activation Oracle config (layers, hook positions, hook_onto_layer, prefix template)
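The exact schema of `ao_config.json` is not documented on this card, so a cautious first step is to load the file and list what it contains rather than assume key names. The sketch below uses only the standard library. The local path and the helper name `summarize_config` are placeholders, and `hook_onto_layer` is the only key name taken from this card.

```python
import json
from pathlib import Path

def summarize_config(path):
    """Return {key: type name} for the top-level entries of a JSON config file."""
    cfg = json.loads(Path(path).read_text(encoding="utf-8"))
    return {key: type(value).__name__ for key, value in cfg.items()}

if __name__ == "__main__":
    cfg_path = Path("ao_config.json")  # placeholder: path to a downloaded copy
    if cfg_path.exists():
        for key, kind in summarize_config(cfg_path).items():
            print(f"{key}: {kind}")
```

A local copy can be fetched with `huggingface_hub.hf_hub_download(repo_id=..., filename="ao_config.json")`, which returns the path of the cached file.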

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

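# Load the frozen base model and its tokenizer.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Attach the LoRA verbalizer on top of the base model.
model = PeftModel.from_pretrained(base, "ceselder/qwen3-8b-ao-v3-best-steering2p0")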
```

To reproduce the inference-time steering 2.0× behaviour, set the environment variable `AO_FINAL_NORM_SCALE=2.0` when running the AO injection hook (see `nl_probes/utils/steering_hooks.py:get_hf_activation_steering_hook` in the project repo).
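One way to set the variable from Python is sketched below. This card does not say whether the hook reads `AO_FINAL_NORM_SCALE` at import time or on each call, so the sketch sets it before any project module is imported, which works in either case. The helper `read_norm_scale` and its `None` default are hypothetical and only illustrate parsing; they are not the project's code.

```python
import os

# Set this before importing anything from nl_probes, so the hook sees the
# value regardless of when it reads the environment.
os.environ["AO_FINAL_NORM_SCALE"] = "2.0"

def read_norm_scale(default=None):
    """Hypothetical helper: parse AO_FINAL_NORM_SCALE as a float, else return `default`."""
    raw = os.environ.get("AO_FINAL_NORM_SCALE")
    return float(raw) if raw else default

scale = read_norm_scale()
print(f"final norm scale: {scale}")
```

From a shell, the equivalent is to prefix the command, for example `AO_FINAL_NORM_SCALE=2.0 python your_script.py`, where `your_script.py` stands in for whatever script runs the injection hook.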

## Quirks worth knowing about

- **First-position injection is an implicit training anchor.** This is a quirk from early training: the dataset sampler forced the first context position into nearly every sample as a baseline anchor, so the oracle almost always saw that position injected. This presumably helps with grounding. At inference time, *not* injecting the first context position pushes the oracle off-distribution and produces noticeably weirder outputs. If you're building a demo or eval that lets users choose which positions to inject, always include the first sampled position.
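For a demo that exposes position selection, a small guard can enforce the first-position rule. The sketch below is a hypothetical helper, not part of the project code: it takes the positions a user picked plus the ordered list of sampled context positions, and returns a selection that always contains the first sampled position.

```python
def with_anchor_position(selected, sampled_positions):
    """Return the sorted, de-duplicated selection, always including the anchor.

    `sampled_positions` is the ordered list of candidate context positions;
    its first element is treated as the training anchor.
    """
    if not sampled_positions:
        raise ValueError("sampled_positions must not be empty")
    allowed = set(sampled_positions)
    unknown = [p for p in selected if p not in allowed]
    if unknown:
        raise ValueError(f"positions not in the sampled set: {unknown}")
    return sorted(set(selected) | {sampled_positions[0]})

# The user picked 12 and 30; position 3 (the first sampled one) is added.
positions = with_anchor_position([12, 30], sampled_positions=[3, 12, 30, 41])
print(positions)
```

Rejecting unknown positions rather than silently dropping them keeps a UI bug from quietly changing which activations are injected.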

## Collection

This checkpoint is part of the **Qwen3-8B Activation Oracle v3 ablation ladder** collection.