Best v3 + steering 2.0×

Same recipe as ceselder/qwen3-8b-ao-v3-best (Sonnet conversational + concurrent multi-layer [21..25] + on-policy cot-v5 past_lens + lr=3e-5, 50M tokens), with one change: the post-injection residual norm is rescaled to 2.0× the original residual norm rather than the natural ~√2× (≈1.41×) that arises from the default norm-matched injection.

This corresponds to the multi5_sonnet_norm2p0 training tag in the project.

AObench

Best v3 (4-seed mean): +0.414
Best v3 + steering 2.0× (single seed at upload time): +0.437 (Δ = +0.023 ↑)

A 3-seed mean with seeds {original, 7, 13} was in progress when the training box was decommissioned. This card will be updated if/when those additional seeds are run.

What this is

LoRA verbalizer trained as part of the v3 ablation ladder for the Activation Oracle (AO) project.

The AO setup: given a target Qwen3-8B forward pass at certain layers/positions, we extract residual-stream activations and inject them (norm-matched) into a frozen Qwen3-8B's residual at a fixed hook layer. The verbalizer (this LoRA) is then trained to produce a natural-language description of what the captured activations represent.

Files

adapter_model.safetensors — LoRA weights (rank/alpha/dropout in adapter_config.json)
adapter_config.json — PEFT config (target modules, rank, alpha)
ao_config.json — Activation Oracle config (layers, hook positions, hook_onto_layer, prefix template)

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base, "ceselder/qwen3-8b-ao-v3-best-steering2p0")

To reproduce the inference-time steering 2.0× behaviour, set the environment variable AO_FINAL_NORM_SCALE=2.0 when running the AO injection hook (see nl_probes/utils/steering_hooks.py:get_hf_activation_steering_hook in the project repo).

Quirks worth knowing about

First-position injection is an implicit training anchor. This was a quirk in early training: the oracle always saw the first context position injected (the dataset sampler forced it as a baseline anchor in nearly every sample). Presumably this helps with grounding. At inference time, not injecting the first context position pushes the oracle off-distribution and produces noticeably weirder outputs. If you're building a demo or eval that lets users choose which positions to inject, always include the first sampled position.

Collection

This checkpoint is part of the Qwen3-8B Activation Oracle v3 ablation ladder collection.

Downloads last month: 74

Model tree for ceselder/qwen3-8b-ao-v3-best-steering2p0

Base model

Qwen/Qwen3-8B-Base

Finetuned

Qwen/Qwen3-8B

Adapter

(1448)

this model