--- language: en license: apache-2.0 tags: - activation-oracle - qwen3 - lora - peft - interpretability library_name: peft pipeline_tag: text-generation base_model: Qwen/Qwen3-8B --- # Best v3 + steering 2.0× Same recipe as [`ceselder/qwen3-8b-ao-v3-best`](https://huggingface.co/ceselder/qwen3-8b-ao-v3-best) (Sonnet conversational + concurrent multi-layer [21..25] + on-policy cot-v5 past_lens + lr=3e-5, 50M tokens), **with one change**: the post-injection residual norm is rescaled to 2.0× the original residual norm rather than the natural ~√2× (≈1.41×) that arises from the default norm-matched injection. This corresponds to the **`multi5_sonnet_norm2p0`** training tag in the project. ## AObench - Best v3 (4-seed mean): **+0.414** - Best v3 + steering 2.0× (single seed at upload time): **+0.437** (Δ = +0.023 ↑) A 3-seed mean with seeds {original, 7, 13} was in progress when the training box was decommissioned. This card will be updated if/when those additional seeds are run. ## What this is LoRA verbalizer trained as part of the v3 ablation ladder for the Activation Oracle (AO) project. The AO setup: given a target Qwen3-8B forward pass at certain layers/positions, we extract residual-stream activations and inject them (norm-matched) into a frozen Qwen3-8B's residual at a fixed hook layer. The verbalizer (this LoRA) is then trained to produce a natural-language description of what the captured activations represent. ## Files - `adapter_model.safetensors` — LoRA weights (rank/alpha/dropout in adapter_config.json) - `adapter_config.json` — PEFT config (target modules, rank, alpha) - `ao_config.json` — Activation Oracle config (layers, hook positions, hook_onto_layer, prefix template) ## Usage ```python from peft import PeftModel from transformers import AutoModelForCausalLM, AutoTokenizer base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B") tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B") model = PeftModel.from_pretrained(base, "ceselder/qwen3-8b-ao-v3-best-steering2p0") ``` To reproduce the inference-time steering 2.0× behaviour, set the environment variable `AO_FINAL_NORM_SCALE=2.0` when running the AO injection hook (see `nl_probes/utils/steering_hooks.py:get_hf_activation_steering_hook` in the project repo). ## Quirks worth knowing about - **First-position injection is an implicit training anchor.** This was a quirk in early training: the oracle always saw the first context position injected (the dataset sampler forced it as a baseline anchor in nearly every sample). Presumably this helps with grounding. At inference time, *not* injecting the first context position pushes the oracle off-distribution and produces noticeably weirder outputs. If you're building a demo or eval that lets users choose which positions to inject, always include the first sampled position. ## Collection This checkpoint is part of the **Qwen3-8B Activation Oracle v3 ablation ladder** collection.