Best v3 + steering 2.0ร—

Same recipe as ceselder/qwen3-8b-ao-v3-best (Sonnet conversational + concurrent multi-layer [21..25] + on-policy cot-v5 past_lens + lr=3e-5, 50M tokens), with one change: the post-injection residual norm is rescaled to 2.0ร— the original residual norm rather than the natural ~โˆš2ร— (โ‰ˆ1.41ร—) that arises from the default norm-matched injection.

This corresponds to the multi5_sonnet_norm2p0 training tag in the project.

AObench

  • Best v3 (4-seed mean): +0.414
  • Best v3 + steering 2.0ร— (single seed at upload time): +0.437 (ฮ” = +0.023 โ†‘)

A 3-seed mean with seeds {original, 7, 13} was in progress when the training box was decommissioned. This card will be updated if/when those additional seeds are run.

What this is

LoRA verbalizer trained as part of the v3 ablation ladder for the Activation Oracle (AO) project.

The AO setup: given a target Qwen3-8B forward pass at certain layers/positions, we extract residual-stream activations and inject them (norm-matched) into a frozen Qwen3-8B's residual at a fixed hook layer. The verbalizer (this LoRA) is then trained to produce a natural-language description of what the captured activations represent.

Files

  • adapter_model.safetensors โ€” LoRA weights (rank/alpha/dropout in adapter_config.json)
  • adapter_config.json โ€” PEFT config (target modules, rank, alpha)
  • ao_config.json โ€” Activation Oracle config (layers, hook positions, hook_onto_layer, prefix template)

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base, "ceselder/qwen3-8b-ao-v3-best-steering2p0")

To reproduce the inference-time steering 2.0ร— behaviour, set the environment variable AO_FINAL_NORM_SCALE=2.0 when running the AO injection hook (see nl_probes/utils/steering_hooks.py:get_hf_activation_steering_hook in the project repo).

Quirks worth knowing about

  • First-position injection is an implicit training anchor. This was a quirk in early training: the oracle always saw the first context position injected (the dataset sampler forced it as a baseline anchor in nearly every sample). Presumably this helps with grounding. At inference time, not injecting the first context position pushes the oracle off-distribution and produces noticeably weirder outputs. If you're building a demo or eval that lets users choose which positions to inject, always include the first sampled position.

Collection

This checkpoint is part of the Qwen3-8B Activation Oracle v3 ablation ladder collection.

Downloads last month
74
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for ceselder/qwen3-8b-ao-v3-best-steering2p0

Finetuned
Qwen/Qwen3-8B
Adapter
(1448)
this model