nanoNLA — Qwen3-8B Activation Verbalizer (Layer 24, Karvonen injection)

The activation-verbalizer (AV) half of a Natural Language Autoencoder for Qwen3-8B, layer 24. Hand it a residual-stream activation vector and it describes, in natural language, what that activation represents.

This is the AV-SFT checkpoint — the supervised warm-start stage (before RL) — trained with the Karvonen norm-matched injection formula.


What this is

A Natural Language Autoencoder is a pair of fine-tuned LMs that map residual-stream activations to language and back:

| direction | model | mechanism |
|---|---|---|
| activation → text | AV (this model) | inject the vector at a marker token, autoregress an explanation |
| text → activation | AR | truncated K+1-layer copy of the base + Linear(d, d) readout (reconstruction) |

This repo hosts the AV for Qwen3-8B @ layer 24.

⚠️ This checkpoint uses the Karvonen injection formula

It was trained with the Karvonen norm-matched additive injection at the marker token (a layer-1 residual-stream hook):

h'_p = h_p + ‖h_p‖ · v / ‖v‖

where v is the layer-24 activation you want verbalized. Serve it with the same injection, not the paper's embedding-replacement default. In the nanoNLA code this is the Karvonen path; see docs/qwen3_8b_run.md. The marker token, injection/MSE scales, and d_model are all read from the run's nla_meta.yaml sidecar; never hardcode them.
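The formula above can be sketched in PyTorch as follows. This is a minimal illustration, not the nanoNLA implementation: the function name `karvonen_inject`, the `eps` guard against a zero-norm `v`, and the toy dimension are assumptions made for this example, and any extra injection scale from nla_meta.yaml is left out.

```python
import torch


def karvonen_inject(h_p: torch.Tensor, v: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Return h_p + ||h_p|| * v / ||v||, with norms over the last dimension.

    `karvonen_inject` and `eps` are illustrative names, not part of nanoNLA.
    """
    h_norm = h_p.norm(dim=-1, keepdim=True)
    v_unit = v / v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return h_p + h_norm * v_unit


torch.manual_seed(0)
d_model = 16                      # toy size; the real value comes from nla_meta.yaml
h_p = torch.randn(d_model)        # residual stream at the marker position
v = 50.0 * torch.randn(d_model)   # activation to verbalize, deliberately on a different scale

h_new = karvonen_inject(h_p, v)
delta = h_new - h_p               # the added term: direction of v, magnitude of h_p
```

Because `v` is rescaled to the norm of `h_p`, the size of the added term does not depend on the raw magnitude of the layer-24 activation, only on its direction.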

Usage

See docs/inference.md for the full harness (marker detection, neighbor-anchoring, the Karvonen hook). Sketch:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("ceselder/nanonla-l24-av-qwen3-8b", torch_dtype="bfloat16")
tok   = AutoTokenizer.from_pretrained("ceselder/nanonla-l24-av-qwen3-8b")
# 1. register the Karvonen layer-1 injection hook with your layer-24 activation `v`
# 2. put the marker token in the prompt (neighbor-anchored)
# 3. generate the explanation
# (full, correct hook + prompt template live in the nanoNLA repo)
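Step 1 above (registering the injection hook) can be illustrated on a toy stack of blocks without downloading the model. This is a hedged sketch: `ToyBlock`, `make_injection_hook`, the choice of `blocks[1]` as the hooked module, and the fixed `marker_pos` are all assumptions made for this example. The real hook, the module it attaches to, and the marker detection live in the nanoNLA repo.

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Stand-in for a decoder block: hidden states in, hidden states out."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.proj(hidden)


def make_injection_hook(v: torch.Tensor, marker_pos: int):
    """Build a forward hook that applies h' = h + ||h|| * v / ||v|| at marker_pos."""

    def hook(module, inputs, output):
        # Some decoder blocks return a tuple whose first element is the hidden state.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        h_p = hidden[:, marker_pos, :]
        v_cast = v.to(hidden.dtype)
        scale = h_p.norm(dim=-1, keepdim=True) / v_cast.norm().clamp_min(1e-8)
        hidden[:, marker_pos, :] = h_p + scale * v_cast
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook


torch.manual_seed(0)
d_model, seq_len, marker_pos = 16, 5, 2   # toy values, not the real config
blocks = nn.ModuleList([ToyBlock(d_model) for _ in range(3)])
x = torch.randn(1, seq_len, d_model)
v = torch.randn(d_model)                  # activation to verbalize

with torch.no_grad():
    h0 = blocks[0](x)
    clean = blocks[1](h0)
    handle = blocks[1].register_forward_hook(make_injection_hook(v, marker_pos))
    try:
        injected = blocks[1](h0)
    finally:
        handle.remove()                   # always detach the hook afterwards
    after_removal = blocks[1](h0)
```

Only the marker position changes, and removing the handle restores the clean output. With cached generation the hook fires on every forward pass, so a real harness would also need to restrict the injection to the pass that actually contains the marker position.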

Training

  • Base: Qwen/Qwen3-8B — fine-tuned; merged bf16 weights provided here
  • Layer: 24 (residual stream)
  • Stage: AV-SFT (warm-start), 1000 steps
  • Injection: Karvonen norm-matched additive (layer-1 hook)
  • Warm-start labels: explanations over qwen3-8b-nla-L24-finefineweb-100k
  • Peak LR: 3e-5 (cosine)
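
For orientation, the peak LR and step count listed above imply a schedule along these lines. This is only a sketch: the card states the peak LR, "cosine", and 1000 steps, while the absence of warmup and the decay to `min_lr = 0.0` are assumptions made here. The exact recipe is in docs/qwen3_8b_run.md.

```python
import math

PEAK_LR = 3e-5       # from the list above
TOTAL_STEPS = 1000   # from the list above


def cosine_lr(step: int, peak_lr: float = PEAK_LR,
              total_steps: int = TOTAL_STEPS, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    No warmup and min_lr = 0.0 are assumptions, not values stated in the card.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of the run.
schedule_samples = [cosine_lr(s) for s in (0, 500, 1000)]
```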

Exact recipe (rank/scales/hparams) and eval numbers are in the nanoNLA repo and docs/qwen3_8b_run.md.

Caveats

Research checkpoint, not a product. This is the SFT stage (pre-RL), so the AV produces plausible-but-imperfect explanations. Whether an NLA's explanations are faithful to the underlying computation is an open research question — see the paper.

Citation

@article{frasertaliente2026nla,
  title   = {Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations},
  author  = {Fraser-Taliente and others},
  journal = {Transformer Circuits Thread},
  year    = {2026},
  url     = {https://transformer-circuits.pub/2026/nla/index.html}
}