---
license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
  - activation-oracle
  - chain-of-thought
  - interpretability
  - lora
  - peft
datasets:
  - ceselder/qwen3-8b-math-cot-corpus
library_name: peft
---

# CoT Oracle v2 — Qwen3-8B (Experimental)

An activation oracle fine-tuned to analyze chain-of-thought reasoning traces by reading internal activations. This is an **early experimental checkpoint** with known data quality issues — see below.

## What This Is

This is a LoRA adapter for [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) trained to read the model's own internal activations at CoT sentence boundaries and answer questions about the reasoning process. It builds on the [Activation Oracles](https://arxiv.org/abs/2512.15674) framework by Karvonen et al.

The oracle is initialized from the pre-trained AO checkpoint ([adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B)) and further fine-tuned on CoT-specific tasks.

## Training Details

- **Base model:** Qwen/Qwen3-8B (36 layers)
- **Starting point:** Pre-trained AO checkpoint (context prediction + classification tasks)
- **Training data:** ~100K examples from 200 math problems (100 MATH-500 + 100 GSM8K)
  - 45K context prediction (PastLens-style)
  - 15K sentence importance classification
  - 15K sentence taxonomy classification
  - 10K answer tracking (logit lens)
  - 15K reasoning summary
- **Hyperparameters:** lr=1e-5, batch_size=16, 1 epoch (6,218 steps), gradient checkpointing
- **LoRA config:** rank 64, alpha 128, dropout 0.05, all-linear
- **Hardware:** 1x H100 80GB, bf16, ~1.5 hours
- **wandb:** [cot_oracle/runs/ejp28bev](https://wandb.ai/celestedeschamphelaere-personal/cot_oracle/runs/ejp28bev)
- **Corpus:** [ceselder/qwen3-8b-math-cot-corpus](https://huggingface.co/datasets/ceselder/qwen3-8b-math-cot-corpus)

## Results (Exact String Match, 100 eval items per task)

| Step | context_pred | importance | taxonomy | answer_track | summary |
|------|-------------|------------|----------|--------------|---------|
| 0 | 11% | 0% | 0% | 0% | 0% |
| 1000 | 9% | 20% | 52% | 0% | 100% |
| 2000 | 15% | 48% | 60% | 0% | 100% |
| 3000 | 14% | 47% | 60% | 0% | 100% |
| 4000 | 13% | 47% | 65% | 0% | 100% |
| 4500 | 12% | 48% | **72%** | 0% | 100% |
| 5000 | 14% | 48% | 64% | 0% | 100% |
| 6000 | 15% | 48% | 67% | 0% | 100% |
| final | 15% | 48% | 66% | 0% | 100% |

**Taxonomy** (8-class sentence type classification) is the strongest result, at 64-72% accuracy from step 4000 onward and 66% at the final checkpoint (random baseline: 12.5%).

**Importance** plateaued at ~48% (4-class, random baseline: 25%).
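
With only 100 eval items per task, each percentage in the table carries sizeable sampling uncertainty. The sketch below estimates that uncertainty with a Wilson score interval. It assumes each eval item is an independent pass/fail trial, which this card does not confirm, and `wilson_interval` is a hypothetical helper rather than part of AO's eval framework.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Taxonomy accuracy at step 4500 (72/100) and step 5000 (64/100), from the table above.
for step, correct in [(4500, 72), (5000, 64)]:
    lo, hi = wilson_interval(correct, 100)
    print(f"step {step}: {correct}% (interval {lo:.1%} to {hi:.1%})")
```

Under these assumptions the two intervals overlap, so differences of a few points between neighbouring checkpoints may reflect sampling noise rather than a real change.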

## Known Issues — Read Before Using

This is an honest accounting of what went wrong. We're publishing this for transparency and to save others from the same mistakes.

### 1. Summary labels are useless (100% = memorized garbage)
All 200 summary labels were identical: "The model performed step-by-step computation to arrive at the answer." The LLM-based label generation fell back to a generic template for every problem. The model memorized this single string perfectly, giving a misleading 100% accuracy. **This task provides zero signal.**
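
A degenerate label set like this can be caught before training by measuring how much of the data the single most common label covers. The following is a minimal sketch, not the pipeline used here; `majority_baseline` is a hypothetical helper and the label list simply mirrors the failure described above.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, share of examples that carry it)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Illustrative data: 200 identical summaries, as in the issue above.
summary_labels = [
    "The model performed step-by-step computation to arrive at the answer."
] * 200

label, share = majority_baseline(summary_labels)
print(f"Most common label covers {share:.0%} of examples")
if share > 0.9:  # the 0.9 cutoff is an arbitrary illustrative choice
    print("Warning: labels are near-constant; accuracy on this task is uninformative")
```

The returned share is also the accuracy a model gets by always emitting the most common label, which makes it a useful baseline to report next to any exact-match score.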

### 2. Importance labels are badly skewed
The importance labels used a fixed KL divergence threshold (>0.1 = "important"), but 99.7% of sentences exceeded it, so the model was trained on a near-constant label. The 48% accuracy on the 4-class eval is hard to interpret because the training labels don't match the eval labels: the eval uses within-problem percentile ranking, which was implemented after training started. **Fixed for v3.**
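
The difference between a fixed threshold and within-problem percentile ranking can be sketched as follows. This is an illustration under assumptions, not the v3 implementation: the KL values are made up, `percentile_tiers` is a hypothetical helper, and the tiers are plain integers because the card does not name them.

```python
def percentile_tiers(scores, n_tiers=4):
    """Assign each score a tier in 0..n_tiers-1 by its rank within one problem."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices from lowest to highest score
    tiers = [0] * n
    for rank, idx in enumerate(order):
        tiers[idx] = rank * n_tiers // n  # equal-sized buckets by rank
    return tiers

# Made-up per-sentence KL divergences for a single problem.
kl = [0.15, 2.3, 0.4, 0.9, 1.7, 0.2, 3.1, 0.6]

fixed = [k > 0.1 for k in kl]   # fixed threshold: every sentence is "important"
tiers = percentile_tiers(kl)    # rank-based: two sentences per tier

print(fixed)
print(tiers)
```

Because the tiers depend only on rank within the problem, the label distribution stays balanced regardless of the absolute scale of the KL values.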

### 3. Answer tracking never worked (0% accuracy)
The target format includes exact probability values (e.g., "step 5: P(answer)=0.73"), which the oracle would have to reproduce digit for digit to pass exact string comparison. The task concept is sound, but the format needs simplification. **Needs redesign.**
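
One possible alternative to exact matching is to parse the step and probability out of both strings and compare them within a tolerance. The sketch below assumes the target format shown in the example above; the regular expression, the `tol` value, and both function names are hypothetical, and this is not the redesign planned for v3.

```python
import re

# Matches fragments such as "step 5: P(answer)=0.73".
_PATTERN = re.compile(r"step\s+(\d+):\s*P\(answer\)\s*=\s*([0-9]*\.?[0-9]+)")

def parse_track(text):
    """Extract a {step: probability} mapping from a tracking string."""
    return {int(step): float(prob) for step, prob in _PATTERN.findall(text)}

def track_score(pred, target, tol=0.1):
    """Fraction of target steps whose predicted probability is within tol."""
    want = parse_track(target)
    if not want:
        return 0.0
    got = parse_track(pred)
    hits = sum(
        1 for step, prob in want.items()
        if step in got and abs(got[step] - prob) <= tol
    )
    return hits / len(want)

target = "step 5: P(answer)=0.73"
pred = "step 5: P(answer)=0.70"

print(pred == target)             # exact match fails on a 0.03 difference
print(track_score(pred, target))  # tolerant score accepts it
```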

### 4. Context prediction barely improved over baseline
Context prediction went from 11% to 15% over training. The pre-trained AO checkpoint already handles this task, so limited improvement is expected. The small corpus (200 problems) may also limit further gains.

### 5. Unfaithfulness detection doesn't work
When tested on authority bias and hint-following eval sets, the oracle gives **the same response for every item** regardless of whether the model was actually influenced. It reads the oracle prompt (which contains answer options) rather than the activations. This is expected — the training data contains no unfaithfulness-specific tasks. **The oracle cannot detect unfaithful reasoning in its current form.**

### 6. Eval uses exact string match
AO's eval framework uses exact string match, which is overly strict for open-ended responses. A label such as "active_computation" would fail if the model outputs "Active computation" or "active computation step". Actual model capability may be somewhat higher than the reported numbers suggest.
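
A more lenient scorer could normalize the output before comparing it to the label set. The sketch below is one assumed approach, not AO's eval code: `normalize` and `match_label` are hypothetical helpers, and `TAXONOMY` lists only the three class names this card mentions, not all eight.

```python
import re

# Partial label set: only the class names mentioned in this card.
TAXONOMY = ["problem_setup", "active_computation", "self_checking"]

def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics into one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

def match_label(output, labels):
    """Return the matching label, or None if there is no unambiguous match."""
    norm = normalize(output)
    if norm in labels:
        return norm
    # Accept trailing words, e.g. "active computation step".
    hits = [label for label in labels if norm.startswith(label + "_")]
    return hits[0] if len(hits) == 1 else None

for output in ["active_computation", "Active computation", "active computation step"]:
    print(repr(output), "->", match_label(output, TAXONOMY))
```

The prefix rule is deliberately lenient and would also credit outputs that start with a valid label and then ramble, so a stricter variant may be preferable when reporting headline numbers.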

## What Works

- **Taxonomy classification genuinely works.** Given an activation at a CoT sentence boundary, the oracle can identify whether it's problem_setup, active_computation, self_checking, etc. at ~65-72% accuracy (5-6x random baseline).
- **The activation oracle framework works.** LoRA injection, norm-matched addition at layer 1, on-the-fly activation collection — the plumbing is solid.
- **Data shuffling matters.** v1 (unshuffled) showed fake "grokking" as the model encountered each task sequentially. v2 (shuffled) learns all tasks simultaneously without catastrophic forgetting.
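
The shuffling point above can be illustrated with a small sketch that pools examples from every task and shuffles them with a fixed seed. The task names are taken from the results table, while the example payloads and the `interleave_tasks` helper are hypothetical and do not reproduce the actual data loader.

```python
import random

def interleave_tasks(task_examples, seed=0):
    """Pool (task, example) pairs from all tasks and shuffle them reproducibly."""
    mixed = [(task, ex) for task, examples in task_examples.items() for ex in examples]
    random.Random(seed).shuffle(mixed)  # local RNG, so global random state is untouched
    return mixed

# Placeholder examples; real items would carry activations and target strings.
task_examples = {
    "context_pred": [f"ctx_{i}" for i in range(4)],
    "importance": [f"imp_{i}" for i in range(4)],
    "taxonomy": [f"tax_{i}" for i in range(4)],
}

mixed = interleave_tasks(task_examples)
print([task for task, _ in mixed])
```

Without the shuffle, the pooled list presents each task as one contiguous block, which matches the sequential ordering described for v1.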

## Checkpoints

This repo contains multiple checkpoints:
- `step_1000/` through `step_6000/` — saved every 1000 steps
- `final/` — end of training

Peak taxonomy performance is at step 4500 (72%), but we only saved at step_4000 (65%) and step_5000 (64%). Step 5000 is recommended as a reasonable all-round checkpoint.

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-8b-v2", subfolder="step_5000")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Base-model inference: run the forward pass inside `with model.disable_adapter():`
# Oracle inference: the adapter loads under the name "default"; model.set_adapter("default") selects it
# See: https://github.com/ceselder/cot-oracle for full usage
```

## What's Next (v3)

- Fixed importance labels (within-problem percentile ranking, 4 balanced tiers)
- Synthesized unique summary labels per problem (from importance + taxonomy data)
- Unfaithfulness-specific training tasks
- Fuzzy eval scoring instead of exact string match
- Larger corpus (>200 problems)

## Citation

This work builds on:
- [Activation Oracles](https://arxiv.org/abs/2512.15674) (Karvonen et al., 2025)
- [Thought Anchors](https://arxiv.org/abs/2506.19143) (Bogdan et al., 2025)
- [Thought Branches](https://arxiv.org/abs/2510.27484) (Macar, Bogdan et al., 2025)