ceselder/qwen3-8b-cot-oracle

+---
+license: apache-2.0
+base_model: Qwen/Qwen3-8B
+tags:
+  - activation-oracle
+  - chain-of-thought
+  - interpretability
+  - lora
+  - peft
+datasets:
+  - ceselder/qwen3-8b-math-cot-corpus
+library_name: peft
+---
+# CoT Oracle v2 — Qwen3-8B (Experimental)
+An activation oracle fine-tuned to analyze chain-of-thought reasoning traces by reading internal activations. This is an **early experimental checkpoint** with known data quality issues — see below.
+## What This Is
+This is a LoRA adapter for [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) trained to read the model's own internal activations at CoT sentence boundaries and answer questions about the reasoning process. It builds on the [Activation Oracles](https://arxiv.org/abs/2512.15674) framework by Karvonen et al.
+The oracle is initialized from the pre-trained AO checkpoint ([adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B)) and further fine-tuned on CoT-specific tasks.
+## Training Details
+- **Base model:** Qwen/Qwen3-8B (36 layers)
+- **Starting point:** Pre-trained AO checkpoint (context prediction + classification tasks)
+- **Training data:** ~100K examples from 200 math problems (100 MATH-500 + 100 GSM8K)
+  - 45K context prediction (PastLens-style)
+  - 15K sentence importance classification
+  - 15K sentence taxonomy classification
+  - 10K answer tracking (logit lens)
+  - 15K reasoning summary
+- **Hyperparameters:** lr=1e-5, batch_size=16, 1 epoch (6,218 steps), gradient checkpointing
+- **LoRA config:** rank 64, alpha 128, dropout 0.05, all-linear
+- **Hardware:** 1x H100 80GB, bf16, ~1.5 hours
+- **wandb:** [cot_oracle/runs/ejp28bev](https://wandb.ai/celestedeschamphelaere-personal/cot_oracle/runs/ejp28bev)
+- **Corpus:** [ceselder/qwen3-8b-math-cot-corpus](https://huggingface.co/datasets/ceselder/qwen3-8b-math-cot-corpus)
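As a quick sanity check on the figures above, the sketch below adds up the per-task example counts and relates them to the reported step count. The variable names are illustrative, and it assumes one optimizer step per batch of 16 with no gradient accumulation, which the card does not state.

```python
# Per-task example counts as listed in the training details.
task_counts = {
    "context_pred": 45_000,
    "importance": 15_000,
    "taxonomy": 15_000,
    "answer_track": 10_000,
    "summary": 15_000,
}
total_examples = sum(task_counts.values())

batch_size = 16
reported_steps = 6_218

# Assumption: each step consumes exactly one batch (no gradient accumulation).
examples_seen = reported_steps * batch_size
nominal_steps = total_examples // batch_size

print(total_examples, examples_seen, nominal_steps)
```

Under that assumption, 6,218 steps correspond to 99,488 examples, which is consistent with the "~100K" figure being approximate.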
+## Results (Exact String Match, 100 eval items per task)
+| Step | context_pred | importance | taxonomy | answer_track | summary |
+|------|-------------|------------|----------|--------------|---------|
+| 0 | 11% | 0% | 0% | 0% | 0% |
+| 1000 | 9% | 20% | 52% | 0% | 100% |
+| 2000 | 15% | 48% | 60% | 0% | 100% |
+| 3000 | 14% | 47% | 60% | 0% | 100% |
+| 4000 | 13% | 47% | 65% | 0% | 100% |
+| 4500 | 12% | 48% | **72%** | 0% | 100% |
+| 5000 | 14% | 48% | 64% | 0% | 100% |
+| 6000 | 15% | 48% | 67% | 0% | 100% |
+| final | 15% | 48% | 66% | 0% | 100% |
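+**Taxonomy** (8-class sentence type classification) is the strongest result: 64-72% accuracy from step 4000 onward and 66% at the final checkpoint (random baseline: 12.5%).
+**Importance** plateaued at ~48% from step 2000 onward (4-class, random baseline: 25%), but see Known Issue 2 before interpreting this number.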
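To put these numbers in context, the sketch below computes the random baselines, the lift over chance, and a rough sampling error for a 100-item eval. It is an illustration rather than part of the eval code: the helper names are made up, and the standard error treats the eval items as independent draws.

```python
import math

def random_baseline(num_classes: int) -> float:
    """Accuracy of uniform random guessing over num_classes labels."""
    return 1.0 / num_classes

def standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy estimate from n_items samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)

taxonomy_baseline = random_baseline(8)    # 8-class task
importance_baseline = random_baseline(4)  # 4-class task

# Lift of the peak (72%) and final (66%) taxonomy accuracy over chance.
peak_lift = 0.72 / taxonomy_baseline
final_lift = 0.66 / taxonomy_baseline

# Sampling error of a 72% accuracy measured on 100 eval items.
se_peak = standard_error(0.72, 100)

print(round(peak_lift, 2), round(final_lift, 2), round(se_peak, 3))
```

With only 100 items per task, one standard error at 72% is about 4.5 percentage points, so the gap between the step-4500 peak and the neighbouring checkpoints (64-67%) may partly reflect sampling noise.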
+## Known Issues — Read Before Using
+This is an honest accounting of what went wrong. We're publishing this for transparency and to save others from the same mistakes.
+### 1. Summary labels are useless (100% = memorized garbage)
+All 200 summary labels were identical: "The model performed step-by-step computation to arrive at the answer." The LLM-based label generation fell back to a generic template for every problem. The model memorized this single string perfectly, giving a misleading 100% accuracy. **This task provides zero signal.**
+### 2. Importance labels are badly skewed
+The importance labels used a fixed KL divergence threshold (>0.1 = "important"), but 99.7% of sentences exceeded it, so the model was essentially trained on a near-constant label. The 48% accuracy on the 4-class eval is hard to interpret because the training labels don't match the eval labels: the eval uses within-problem percentile ranking, which was implemented after training started. **Fixed for v3.**
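The difference between the two labeling schemes can be sketched as follows. This is a minimal illustration, not the project's labeling code: the function names, the "unimportant" label, the integer tier encoding, and the KL values are all made up for the example.

```python
def fixed_threshold_labels(kl_values, threshold=0.1):
    """Label a sentence "important" when its KL divergence exceeds a fixed threshold."""
    return ["important" if kl > threshold else "unimportant" for kl in kl_values]

def percentile_tier_labels(kl_values, num_tiers=4):
    """Rank sentences within one problem and split the ranks into equal-size tiers.

    Tier 0 holds the lowest-KL sentences, tier num_tiers - 1 the highest.
    """
    n = len(kl_values)
    order = sorted(range(n), key=lambda i: kl_values[i])
    tiers = [0] * n
    for rank, idx in enumerate(order):
        tiers[idx] = min(rank * num_tiers // n, num_tiers - 1)
    return tiers

# Hypothetical per-sentence KL values for a single problem.
kl = [0.4, 2.1, 0.9, 5.3, 0.15, 1.2, 3.0, 0.6]

threshold_labels = fixed_threshold_labels(kl)
tier_labels = percentile_tier_labels(kl)

print(threshold_labels)
print(tier_labels)
```

Because every value here exceeds 0.1, the fixed threshold yields a constant label, while ranking within the problem produces four balanced tiers regardless of the absolute KL scale.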
+### 3. Answer tracking never worked (0% accuracy)
+The target format includes exact probability values (e.g., "step 5: P(answer)=0.73") which are impossible to match with exact string comparison. The task concept is sound but the format needs simplification. **Needs redesign.**
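One possible way to score this format without exact string comparison is to parse the step and probability and compare them within a tolerance. The sketch below is a hypothetical scorer, not the planned redesign: the regex assumes the "step N: P(answer)=X" format quoted above, and the 0.1 tolerance is an arbitrary choice.

```python
import re

# Matches entries such as "step 5: P(answer)=0.73".
STEP_PATTERN = re.compile(r"step\s+(\d+):\s*P\(answer\)=([0-9]*\.?[0-9]+)")

def parse_track(text):
    """Extract a {step: probability} mapping from an answer-tracking string."""
    return {int(step): float(prob) for step, prob in STEP_PATTERN.findall(text)}

def track_matches(prediction, target, tolerance=0.1):
    """True when both strings cover the same steps and every probability is within tolerance."""
    pred, gold = parse_track(prediction), parse_track(target)
    if not gold or pred.keys() != gold.keys():
        return False
    return all(abs(pred[step] - gold[step]) <= tolerance for step in gold)

target = "step 5: P(answer)=0.73"
close_prediction = "step 5: P(answer)=0.70"
far_prediction = "step 5: P(answer)=0.20"

print(close_prediction == target)               # exact match fails
print(track_matches(close_prediction, target))  # tolerance match succeeds
print(track_matches(far_prediction, target))    # clearly wrong value still fails
```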
+### 4. Context prediction barely improved over baseline
+Context prediction went from 11% to 15% over training. The pre-trained AO checkpoint already handles this task, so limited improvement is expected. The small corpus (200 problems) may also limit this.
+### 5. Unfaithfulness detection doesn't work
+When tested on authority bias and hint-following eval sets, the oracle gives **the same response for every item** regardless of whether the model was actually influenced. It reads the oracle prompt (which contains answer options) rather than the activations. This is expected — the training data contains no unfaithfulness-specific tasks. **The oracle cannot detect unfaithful reasoning in its current form.**
+### 6. Eval uses exact string match
+AO's eval framework uses exact string match, which is overly strict for open-ended responses. A target label like "active_computation" is scored as wrong if the model outputs "Active computation" or "active computation step". Actual model capability may therefore be somewhat higher than the reported numbers suggest.
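A more lenient comparison could normalize case and separators before matching, as in the sketch below. This is an assumed scoring rule for illustration only; it is neither AO's eval code nor the fuzzy scoring planned for v3, and the function names are hypothetical.

```python
import re

def normalize(text):
    """Lowercase and collapse any run of non-alphanumeric characters to one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

def lenient_match(output, target):
    """Accept the target label itself or the label followed by extra words."""
    norm_out, norm_tgt = normalize(output), normalize(target)
    return norm_out == norm_tgt or norm_out.startswith(norm_tgt + "_")

target = "active_computation"
for output in ["Active computation", "active computation step", "self checking"]:
    print(output, "->", output == target, lenient_match(output, target))
```

Prefix matching would produce false positives if one label name were a prefix of another, so a rule like this needs checking against the full label set before use.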
+## What Works
+- **Taxonomy classification genuinely works.** Given an activation at a CoT sentence boundary, the oracle can identify whether it's problem_setup, active_computation, self_checking, etc. at 64-72% accuracy from step 4000 onward (roughly 5-6x the 12.5% random baseline).
+- **The activation oracle framework works.** LoRA injection, norm-matched addition at layer 1, on-the-fly activation collection — the plumbing is solid.
+- **Data shuffling matters.** v1 (unshuffled) showed fake "grokking" as the model encountered each task sequentially. v2 (shuffled) learns all tasks simultaneously without catastrophic forgetting.
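The shuffling point can be illustrated with a toy example: concatenating tasks gives one long block per task, while shuffling interleaves them so every stretch of training sees a mix. The function name, task sizes, and seed below are hypothetical, and this is not the training code.

```python
import random

def build_mixed_stream(task_examples, seed=0):
    """Flatten per-task example lists into one stream and shuffle it."""
    mixed = [(task, ex) for task, examples in task_examples.items() for ex in examples]
    random.Random(seed).shuffle(mixed)
    return mixed

def count_task_switches(stream):
    """Number of positions where the task differs from the previous example."""
    return sum(1 for prev, cur in zip(stream, stream[1:]) if prev[0] != cur[0])

# Toy data: three tasks with 20 placeholder examples each.
task_examples = {
    "taxonomy": list(range(20)),
    "importance": list(range(20)),
    "context_pred": list(range(20)),
}

sequential = [(task, ex) for task, examples in task_examples.items() for ex in examples]
shuffled = build_mixed_stream(task_examples)

print(count_task_switches(sequential))  # one switch per task boundary
print(count_task_switches(shuffled))    # many switches after shuffling
```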
+## Checkpoints
+This repo contains multiple checkpoints:
+- `step_1000/` through `step_6000/` — saved every 1000 steps
+- `final/` — end of training
+Taxonomy accuracy peaked at step 4500 (72%), but no checkpoint was saved at that step; the nearest saved checkpoints are `step_4000/` (65%) and `step_5000/` (64%). We recommend `step_5000/` as a reasonable all-round checkpoint.
+## Usage
+```python
+import torch
+from peft import PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load the base model in bf16, then attach the LoRA adapter from the chosen checkpoint subfolder.
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto")
+model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-8b-v2", subfolder="step_5000")
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
+# Base-model inference: run generation inside `with model.disable_adapter():` (a context manager).
+# Oracle inference: the adapter is loaded under the name "default" and is active after loading;
+# call model.set_adapter("default") to re-select it explicitly.
+# See https://github.com/ceselder/cot-oracle for full usage.
+```
+## What's Next (v3)
+- Fixed importance labels (within-problem percentile ranking, 4 balanced tiers)
+- Synthesized unique summary labels per problem (from importance + taxonomy data)
+- Unfaithfulness-specific training tasks
+- Fuzzy eval scoring instead of exact string match
+- Larger corpus (>200 problems)
+## Citation
+This work builds on:
+- [Activation Oracles](https://arxiv.org/abs/2512.15674) (Karvonen et al., 2025)
+- [Thought Anchors](https://arxiv.org/abs/2506.19143) (Bogdan et al., 2025)
+- [Thought Branches](https://arxiv.org/abs/2510.27484) (Macar, Bogdan et al., 2025)