---
license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
- activation-oracle
- chain-of-thought
- interpretability
- lora
- peft
datasets:
- ceselder/qwen3-8b-math-cot-corpus
library_name: peft
---

# CoT Oracle v2: Qwen3-8B (Experimental)

An activation oracle fine-tuned to analyze chain-of-thought reasoning traces by reading internal activations. This is an **early experimental checkpoint** with known data quality issues (see below).

## What This Is

This is a LoRA adapter for [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) trained to read the model's own internal activations at CoT sentence boundaries and answer questions about the reasoning process. It builds on the [Activation Oracles](https://arxiv.org/abs/2512.15674) framework by Karvonen et al.

The oracle is initialized from the pre-trained AO checkpoint ([adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B)) and further fine-tuned on CoT-specific tasks.
## Training Details

- **Base model:** Qwen/Qwen3-8B (36 layers)
- **Starting point:** Pre-trained AO checkpoint (context prediction + classification tasks)
- **Training data:** ~100K examples from 200 math problems (100 MATH-500 + 100 GSM8K)
  - 45K context prediction (PastLens-style)
  - 15K sentence importance classification
  - 15K sentence taxonomy classification
  - 10K answer tracking (logit lens)
  - 15K reasoning summary
- **Hyperparameters:** lr=1e-5, batch_size=16, 1 epoch (6,218 steps), gradient checkpointing
- **LoRA config:** rank 64, alpha 128, dropout 0.05, all-linear
- **Hardware:** 1x H100 80GB, bf16, ~1.5 hours
- **wandb:** [cot_oracle/runs/ejp28bev](https://wandb.ai/celestedeschamphelaere-personal/cot_oracle/runs/ejp28bev)
- **Corpus:** [ceselder/qwen3-8b-math-cot-corpus](https://huggingface.co/datasets/ceselder/qwen3-8b-math-cot-corpus)

## Results (Exact String Match, 100 eval items per task)

| Step | context_pred | importance | taxonomy | answer_track | summary |
|------|--------------|------------|----------|--------------|---------|
| 0 | 11% | 0% | 0% | 0% | 0% |
| 1000 | 9% | 20% | 52% | 0% | 100% |
| 2000 | 15% | 48% | 60% | 0% | 100% |
| 3000 | 14% | 47% | 60% | 0% | 100% |
| 4000 | 13% | 47% | 65% | 0% | 100% |
| 4500 | 12% | 48% | **72%** | 0% | 100% |
| 5000 | 14% | 48% | 64% | 0% | 100% |
| 6000 | 15% | 48% | 67% | 0% | 100% |
| final | 15% | 48% | 66% | 0% | 100% |

**Taxonomy** (8-class sentence type classification) is the strongest result, at 64-72% accuracy from step 4000 onward (random baseline: 12.5%). **Importance** plateaued at ~48% (4-class, random baseline: 25%).

## Known Issues: Read Before Using

This is an honest accounting of what went wrong. We're publishing this for transparency and to save others from the same mistakes.

### 1. Summary labels are useless (100% = memorized garbage)

All 200 summary labels were identical: "The model performed step-by-step computation to arrive at the answer."
The LLM-based label generation fell back to a generic template for every problem. The model memorized this single string perfectly, giving a misleading 100% accuracy. **This task provides zero signal.**

### 2. Importance labels are badly skewed

The importance labels used a fixed KL divergence threshold (>0.1 = "important"), but 99.7% of sentences exceeded this threshold. The model was essentially trained on a near-constant label. The 48% accuracy on the 4-class eval is hard to interpret because the training labels don't match the eval labels (the eval uses within-problem percentile ranking, which was implemented after training started). **Fixed for v3.**

### 3. Answer tracking never worked (0% accuracy)

The target format includes exact probability values (e.g., "step 5: P(answer)=0.73"), which are impossible to match with exact string comparison. The task concept is sound, but the format needs simplification. **Needs redesign.**

### 4. Context prediction barely improved over baseline

Context prediction went from 11% to 15% over training. The pre-trained AO checkpoint already handles this task, so limited improvement is expected. The small corpus (200 problems) may also limit this.

### 5. Unfaithfulness detection doesn't work

When tested on authority bias and hint-following eval sets, the oracle gives **the same response for every item** regardless of whether the model was actually influenced. It reads the oracle prompt (which contains answer options) rather than the activations. This is expected, since the training data contains no unfaithfulness-specific tasks. **The oracle cannot detect unfaithful reasoning in its current form.**

### 6. Eval uses exact string match

AO's eval framework uses exact string match, which is overly strict for open-ended responses. A target label such as "active_computation" is scored as a miss if the model outputs "Active computation" or "active computation step". Actual model capability may be somewhat higher than the reported numbers suggest.
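
To make the strictness of exact matching concrete, here is a minimal sketch of a more lenient label comparison. It is not the AO eval code and not the planned v3 scorer; `normalize_label` and `lenient_match` are hypothetical helper names, and the prefix rule is just one possible design.

```python
import re


def normalize_label(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def lenient_match(prediction: str, target: str) -> bool:
    """True if the normalized prediction equals the target label,
    or starts with it followed by extra words."""
    pred, gold = normalize_label(prediction), normalize_label(target)
    return pred == gold or pred.startswith(gold + "_")


# Hypothetical oracle outputs scored against the target "active_computation".
for output in ["active_computation", "Active computation",
               "active computation step", "self_checking"]:
    print(f"{output!r}: {lenient_match(output, 'active_computation')}")
```

The prefix rule can over-credit if one label name is a prefix of another, so a scorer like this would need to be checked against the actual label set before use.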
## What Works

- **Taxonomy classification genuinely works.** Given an activation at a CoT sentence boundary, the oracle can identify whether it's problem_setup, active_computation, self_checking, etc. at ~64-72% accuracy (5-6x random baseline).
- **The activation oracle framework works.** The plumbing is solid: LoRA injection, norm-matched addition at layer 1, and on-the-fly activation collection.
- **Data shuffling matters.** v1 (unshuffled) showed fake "grokking" as the model encountered each task sequentially. v2 (shuffled) learns all tasks simultaneously without catastrophic forgetting.

## Checkpoints

This repo contains multiple checkpoints:

- `step_1000/` through `step_6000/`: saved every 1000 steps
- `final/`: end of training

Peak taxonomy performance was measured at step 4500 (72%), but checkpoints were only saved at 1000-step intervals, so the nearest available ones are step_4000 (65%) and step_5000 (64%). Step 5000 is recommended as a reasonable all-round checkpoint.

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model, then attach the oracle LoRA adapter from the step_5000 checkpoint.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-8b-v2", subfolder="step_5000")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Wrap calls in `with model.disable_adapter():` for base model inference
# Use model.set_adapter("default") for oracle inference
# See: https://github.com/ceselder/cot-oracle for full usage
```

## What's Next (v3)

- Fixed importance labels (within-problem percentile ranking, 4 balanced tiers)
- Synthesized unique summary labels per problem (from importance + taxonomy data)
- Unfaithfulness-specific training tasks
- Fuzzy eval scoring instead of exact string match
- Larger corpus (>200 problems)

## Citation

This work builds on:

- [Activation Oracles](https://arxiv.org/abs/2512.15674) (Karvonen et al., 2025)
- [Thought Anchors](https://arxiv.org/abs/2506.19143) (Bogdan et al., 2025)
- [Thought Branches](https://arxiv.org/abs/2510.27484) (Macar, Bogdan et al., 2025)
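
As a closing illustration of the importance-label issue described above, the sketch below contrasts a fixed KL threshold with within-problem percentile ranking into four tiers. It is not the v3 implementation: `percentile_tiers` is a hypothetical helper, and the KL values are made up for the example.

```python
def percentile_tiers(kl_scores, n_tiers=4):
    """Assign each sentence a tier in 0..n_tiers-1 by its rank within one problem.

    Ties are broken by sentence position, because sorted() is stable.
    """
    order = sorted(range(len(kl_scores)), key=lambda i: kl_scores[i])
    tiers = [0] * len(kl_scores)
    for rank, idx in enumerate(order):
        tiers[idx] = rank * n_tiers // len(kl_scores)
    return tiers


# Made-up per-sentence KL divergences for a single problem (illustrative only).
kl = [0.15, 2.3, 0.9, 0.4, 5.1, 0.2, 1.1, 0.6]

fixed_labels = [score > 0.1 for score in kl]  # fixed threshold: every sentence is "important"
tier_labels = percentile_tiers(kl)            # ranking: two sentences land in each tier

print(fixed_labels)
print(tier_labels)
```

Because the tiers depend only on rank within a problem, the label distribution stays balanced no matter how large the raw KL values are.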