ceselder committed
Commit 6668445 · verified · 1 Parent(s): 9ad099b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +124 -0
README.md ADDED
@@ -0,0 +1,124 @@
+ ---
+ license: apache-2.0
+ base_model: Qwen/Qwen3-8B
+ tags:
+ - activation-oracle
+ - chain-of-thought
+ - interpretability
+ - lora
+ - peft
+ datasets:
+ - ceselder/qwen3-8b-math-cot-corpus
+ library_name: peft
+ ---
+
+ # CoT Oracle v2 — Qwen3-8B (Experimental)
+
+ An activation oracle fine-tuned to analyze chain-of-thought reasoning traces by reading internal activations. This is an **early experimental checkpoint** with known data quality issues (see "Known Issues" below).
+
+ ## What This Is
+
+ This is a LoRA adapter for [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) trained to read the model's own internal activations at CoT sentence boundaries and answer questions about the reasoning process. It builds on the [Activation Oracles](https://arxiv.org/abs/2512.15674) framework by Karvonen et al.
+
+ The oracle is initialized from the pre-trained AO checkpoint ([adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B)) and further fine-tuned on CoT-specific tasks.
+
+ ## Training Details
+
+ - **Base model:** Qwen/Qwen3-8B (36 layers)
+ - **Starting point:** Pre-trained AO checkpoint (context prediction + classification tasks)
+ - **Training data:** ~100K examples from 200 math problems (100 MATH-500 + 100 GSM8K)
+   - 45K context prediction (PastLens-style)
+   - 15K sentence importance classification
+   - 15K sentence taxonomy classification
+   - 10K answer tracking (logit lens)
+   - 15K reasoning summary
+ - **Hyperparameters:** lr=1e-5, batch_size=16, 1 epoch (6,218 steps), gradient checkpointing
+ - **LoRA config:** rank 64, alpha 128, dropout 0.05, all-linear
+ - **Hardware:** 1x H100 80GB, bf16, ~1.5 hours
+ - **wandb:** [cot_oracle/runs/ejp28bev](https://wandb.ai/celestedeschamphelaere-personal/cot_oracle/runs/ejp28bev)
+ - **Corpus:** [ceselder/qwen3-8b-math-cot-corpus](https://huggingface.co/datasets/ceselder/qwen3-8b-math-cot-corpus)
+
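The task mix and LoRA settings listed above can be written down as plain Python for a quick sanity check. This is a sketch only: the variable names are hypothetical, the step estimate assumes one optimizer step per batch of 16 (no gradient accumulation), and `lora_kwargs` simply mirrors the listed values in the form one would typically pass to `peft.LoraConfig`. It is not taken from the actual training script.

```python
import math

# Task mix as listed in the README (approximate example counts).
task_mix = {
    "context_prediction": 45_000,
    "importance": 15_000,
    "taxonomy": 15_000,
    "answer_tracking": 10_000,
    "summary": 15_000,
}
total_examples = sum(task_mix.values())

batch_size = 16
reported_steps = 6_218

# Steps expected for one epoch if the mix were exactly 100K examples
# (assumption: one optimizer step per batch, no gradient accumulation).
expected_steps = math.ceil(total_examples / batch_size)

# Number of examples implied by the reported step count.
implied_examples = reported_steps * batch_size

# Hypothetical kwargs mirroring the listed LoRA config; with peft installed
# they could be passed as peft.LoraConfig(**lora_kwargs).
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": "all-linear",
    "task_type": "CAUSAL_LM",
}

print(total_examples, expected_steps, implied_examples)
```

Under that assumption, the reported 6,218 steps correspond to about 99.5K examples, which is consistent with the "~100K" figure.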
+ ## Results (Exact String Match, 100 eval items per task)
+
+ | Step  | context_pred | importance | taxonomy | answer_track | summary |
+ |-------|--------------|------------|----------|--------------|---------|
+ | 0     | 11%          | 0%         | 0%       | 0%           | 0%      |
+ | 1000  | 9%           | 20%        | 52%      | 0%           | 100%    |
+ | 2000  | 15%          | 48%        | 60%      | 0%           | 100%    |
+ | 3000  | 14%          | 47%        | 60%      | 0%           | 100%    |
+ | 4000  | 13%          | 47%        | 65%      | 0%           | 100%    |
+ | 4500  | 12%          | 48%        | **72%**  | 0%           | 100%    |
+ | 5000  | 14%          | 48%        | 64%      | 0%           | 100%    |
+ | 6000  | 15%          | 48%        | 67%      | 0%           | 100%    |
+ | final | 15%          | 48%        | 66%      | 0%           | 100%    |
+
+ **Taxonomy** (8-class sentence type classification) is the strongest result at 64-72% accuracy from step 4000 onward (random baseline: 12.5%).
+
+ **Importance** plateaued at ~48% from step 2000 onward (4-class, random baseline: 25%).
+
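To put the two classification results on a common scale, the accuracies can be divided by the uniform-random baseline. This is a small illustrative sketch: the `results` dictionary and `lift_over_chance` are hypothetical names, and the numbers are the final-checkpoint values from the table. A uniform baseline ignores class imbalance, so a majority-class baseline could be higher.

```python
# Final-checkpoint accuracies from the results table, with the number of
# classes stated in the README for the two classification tasks.
results = {
    "taxonomy": {"accuracy": 0.66, "num_classes": 8},
    "importance": {"accuracy": 0.48, "num_classes": 4},
}


def lift_over_chance(accuracy: float, num_classes: int) -> float:
    """Ratio of observed accuracy to uniform-random accuracy (1 / num_classes)."""
    return accuracy * num_classes


lifts = {task: lift_over_chance(**r) for task, r in results.items()}
for task, lift in lifts.items():
    print(f"{task}: {lift:.2f}x chance")
```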
+ ## Known Issues: Read Before Using
+
+ This is an honest accounting of what went wrong. We're publishing this for transparency and to save others from the same mistakes.
+
+ ### 1. Summary labels are useless (100% = memorized garbage)
+ All 200 summary labels were identical: "The model performed step-by-step computation to arrive at the answer." The LLM-based label generation fell back to a generic template for every problem. The model memorized this single string perfectly, giving a misleading 100% accuracy. **This task provides zero signal.**
+
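A degenerate label set like this can be caught before training with a majority-class check. The following is a minimal sketch, not part of the original pipeline: `majority_baseline` and the 0.9 cutoff are hypothetical choices for illustration.

```python
from collections import Counter


def majority_baseline(labels: list[str]) -> tuple[str, float]:
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Situation described above: one template string used for all 200 problems.
template = "The model performed step-by-step computation to arrive at the answer."
summary_labels = [template] * 200

label, baseline = majority_baseline(summary_labels)
if baseline > 0.9:  # hypothetical cutoff for "near-constant" labels
    print(f"degenerate label set: {baseline:.0%} of targets are identical")
```

If the majority baseline is already close to 100%, a high exact-match score says nothing about whether the model reads the activations.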
+ ### 2. Importance labels are badly skewed
+ The importance labels used a fixed KL divergence threshold (>0.1 = "important"), but 99.7% of sentences exceeded this threshold, so the model was effectively trained on a near-constant label. The 48% accuracy on the 4-class eval is hard to interpret because the training labels don't match the eval labels (the eval uses within-problem percentile ranking, which was implemented after training started). **Fixed for v3.**
+
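The difference between the two labelling schemes can be illustrated on a toy example. This sketch assumes one plausible reading of "within-problem percentile ranking" with four equal tiers. The function names and KL values are made up, and the actual v3 implementation may differ.

```python
def threshold_labels(kl_scores, threshold=0.1):
    """Fixed-threshold labelling as described for v2."""
    return ["important" if s > threshold else "unimportant" for s in kl_scores]


def percentile_tiers(kl_scores, num_tiers=4):
    """Rank sentences within one problem and split the ranks into equal tiers.

    Tier 0 is the least important group, tier num_tiers - 1 the most.
    Ties are broken by position, which is an illustrative choice.
    """
    order = sorted(range(len(kl_scores)), key=lambda i: kl_scores[i])
    tiers = [0] * len(kl_scores)
    for rank, idx in enumerate(order):
        tiers[idx] = rank * num_tiers // len(kl_scores)
    return tiers


# Made-up KL values for one problem; all of them exceed the 0.1 threshold.
scores = [0.4, 2.1, 0.9, 5.3, 0.2, 1.5, 3.0, 0.6]

print(threshold_labels(scores))  # every sentence is "important"
print(percentile_tiers(scores))  # two sentences per tier
```

Ranking within a problem yields balanced classes by construction, whatever the absolute scale of the KL values.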
+ ### 3. Answer tracking never worked (0% accuracy)
+ The target format includes exact probability values (e.g., "step 5: P(answer)=0.73"), which are practically impossible to reproduce character-for-character under exact string comparison. The task concept is sound, but the format needs simplification. **Needs redesign.**
+
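One possible way to score this format without changing the targets is to parse the step and probability and compare within a tolerance. This is a hedged sketch rather than the planned redesign: the regular expression assumes the exact format quoted above, and `parse_track`, `tolerant_match` and the 0.05 tolerance are hypothetical.

```python
import re

# Pattern for the target format quoted above, e.g. "step 5: P(answer)=0.73".
STEP_PROB = re.compile(r"step\s+(\d+):\s*P\(answer\)=([0-9]*\.?[0-9]+)")


def parse_track(text: str) -> dict[int, float]:
    """Extract {step: probability} pairs from an answer-tracking string."""
    return {int(step): float(prob) for step, prob in STEP_PROB.findall(text)}


def tolerant_match(prediction: str, target: str, tol: float = 0.05) -> bool:
    """True if both strings cover the same steps and each probability is within tol."""
    pred, gold = parse_track(prediction), parse_track(target)
    if not gold or pred.keys() != gold.keys():
        return False
    return all(abs(pred[s] - gold[s]) <= tol for s in gold)


target = "step 5: P(answer)=0.73"
prediction = "step 5: P(answer)=0.71"

print(prediction == target)                # exact match fails
print(tolerant_match(prediction, target))  # tolerance-based match succeeds
```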
+ ### 4. Context prediction barely improved over baseline
+ Context prediction went from 11% to 15% over training (a difference of 4 of the 100 eval items). The pre-trained AO checkpoint already handles this task, so limited improvement is expected. The small corpus (200 problems) may also limit further gains.
+
+ ### 5. Unfaithfulness detection doesn't work
+ When tested on authority bias and hint-following eval sets, the oracle gives **the same response for every item** regardless of whether the model was actually influenced. It reads the oracle prompt (which contains answer options) rather than the activations. This is expected, since the training data contains no unfaithfulness-specific tasks. **The oracle cannot detect unfaithful reasoning in its current form.**
+
+ ### 6. Eval uses exact string match
+ AO's eval framework uses exact string match, which is overly strict for open-ended responses. A target label such as "active_computation" is scored as wrong if the model outputs "Active computation" or "active computation step". Actual model capability may be somewhat higher than the reported numbers suggest.
+
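The failure cases named above can be handled by normalizing both strings before comparing. The sketch below is one simple option, not the scoring planned for v3: `normalize` and `lenient_match` are hypothetical helpers, and the prefix rule is a deliberate simplification that could over-accept in a taxonomy where one label is a prefix of another.

```python
import re


def normalize(label: str) -> str:
    """Lowercase and collapse spaces, hyphens and underscores into single underscores."""
    return re.sub(r"[\s_\-]+", "_", label.strip().lower())


def lenient_match(prediction: str, target: str) -> bool:
    """Accept a prediction that equals the target, or starts with it, after normalization."""
    pred, gold = normalize(prediction), normalize(target)
    return pred == gold or pred.startswith(gold + "_")


target = "active_computation"
outputs = ["active_computation", "Active computation", "active computation step", "self_checking"]
for output in outputs:
    print(f"{output!r}: exact={output == target}, lenient={lenient_match(output, target)}")
```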
+ ## What Works
+
+ - **Taxonomy classification genuinely works.** Given an activation at a CoT sentence boundary, the oracle can identify whether it's problem_setup, active_computation, self_checking, etc. at 64-72% accuracy from step 4000 onward (roughly 5-6x the 12.5% random baseline).
+ - **The activation oracle framework works.** LoRA injection, norm-matched addition at layer 1, and on-the-fly activation collection all function as intended: the plumbing is solid.
+ - **Data shuffling matters.** v1 (unshuffled) showed fake "grokking" as the model encountered each task sequentially. v2 (shuffled) learns all tasks simultaneously without catastrophic forgetting.
+
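The shuffling point can be sketched as follows: per-task example lists are flattened into one list and shuffled once, so that every batch mixes tasks instead of presenting them one after another. The function name, seed and toy data are hypothetical, and the real data loader may work differently.

```python
import random


def build_training_order(task_examples: dict[str, list], seed: int = 0) -> list[tuple[str, object]]:
    """Flatten per-task example lists into one list and shuffle it reproducibly."""
    mixed = [(task, ex) for task, examples in task_examples.items() for ex in examples]
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy stand-in for the real task datasets (sizes are illustrative only).
toy = {
    "taxonomy": list(range(50)),
    "importance": list(range(50)),
    "summary": list(range(50)),
}

order = build_training_order(toy)
first_batch_tasks = {task for task, _ in order[:16]}
print(first_batch_tasks)
```

Without the shuffle, the first 50 items would all be taxonomy examples. That sequential pattern is what produced the stepwise, grokking-like curves described for v1.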
+ ## Checkpoints
+
+ This repo contains multiple checkpoints:
+ - `step_1000/` through `step_6000/`: saved every 1000 steps
+ - `final/`: end of training
+
+ Peak taxonomy performance was measured at step 4500 (72%), but that step was not saved. The nearest saved checkpoints are `step_4000/` (65%) and `step_5000/` (64%). Step 5000 is recommended as a reasonable all-round checkpoint.
+
+ ## Usage
+
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # Load the base model, then attach the oracle LoRA adapter from the step_5000 checkpoint
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto")
+ model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-8b-v2", subfolder="step_5000")
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
+
+ # Base model inference: run generation inside `with model.disable_adapter():`
+ # Oracle inference: the adapter loaded above is named "default"; use model.set_adapter("default") to select it
+ # See: https://github.com/ceselder/cot-oracle for full usage
+ ```
+
+ ## What's Next (v3)
+
+ - Fixed importance labels (within-problem percentile ranking, 4 balanced tiers)
+ - Synthesized unique summary labels per problem (from importance + taxonomy data)
+ - Unfaithfulness-specific training tasks
+ - Fuzzy eval scoring instead of exact string match
+ - Larger corpus (>200 problems)
+
+ ## Citation
+
+ This work builds on:
+ - [Activation Oracles](https://arxiv.org/abs/2512.15674) (Karvonen et al., 2025)
+ - [Thought Anchors](https://arxiv.org/abs/2506.19143) (Bogdan et al., 2025)
+ - [Thought Branches](https://arxiv.org/abs/2510.27484) (Macar, Bogdan et al., 2025)