Upload README.md with huggingface_hub
README.md
ADDED
|
@@ -0,0 +1,124 @@
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
base_model: Qwen/Qwen3-8B
|
| 4 |
+
tags:
|
| 5 |
+
- activation-oracle
|
| 6 |
+
- chain-of-thought
|
| 7 |
+
- interpretability
|
| 8 |
+
- lora
|
| 9 |
+
- peft
|
| 10 |
+
datasets:
|
| 11 |
+
- ceselder/qwen3-8b-math-cot-corpus
|
| 12 |
+
library_name: peft
|
| 13 |
+
---
|
| 14 |
+
|
| 15 |
+
# CoT Oracle v2 — Qwen3-8B (Experimental)
|
| 16 |
+
|
| 17 |
+
An activation oracle fine-tuned to analyze chain-of-thought reasoning traces by reading internal activations. This is an **early experimental checkpoint** with known data quality issues — see below.
|
| 18 |
+
|
| 19 |
+
## What This Is
|
| 20 |
+
|
| 21 |
+
This is a LoRA adapter for [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) trained to read the model's own internal activations at CoT sentence boundaries and answer questions about the reasoning process. It builds on the [Activation Oracles](https://arxiv.org/abs/2512.15674) framework by Karvonen et al.
|
| 22 |
+
|
| 23 |
+
The oracle is initialized from the pre-trained AO checkpoint ([adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B)) and further fine-tuned on CoT-specific tasks.
|
| 24 |
+
|
| 25 |
+
## Training Details
|
| 26 |
+
|
| 27 |
+
- **Base model:** Qwen/Qwen3-8B (36 layers)
|
| 28 |
+
- **Starting point:** Pre-trained AO checkpoint (context prediction + classification tasks)
|
| 29 |
+
- **Training data:** ~100K examples from 200 math problems (100 MATH-500 + 100 GSM8K)
|
| 30 |
+
  - 45K context prediction (PastLens-style)
  - 15K sentence importance classification
  - 15K sentence taxonomy classification
  - 10K answer tracking (logit lens)
  - 15K reasoning summary
|
| 35 |
+
- **Hyperparameters:** lr=1e-5, batch_size=16, 1 epoch (6,218 steps), gradient checkpointing
|
| 36 |
+
- **LoRA config:** rank 64, alpha 128, dropout 0.05, all-linear
|
| 37 |
+
- **Hardware:** 1x H100 80GB, bf16, ~1.5 hours
|
| 38 |
+
- **wandb:** [cot_oracle/runs/ejp28bev](https://wandb.ai/celestedeschamphelaere-personal/cot_oracle/runs/ejp28bev)
|
| 39 |
+
- **Corpus:** [ceselder/qwen3-8b-math-cot-corpus](https://huggingface.co/datasets/ceselder/qwen3-8b-math-cot-corpus)
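
The task mix and step count listed above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming the per-task counts are the rounded figures from the list and that each optimizer step consumes exactly one batch of 16 examples (gradient accumulation is not mentioned, so that part is an assumption).

```python
# Rounded per-task example counts taken from the training-data list.
task_counts = {
    "context_prediction": 45_000,
    "importance": 15_000,
    "taxonomy": 15_000,
    "answer_tracking": 10_000,
    "summary": 15_000,
}
batch_size = 16
reported_steps = 6_218

total_examples = sum(task_counts.values())
# Examples consumed if every step uses one full batch (assumption).
examples_seen = reported_steps * batch_size

for task, count in task_counts.items():
    print(f"{task:20s} {count / total_examples:.0%}")
print("listed total:", total_examples)
print("implied by steps:", examples_seen)
```

The two totals agree to within about 0.5%, which is consistent with "~100K examples" seen once in a single epoch.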
|
| 40 |
+
|
| 41 |
+
## Results (Exact String Match, 100 eval items per task)
|
| 42 |
+
|
| 43 |
+
| Step | context_pred | importance | taxonomy | answer_track | summary |
|------|--------------|------------|----------|--------------|---------|
| 0 | 11% | 0% | 0% | 0% | 0% |
| 1000 | 9% | 20% | 52% | 0% | 100% |
| 2000 | 15% | 48% | 60% | 0% | 100% |
| 3000 | 14% | 47% | 60% | 0% | 100% |
| 4000 | 13% | 47% | 65% | 0% | 100% |
| 4500 | 12% | 48% | **72%** | 0% | 100% |
| 5000 | 14% | 48% | 64% | 0% | 100% |
| 6000 | 15% | 48% | 67% | 0% | 100% |
| final | 15% | 48% | 66% | 0% | 100% |
|
| 54 |
+
|
| 55 |
+
**Taxonomy** (8-class sentence type classification) is the strongest result at 65-72% accuracy (random baseline: 12.5%).
|
| 56 |
+
|
| 57 |
+
**Importance** plateaued at ~48% (4-class, random baseline: 25%).
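
The random baselines quoted above follow directly from the class counts. A small sketch of that arithmetic, assuming uniform guessing over balanced classes (the function name is illustrative):

```python
def random_baseline(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing over balanced classes."""
    return 1.0 / num_classes

taxonomy_baseline = random_baseline(8)    # 8-class taxonomy task
importance_baseline = random_baseline(4)  # 4-class importance task

# Lift over chance for the accuracy figures reported in the table.
taxonomy_lift_low = 0.65 / taxonomy_baseline
taxonomy_lift_high = 0.72 / taxonomy_baseline
importance_lift = 0.48 / importance_baseline

print(f"taxonomy: {taxonomy_lift_low:.2f}x to {taxonomy_lift_high:.2f}x chance")
print(f"importance: {importance_lift:.2f}x chance")
```

Taxonomy lands at roughly 5 to 6 times chance, while importance is just under 2 times chance.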
|
| 58 |
+
|
| 59 |
+
## Known Issues — Read Before Using
|
| 60 |
+
|
| 61 |
+
This is an honest accounting of what went wrong. We're publishing this for transparency and to save others from the same mistakes.
|
| 62 |
+
|
| 63 |
+
### 1. Summary labels are useless (100% = memorized garbage)
|
| 64 |
+
All 200 summary labels were identical: "The model performed step-by-step computation to arrive at the answer." The LLM-based label generation fell back to a generic template for every problem. The model memorized this single string perfectly, giving a misleading 100% accuracy. **This task provides zero signal.**
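
A cheap pre-training check would have caught this. The sketch below is not part of the released pipeline; `label_diversity` is a hypothetical helper that reports how many distinct labels a task has and how dominant the most common one is.

```python
from collections import Counter

def label_diversity(labels):
    """Return (number of unique labels, share of the most common label)."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return len(counts), most_common_count / len(labels)

generic = "The model performed step-by-step computation to arrive at the answer."
summary_labels = [generic] * 200  # mirrors the 200 identical summary labels

unique, top_share = label_diversity(summary_labels)
print(f"unique labels: {unique}, top label share: {top_share:.0%}")
```

A task with one unique label and a top-label share of 100% can be learned without reading any activations, so its accuracy carries no information.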
|
| 65 |
+
|
| 66 |
+
### 2. Importance labels are badly skewed
|
| 67 |
+
The importance labels used a fixed KL divergence threshold (>0.1 = "important"), but 99.7% of sentences exceeded it, so the model was trained on a near-constant label. The 48% accuracy on the 4-class eval is hard to interpret because the training labels don't match the eval labels: the eval uses within-problem percentile ranking, which was implemented after training started. **Fixed for v3.**
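
To illustrate the difference between the two labeling schemes, here is a minimal sketch. It is an assumption about what "within-problem percentile ranking" could look like, not the actual v3 code; the function names and the KL values are made up for the example.

```python
def fixed_threshold_labels(kl_values, threshold=0.1):
    """v2-style labels: a single global cutoff on KL divergence."""
    return ["important" if kl > threshold else "unimportant" for kl in kl_values]

def percentile_tiers(kl_values, num_tiers=4):
    """Assign each sentence a tier 0..num_tiers-1 by its rank within one problem."""
    n = len(kl_values)
    order = sorted(range(n), key=lambda i: kl_values[i])
    tiers = [0] * n
    for rank, idx in enumerate(order):
        tiers[idx] = rank * num_tiers // n
    return tiers

# Hypothetical per-sentence KL values for one problem.
kl_per_sentence = [0.4, 2.1, 0.9, 5.3, 0.2, 1.5, 3.0, 0.7]

print(fixed_threshold_labels(kl_per_sentence))  # every sentence is "important"
print(percentile_tiers(kl_per_sentence))        # two sentences per tier
```

Ranking within each problem makes the tiers balanced by construction, whereas a fixed cutoff collapses to one class whenever most values clear it.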
|
| 68 |
+
|
| 69 |
+
### 3. Answer tracking never worked (0% accuracy)
|
| 70 |
+
The target format includes exact probability values (e.g., "step 5: P(answer)=0.73") which are impossible to match with exact string comparison. The task concept is sound but the format needs simplification. **Needs redesign.**
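
The sketch below shows why exact comparison fails on this format and one possible relaxation: parse the step and probability, then compare within a tolerance. The regular expression is based only on the example string above, and the 0.05 tolerance is an arbitrary illustrative choice, not a value from the project.

```python
import re

# Matches strings shaped like "step 5: P(answer)=0.73".
PATTERN = re.compile(r"step (\d+): P\(answer\)=([0-9]*\.?[0-9]+)")

def parse_answer_track(text):
    """Extract (step, probability) from an answer-tracking string, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

def tolerant_match(prediction, target, tol=0.05):
    """True if the steps agree and the probabilities differ by at most tol."""
    pred, gold = parse_answer_track(prediction), parse_answer_track(target)
    if pred is None or gold is None:
        return False
    return pred[0] == gold[0] and abs(pred[1] - gold[1]) <= tol

target = "step 5: P(answer)=0.73"
prediction = "step 5: P(answer)=0.71"

print(prediction == target)                # exact string match
print(tolerant_match(prediction, target))  # tolerance-based match
```

Under exact match, a prediction that is off by 0.02 counts the same as a completely wrong one, which is why the metric stays at 0%.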
|
| 71 |
+
|
| 72 |
+
### 4. Context prediction barely improved over baseline
|
| 73 |
+
Context prediction went from 11% to 15% over training. The pre-trained AO checkpoint already handles this task, so limited improvement is expected. The small corpus (200 problems) may also limit further gains.
|
| 74 |
+
|
| 75 |
+
### 5. Unfaithfulness detection doesn't work
|
| 76 |
+
When tested on authority bias and hint-following eval sets, the oracle gives **the same response for every item** regardless of whether the model was actually influenced. It reads the oracle prompt (which contains answer options) rather than the activations. This is expected — the training data contains no unfaithfulness-specific tasks. **The oracle cannot detect unfaithful reasoning in its current form.**
|
| 77 |
+
|
| 78 |
+
### 6. Eval uses exact string match
|
| 79 |
+
The AO eval framework uses exact string match, which is overly strict for open-ended responses. A label like "active_computation" is scored as wrong if the model outputs "Active computation" or "active computation step". Actual model capability may be somewhat higher than the reported numbers suggest.
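
As an illustration of a more lenient scorer, the sketch below normalizes case and separators before comparing, and also accepts the label followed by extra words. This is one possible relaxation, not the scoring used by the AO framework or planned for v3; both function names are hypothetical.

```python
import re

def normalize_label(text):
    """Lowercase and collapse spaces, hyphens, and underscores into single underscores."""
    return re.sub(r"[\s_\-]+", "_", text.strip().lower())

def lenient_match(prediction, target):
    """Accept the exact label, or the label followed by additional words."""
    pred, gold = normalize_label(prediction), normalize_label(target)
    return pred == gold or pred.startswith(gold + "_")

target = "active_computation"
for output in ["Active computation", "active computation step", "self_checking"]:
    print(output, "->", lenient_match(output, target))
```

The two near-miss outputs mentioned above are accepted, while a different label is still rejected.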
|
| 80 |
+
|
| 81 |
+
## What Works
|
| 82 |
+
|
| 83 |
+
- **Taxonomy classification genuinely works.** Given an activation at a CoT sentence boundary, the oracle can identify whether it's problem_setup, active_computation, self_checking, etc. at ~65-72% accuracy (5-6x random baseline).
|
| 84 |
+
- **The activation oracle framework works.** LoRA injection, norm-matched addition at layer 1, on-the-fly activation collection — the plumbing is solid.
|
| 85 |
+
- **Data shuffling matters.** v1 (unshuffled) showed fake "grokking" as the model encountered each task sequentially. v2 (shuffled) learns all tasks simultaneously without catastrophic forgetting.
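
The shuffling point can be sketched as follows: flatten the per-task example lists into one stream and shuffle it with a fixed seed, so every batch mixes tasks. This is a toy illustration with hypothetical names and placeholder examples, not the project's data loader.

```python
import random
from collections import Counter

def build_training_stream(examples_by_task, seed=0):
    """Flatten per-task example lists and shuffle them into one mixed stream."""
    stream = [
        (task, example)
        for task, examples in examples_by_task.items()
        for example in examples
    ]
    random.Random(seed).shuffle(stream)
    return stream

# Placeholder examples standing in for real training items.
toy = {
    "taxonomy": [f"tax_{i}" for i in range(4)],
    "importance": [f"imp_{i}" for i in range(4)],
    "context_pred": [f"ctx_{i}" for i in range(4)],
}
stream = build_training_stream(toy)
task_counts_in_stream = Counter(task for task, _ in stream)
print([task for task, _ in stream])
```

Without the shuffle, the stream would present all taxonomy items, then all importance items, and so on, which is the sequential exposure that produced the misleading v1 curves.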
|
| 86 |
+
|
| 87 |
+
## Checkpoints
|
| 88 |
+
|
| 89 |
+
This repo contains multiple checkpoints:
|
| 90 |
+
- `step_1000/` through `step_6000/` — saved every 1000 steps
|
| 91 |
+
- `final/` — end of training
|
| 92 |
+
|
| 93 |
+
Peak taxonomy performance is at step 4500 (72%), but checkpoints were saved only every 1000 steps, so the nearest saved ones are `step_4000/` (65%) and `step_5000/` (64%). `step_5000/` is recommended as a reasonable all-round checkpoint.
|
| 94 |
+
|
| 95 |
+
## Usage
|
| 96 |
+
|
| 97 |
+
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16 and place it on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach the oracle LoRA adapter from the step_5000 checkpoint subfolder.
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-8b-v2", subfolder="step_5000")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Use `with model.disable_adapter():` for base model inference
# Use model.set_adapter("default") for oracle inference
# See: https://github.com/ceselder/cot-oracle for full usage
```
|
| 110 |
+
|
| 111 |
+
## What's Next (v3)
|
| 112 |
+
|
| 113 |
+
- Fixed importance labels (within-problem percentile ranking, 4 balanced tiers)
|
| 114 |
+
- Synthesized unique summary labels per problem (from importance + taxonomy data)
|
| 115 |
+
- Unfaithfulness-specific training tasks
|
| 116 |
+
- Fuzzy eval scoring instead of exact string match
|
| 117 |
+
- Larger corpus (>200 problems)
|
| 118 |
+
|
| 119 |
+
## Citation
|
| 120 |
+
|
| 121 |
+
This work builds on:
|
| 122 |
+
- [Activation Oracles](https://arxiv.org/abs/2512.15674) (Karvonen et al., 2025)
|
| 123 |
+
- [Thought Anchors](https://arxiv.org/abs/2506.19143) (Bogdan et al., 2025)
|
| 124 |
+
- [Thought Branches](https://arxiv.org/abs/2510.27484) (Macar, Bogdan et al., 2025)
|