tylerxdurden/PolyChromaticLM-1.0-instruct-0.6B

Text Generation

activation-routing


tylerxdurden committed on Mar 7

Commit 3f0c8ee · verified · 1 Parent(s): 4ac133d

Upload README.md with huggingface_hub

Files changed (1)

README.md +17 -15


@@ -190,21 +190,23 @@ SFT modifies *what* is computed, not *how*: the routing mechanism (which activat
 All benchmarks via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-eval) v0.4.11, 0-shot unless noted.
-### Benchmarks (Base vs SFT)
-| Benchmark | Metric | Base | SFT | Delta | Random |
-|-----------|--------|-----:|----:|------:|-------:|
-| **HellaSwag** | acc_norm | 28.51 | 27.84 | -0.67 | 25.00 |
-| **ARC-Easy** | acc_norm | 41.04 | 36.11 | -4.93 | 25.00 |
-| **ARC-Challenge** | acc_norm | 22.27 | 24.15 | +1.88 | 25.00 |
-| **PIQA** | acc_norm | 58.87 | 54.52 | -4.35 | 50.00 |
-| **WinoGrande** | acc | 52.17 | 52.72 | +0.55 | 50.00 |
-| **BoolQ** | acc | 61.13 | 55.63 | -5.50 | 50.00 |
-| **MMLU-STEM** | acc (5-shot) | 25.28 | 28.42 | **+3.14** | 25.00 |
-| **LAMBADA** | acc | 15.35 | 7.01 | -8.34 | ~0 |
-| **OpenBookQA** | acc_norm | 29.00 | 26.80 | -2.20 | 25.00 |
-| **SciQ** | acc_norm | 61.20 | 52.70 | -8.50 | 25.00 |
-| **Mean** | | 39.48 | 36.59 | **-2.89** | |
 <div align="center">
 <img src="figures/sft_base_vs_sft_benchmarks.png" alt="Base vs SFT benchmark comparison" width="80%">

+### Benchmarks (Base vs SFT vs Qwen3-0.6B-Base)
+| Benchmark | Metric | Base | SFT | Delta | Random | Qwen3-0.6B |
+|-----------|--------|-----:|----:|------:|-------:|-----------:|
+| **HellaSwag** | acc_norm | 28.51 | 27.84 | -0.67 | 25.00 | 41.10 |
+| **ARC-Easy** | acc_norm | 41.04 | 36.11 | -4.93 | 25.00 | 65.60 |
+| **ARC-Challenge** | acc_norm | 22.27 | 24.15 | +1.88 | 25.00 | 33.90 |
+| **PIQA** | acc_norm | 58.87 | 54.52 | -4.35 | 50.00 | 70.00 |
+| **WinoGrande** | acc | 52.17 | 52.72 | +0.55 | 50.00 | 58.50 |
+| **BoolQ** | acc | 61.13 | 55.63 | -5.50 | 50.00 | 69.70 |
+| **MMLU-STEM** | acc (5-shot) | 25.28 | 28.42 | **+3.14** | 25.00 | — |
+| **LAMBADA** | acc | 15.35 | 7.01 | -8.34 | ~0 | — |
+| **OpenBookQA** | acc_norm | 29.00 | 26.80 | -2.20 | 25.00 | — |
+| **SciQ** | acc_norm | 61.20 | 52.70 | -8.50 | 25.00 | — |
+| **Mean** | | 39.48 | 36.59 | **-2.89** | | |
+**Context**: Qwen3-0.6B-Base was trained on ~36T tokens (3,600x our budget). On the 6 tasks with published Qwen3 scores, our SFT model reaches 55-90% of Qwen3 performance (lowest on ARC-Easy, highest on WinoGrande). SFT narrows the gap on ARC-Challenge (71% of Qwen3, up from 66% pre-SFT) and marginally on WinoGrande (90%, up from 89%); on the other four tasks the gap widens.
 <div align="center">
 <img src="figures/sft_base_vs_sft_benchmarks.png" alt="Base vs SFT benchmark comparison" width="80%">
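
The relative-performance figures quoted in the **Context** note can be recomputed directly from the table. The sketch below is a minimal illustration, not part of the committed README: the scores are copied from the six rows that have a Qwen3-0.6B value, and the names `SCORES`, `percent_of_reference`, and `relative_table` are hypothetical helpers introduced here.

```python
# Scores copied from the benchmark table in the diff above.
# Each entry: task -> (base, sft, qwen3_0_6b_base).
# Only the six tasks with a published Qwen3-0.6B score are included.
SCORES = {
    "HellaSwag": (28.51, 27.84, 41.10),
    "ARC-Easy": (41.04, 36.11, 65.60),
    "ARC-Challenge": (22.27, 24.15, 33.90),
    "PIQA": (58.87, 54.52, 70.00),
    "WinoGrande": (52.17, 52.72, 58.50),
    "BoolQ": (61.13, 55.63, 69.70),
}


def percent_of_reference(score: float, reference: float) -> float:
    """Express `score` as a percentage of `reference`."""
    return 100.0 * score / reference


def relative_table(scores: dict) -> dict:
    """Map each task to (base %, SFT %) of the Qwen3-0.6B score."""
    return {
        task: (
            percent_of_reference(base, ref),
            percent_of_reference(sft, ref),
        )
        for task, (base, sft, ref) in scores.items()
    }


if __name__ == "__main__":
    for task, (base_pct, sft_pct) in relative_table(SCORES).items():
        # A positive change means SFT moved the model closer to Qwen3.
        change = sft_pct - base_pct
        print(f"{task:<14} base {base_pct:5.1f}%  SFT {sft_pct:5.1f}%  ({change:+.1f})")
```

Running the script prints one row per task, which makes it easy to check any quoted range or per-task claim against the table rather than trusting a hand calculation.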