Upload README.md with huggingface_hub
README.md
CHANGED
@@ -190,21 +190,23 @@ SFT modifies *what* is computed, not *how*: the routing mechanism (which activat

All benchmarks via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-eval) v0.4.11, 0-shot unless noted.

-### Benchmarks (Base vs SFT)
-
-| Benchmark | Metric | Base | SFT | Delta | Random |
-|-----------|--------|-----:|----:|------:|-------:|
-| **HellaSwag** | acc_norm | 28.51 | 27.84 | -0.67 | 25.00 |
-| **ARC-Easy** | acc_norm | 41.04 | 36.11 | -4.93 | 25.00 |
-| **ARC-Challenge** | acc_norm | 22.27 | 24.15 | +1.88 | 25.00 |
-| **PIQA** | acc_norm | 58.87 | 54.52 | -4.35 | 50.00 |
-| **WinoGrande** | acc | 52.17 | 52.72 | +0.55 | 50.00 |
-| **BoolQ** | acc | 61.13 | 55.63 | -5.50 | 50.00 |
-| **MMLU-STEM** | acc (5-shot) | 25.28 | 28.42 | **+3.14** | 25.00 |
-| **LAMBADA** | acc | 15.35 | 7.01 | -8.34 | ~0 |
-| **OpenBookQA** | acc_norm | 29.00 | 26.80 | -2.20 | 25.00 |
-| **SciQ** | acc_norm | 61.20 | 52.70 | -8.50 | 25.00 |
-| **Mean** | | 39.48 | 36.59 | **-2.89** | |

<div align="center">
<img src="figures/sft_base_vs_sft_benchmarks.png" alt="Base vs SFT benchmark comparison" width="80%">

All benchmarks via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-eval) v0.4.11, 0-shot unless noted.

+### Benchmarks (Base vs SFT vs Qwen3-0.6B-Base)
+
+| Benchmark | Metric | Base | SFT | Delta | Random | Qwen3-0.6B |
+|-----------|--------|-----:|----:|------:|-------:|-----------:|
+| **HellaSwag** | acc_norm | 28.51 | 27.84 | -0.67 | 25.00 | 41.10 |
+| **ARC-Easy** | acc_norm | 41.04 | 36.11 | -4.93 | 25.00 | 65.60 |
+| **ARC-Challenge** | acc_norm | 22.27 | 24.15 | +1.88 | 25.00 | 33.90 |
+| **PIQA** | acc_norm | 58.87 | 54.52 | -4.35 | 50.00 | 70.00 |
+| **WinoGrande** | acc | 52.17 | 52.72 | +0.55 | 50.00 | 58.50 |
+| **BoolQ** | acc | 61.13 | 55.63 | -5.50 | 50.00 | 69.70 |
+| **MMLU-STEM** | acc (5-shot) | 25.28 | 28.42 | **+3.14** | 25.00 | n/a |
+| **LAMBADA** | acc | 15.35 | 7.01 | -8.34 | ~0 | n/a |
+| **OpenBookQA** | acc_norm | 29.00 | 26.80 | -2.20 | 25.00 | n/a |
+| **SciQ** | acc_norm | 61.20 | 52.70 | -8.50 | 25.00 | n/a |
+| **Mean** | | 39.48 | 36.59 | **-2.89** | | |
+
+**Context**: Qwen3-0.6B-Base was trained on ~36T tokens (3,600x our budget). On the 6 tasks with published Qwen3 scores, our SFT model reaches 55-90% of Qwen3's score (ARC-Easy lowest, WinoGrande highest). SFT narrows the gap on ARC-Challenge (71% of Qwen3, up from 66% pre-SFT) and marginally on WinoGrande (90%, up from 89%); on the other four tasks the gap widens.

<div align="center">
<img src="figures/sft_base_vs_sft_benchmarks.png" alt="Base vs SFT benchmark comparison" width="80%">
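The share-of-Qwen3 figures quoted in the Context note can be recomputed directly from the table. The sketch below is illustrative only: the `SCORES` dictionary and the helper names are hypothetical, the values are copied from the six rows that have a Qwen3-0.6B entry, and "share" is assumed to mean the plain ratio of scores with no correction for the random baseline.

```python
# Scores (percent) copied from the benchmark table; only the six tasks
# with a published Qwen3-0.6B score are included.
SCORES = {
    "HellaSwag":     {"base": 28.51, "sft": 27.84, "qwen3": 41.10},
    "ARC-Easy":      {"base": 41.04, "sft": 36.11, "qwen3": 65.60},
    "ARC-Challenge": {"base": 22.27, "sft": 24.15, "qwen3": 33.90},
    "PIQA":          {"base": 58.87, "sft": 54.52, "qwen3": 70.00},
    "WinoGrande":    {"base": 52.17, "sft": 52.72, "qwen3": 58.50},
    "BoolQ":         {"base": 61.13, "sft": 55.63, "qwen3": 69.70},
}


def share_of_reference(score: float, reference: float) -> float:
    """Return `score` as a percentage of `reference` (assumed: plain ratio)."""
    return 100.0 * score / reference


def shares(scores: dict) -> dict:
    """Compute base and SFT shares of the Qwen3 score for every task."""
    return {
        task: {
            "base": share_of_reference(row["base"], row["qwen3"]),
            "sft": share_of_reference(row["sft"], row["qwen3"]),
        }
        for task, row in scores.items()
    }


if __name__ == "__main__":
    for task, row in shares(SCORES).items():
        print(f"{task:14s} base {row['base']:5.1f}%   sft {row['sft']:5.1f}%")
```

A plain ratio ignores the chance floor in the Random column, so a chance-adjusted comparison would give a different picture, especially on tasks where scores sit near 25% or 50%.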
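The evaluation setup stated in the diff ("lm-evaluation-harness v0.4.11, 0-shot unless noted", with MMLU-STEM at 5-shot) could be reproduced with two `lm_eval` invocations. The following is a minimal sketch under stated assumptions: the README does not give the exact task identifiers, so the names below are the harness's usual ones and `lambada_openai` in particular is a guess at which LAMBADA variant was used. The checkpoint path, output directory, and helper name are hypothetical.

```python
import shlex

# Assumed harness task names; the README lists benchmarks, not task IDs.
ZERO_SHOT_TASKS = [
    "hellaswag", "arc_easy", "arc_challenge", "piqa", "winogrande",
    "boolq", "lambada_openai", "openbookqa", "sciq",
]
FIVE_SHOT_TASKS = ["mmlu_stem"]  # the table marks MMLU-STEM as 5-shot


def build_lm_eval_command(model_path, tasks, num_fewshot, output_path="results"):
    """Build the argv list for one lm_eval run (hypothetical helper)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--output_path", output_path,
    ]


if __name__ == "__main__":
    model = "path/to/checkpoint"  # hypothetical placeholder
    for tasks, shots in ((ZERO_SHOT_TASKS, 0), (FIVE_SHOT_TASKS, 5)):
        # Print the commands; pass the list to subprocess.run to execute them.
        print(shlex.join(build_lm_eval_command(model, tasks, shots)))
```

Splitting the run by shot count keeps the few-shot setting explicit per task group, which matches how the table annotates only MMLU-STEM as 5-shot.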