tylerxdurden committed on
Commit 3f0c8ee · verified · 1 Parent(s): 4ac133d

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +17 -15
README.md CHANGED
@@ -190,21 +190,23 @@ SFT modifies *what* is computed, not *how*: the routing mechanism (which activat

  All benchmarks via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-eval) v0.4.11, 0-shot unless noted.

- ### Benchmarks (Base vs SFT)
-
- | Benchmark | Metric | Base | SFT | Delta | Random |
- |-----------|--------|-----:|----:|------:|-------:|
- | **HellaSwag** | acc_norm | 28.51 | 27.84 | -0.67 | 25.00 |
- | **ARC-Easy** | acc_norm | 41.04 | 36.11 | -4.93 | 25.00 |
- | **ARC-Challenge** | acc_norm | 22.27 | 24.15 | +1.88 | 25.00 |
- | **PIQA** | acc_norm | 58.87 | 54.52 | -4.35 | 50.00 |
- | **WinoGrande** | acc | 52.17 | 52.72 | +0.55 | 50.00 |
- | **BoolQ** | acc | 61.13 | 55.63 | -5.50 | 50.00 |
- | **MMLU-STEM** | acc (5-shot) | 25.28 | 28.42 | **+3.14** | 25.00 |
- | **LAMBADA** | acc | 15.35 | 7.01 | -8.34 | ~0 |
- | **OpenBookQA** | acc_norm | 29.00 | 26.80 | -2.20 | 25.00 |
- | **SciQ** | acc_norm | 61.20 | 52.70 | -8.50 | 25.00 |
- | **Mean** | | 39.48 | 36.59 | **-2.89** | |

  <div align="center">
  <img src="figures/sft_base_vs_sft_benchmarks.png" alt="Base vs SFT benchmark comparison" width="80%">

  All benchmarks via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-eval) v0.4.11, 0-shot unless noted.

+ ### Benchmarks (Base vs SFT vs Qwen3-0.6B-Base)
+
+ | Benchmark | Metric | Base | SFT | Delta | Random | Qwen3-0.6B |
+ |-----------|--------|-----:|----:|------:|-------:|-----------:|
+ | **HellaSwag** | acc_norm | 28.51 | 27.84 | -0.67 | 25.00 | 41.10 |
+ | **ARC-Easy** | acc_norm | 41.04 | 36.11 | -4.93 | 25.00 | 65.60 |
+ | **ARC-Challenge** | acc_norm | 22.27 | 24.15 | +1.88 | 25.00 | 33.90 |
+ | **PIQA** | acc_norm | 58.87 | 54.52 | -4.35 | 50.00 | 70.00 |
+ | **WinoGrande** | acc | 52.17 | 52.72 | +0.55 | 50.00 | 58.50 |
+ | **BoolQ** | acc | 61.13 | 55.63 | -5.50 | 50.00 | 69.70 |
+ | **MMLU-STEM** | acc (5-shot) | 25.28 | 28.42 | **+3.14** | 25.00 | — |
+ | **LAMBADA** | acc | 15.35 | 7.01 | -8.34 | ~0 | — |
+ | **OpenBookQA** | acc_norm | 29.00 | 26.80 | -2.20 | 25.00 | — |
+ | **SciQ** | acc_norm | 61.20 | 52.70 | -8.50 | 25.00 | — |
+ | **Mean** | | 39.48 | 36.59 | **-2.89** | | |
+
+ **Context**: Qwen3-0.6B-Base was trained on ~36T tokens (3,600x our budget). On the 6 tasks with published Qwen3 scores, our SFT model achieves 55-90% of Qwen3 performance (lowest on ARC-Easy, highest on WinoGrande). SFT narrows the gap on reasoning tasks like ARC-Challenge (71% of Qwen3, up from 66% pre-SFT).

  <div align="center">
  <img src="figures/sft_base_vs_sft_benchmarks.png" alt="Base vs SFT benchmark comparison" width="80%">
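
The percentages quoted in the **Context** note are each model's score divided by the Qwen3-0.6B score on the same task. Below is a minimal Python sketch of that arithmetic, using only the six table rows that have a Qwen3 value. The names `SCORES`, `percent_of_reference`, and `relative_table` are hypothetical helpers introduced here for illustration; they are not part of the README or of lm-evaluation-harness.

```python
# Scores copied from the benchmark table in the diff: (base, sft, qwen3).
# Only the six tasks with a Qwen3-0.6B entry are included.
SCORES = {
    "HellaSwag": (28.51, 27.84, 41.10),
    "ARC-Easy": (41.04, 36.11, 65.60),
    "ARC-Challenge": (22.27, 24.15, 33.90),
    "PIQA": (58.87, 54.52, 70.00),
    "WinoGrande": (52.17, 52.72, 58.50),
    "BoolQ": (61.13, 55.63, 69.70),
}


def percent_of_reference(score: float, reference: float) -> float:
    """Return `score` as a percentage of `reference`."""
    return 100.0 * score / reference


def relative_table(scores: dict) -> dict:
    """Map each task to (base % of Qwen3, sft % of Qwen3)."""
    return {
        task: (
            percent_of_reference(base, qwen),
            percent_of_reference(sft, qwen),
        )
        for task, (base, sft, qwen) in scores.items()
    }


if __name__ == "__main__":
    # Print one row per task so the ratios can be compared with the note.
    for task, (base_pct, sft_pct) in relative_table(SCORES).items():
        print(f"{task:<14} base {base_pct:5.1f}%   sft {sft_pct:5.1f}%")
```

For ARC-Challenge this gives roughly 66% for Base and 71% for SFT, which matches the figures quoted in the note.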