Qwen2.5-7B-Instruct LoRA distill on s1K oracle traces (8 epochs)

LoRA adapter that distills qwen3-235b-a22b oracle reasoning traces (clean inference, no attack) on s1K-1.1 questions into Qwen2.5-7B-Instruct.

Recipe

field	value
student	`Qwen/Qwen2.5-7B-Instruct`
teacher	`qwen3-235b-a22b` (clean inference via OpenRouter, no V3 attack, no ICL)
dataset	`Chia-Mu-Lab/s1k-qwen3-235b-oracle-traces` (963 rows)
finetune	LoRA (r=8, alpha=16, target_modules=all-linear)
cutoff_len	16384
lr	1e-5, cosine, warmup_ratio=0.1
epochs	8
eff. batch	12 (1 × grad_accum 3 × 4 × H200)
save_steps	every 21 (1 ckpt / 0.33 epoch)
steps/epoch	63.9 (1000 Qs packed to 759 rows / batch 12)

Per-checkpoint MATH500

Evaluated via SGLang with max_new_tokens=12288.

checkpoint	epoch	MATH500
0	0.00	73.20% (base Qwen2.5-7B-Instruct)
21	0.33	73.40%
42	0.66	72.40%
63	0.99	75.00% PEAK (+1.80pp)
84	1.31	73.80%
105	1.64	72.40%
126	1.97	71.00%
147	2.30	72.00%
168	2.63	72.20%
189	2.96	72.40%
210	3.29	71.80%
231	3.62	73.00%
252	3.94	71.20%
273	4.27	72.40%
294	4.60	71.20%
315	4.93	72.40%
336	5.26	71.20%
357	5.59	70.60%
378	5.92	71.60%
399	6.24	72.80%
420	6.57	71.40%
441	6.90	71.00%
462	7.23	72.80%
483	7.56	73.00%
504	7.89	70.00%
512	8.00	70.40% (final)

Overfitting note

Peak accuracy is at checkpoint-63 (epoch ≈ 1.0), and the final 8-epoch checkpoint UNDERPERFORMS the peak by 4.6pp. Downstream users should grab checkpoint-63, not the final checkpoint. This is consistent with the s1 paper observation that 1 epoch on s1K is sufficient.

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto", device_map="auto")
tok  = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Peak checkpoint (recommended)
model = PeftModel.from_pretrained(base, "Chia-Mu-Lab/qwen2.5-7b-s1k-qwen3-235b-oracle-lora-8ep", subfolder="checkpoint-63")

Note: `checkpoint-0` not included

The repo does not include checkpoint-0 — it is the bare Qwen2.5-7B-Instruct base model with no adapter weights. Use the base model directly if you want the 73.20% baseline.

Citation

s1K-1.1 question pool from the s1 paper:

@article{muennighoff2025s1,
  title  = {s1: Simple test-time scaling},
  author = {Muennighoff, Niklas and Yang, Zitong and Shi, Weijia and others},
  journal= {arXiv:2501.19393},
  year   = {2025}
}

The reasoning traces in this distill are fresh from qwen3-235b-a22b (not the s1K-1.1 published traces).

Downloads last month: -

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Chia-Mu-Lab/qwen2.5-7b-s1k-qwen3-235b-oracle-lora-8ep

Base model

Qwen/Qwen2.5-7B

Finetuned

Qwen/Qwen2.5-7B-Instruct

Adapter

(2183)

this model

Dataset used to train Chia-Mu-Lab/qwen2.5-7b-s1k-qwen3-235b-oracle-lora-8ep

Paper for Chia-Mu-Lab/qwen2.5-7b-s1k-qwen3-235b-oracle-lora-8ep

s1: Simple test-time scaling

Paper • 2501.19393 • Published Jan 31, 2025 • 126