Qwen3.5-122B-A10B-Banya-Tuned-v21-grpo-ckpt64

Intermediate checkpoint of v21 training (step 64 of 80, 80% of the run), saved for capability-peak analysis.

Context

v21 is initialized from v10 (masked SFT) and trained with dense-reward GRPO on a task pool of 270. This checkpoint captures the policy state shortly after Burst 2 of REAL PASS hits:

steps 53-61 (9 steps): 6 REAL PASS hits (peak density in the run)
steps 62-65 (cooldown): no hits (model reorganizing)
step 64 = late in the cooldown, before the next training phase
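
The window figures above can be turned into per-step hit densities. This is a minimal sketch that assumes each step yields at most one REAL PASS hit and that the step ranges are inclusive; the names `windows` and `hit_density` are illustrative and not part of the training code.

```python
# Hit counts per step window, copied from the list above.
# Each entry is (first_step, last_step, real_pass_hits), ranges inclusive.
windows = {
    "burst_2": (53, 61, 6),
    "cooldown": (62, 65, 0),
}


def hit_density(first_step: int, last_step: int, hits: int) -> float:
    """Fraction of steps in an inclusive window that produced a REAL PASS hit."""
    n_steps = last_step - first_step + 1
    return hits / n_steps


densities = {name: hit_density(*w) for name, w in windows.items()}
for name, density in densities.items():
    print(f"{name}: {density:.1%}")
```

Under these assumptions Burst 2 runs at roughly two hits every three steps, about twice the cumulative rate reported for the run up to step 64.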

Hypothesis: ckpt-64 may generalize better than the final ckpt-80 if the last 16 steps overfit to the training pool. Compare side by side with Qwen3.5-122B-A10B-Banya-Tuned-v21-grpo (the final ckpt-80).

Training stats up to step 64

REAL PASS rate: 22/64 = 34.4% (vs. 24/80 = 30.0% for the final checkpoint)
train_loss (smoothed): ~0.08
KL band: 0.04-0.07 (stable)
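
The two REAL PASS figures also imply a rate for the final stretch of training. The sketch below derives it by subtraction, assuming both counts are cumulative over the same run; the variable names are illustrative.

```python
# Cumulative REAL PASS counts reported in this card.
hits_at_64, steps_at_64 = 22, 64
hits_at_80, steps_at_80 = 24, 80

rate_64 = hits_at_64 / steps_at_64  # 0.34375
rate_80 = hits_at_80 / steps_at_80  # 0.30

# Implied by subtraction: hits and steps in the final stretch (steps 65-80).
tail_hits = hits_at_80 - hits_at_64     # 2
tail_steps = steps_at_80 - steps_at_64  # 16
tail_rate = tail_hits / tail_steps      # 0.125

print(f"up to step 64: {rate_64:.1%}")
print(f"up to step 80: {rate_80:.1%}")
print(f"steps 65-80:   {tail_rate:.1%}")
```

On these numbers the last 16 steps contribute 2 hits (12.5%). This is a training-time signal only and does not by itself say how either checkpoint generalizes.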

Use case

Smoke-compare ckpt-64 against ckpt-80 to check whether the final 16 steps gained or lost capability. Reference: docs/sft-dense-grpo.md §11 (9-way matrix).
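
One way to structure such a smoke comparison is sketched below. It is a hypothetical harness, not the procedure from docs/sft-dense-grpo.md: `smoke_compare`, `summarize`, the toy tasks, and the stub generators are all assumptions made for illustration. In practice each stub would be replaced by a function that sends the prompt to the corresponding checkpoint and returns its text output.

```python
from typing import Callable, Dict, List, Tuple

Generate = Callable[[str], str]  # prompt -> model output
Check = Callable[[str], bool]    # model output -> pass / fail
Task = Tuple[str, Check]         # (prompt, checker)


def smoke_compare(models: Dict[str, Generate], tasks: List[Task]) -> Dict[str, List[bool]]:
    """Run every task through every model; return per-task pass flags per model."""
    if not tasks:
        raise ValueError("need at least one task")
    return {
        name: [check(generate(prompt)) for prompt, check in tasks]
        for name, generate in models.items()
    }


def summarize(results: Dict[str, List[bool]], earlier: str, later: str) -> dict:
    """Pass rates plus the task indices gained or lost going from earlier to later."""
    a, b = results[earlier], results[later]
    return {
        "rate_earlier": sum(a) / len(a),
        "rate_later": sum(b) / len(b),
        "gained": [i for i, (x, y) in enumerate(zip(a, b)) if not x and y],
        "lost": [i for i, (x, y) in enumerate(zip(a, b)) if x and not y],
    }


# Toy tasks and stub generators, used only to exercise the harness.
# They are placeholders and say nothing about the real checkpoints.
tasks: List[Task] = [
    ("2+2=", lambda out: out.strip() == "4"),
    ("capital of France?", lambda out: "Paris" in out),
]
stub_a: Generate = lambda p: {"2+2=": "4", "capital of France?": "Paris"}.get(p, "")
stub_b: Generate = lambda p: {"2+2=": "4", "capital of France?": "Lyon"}.get(p, "")

results = smoke_compare({"ckpt-64": stub_a, "ckpt-80": stub_b}, tasks)
summary = summarize(results, "ckpt-64", "ckpt-80")
print(summary)
```

Keeping per-task flags rather than only aggregate rates makes it possible to list exactly which tasks were gained or lost between the two checkpoints. Because the hypothesis concerns overfitting to the training pool, tasks from outside that pool are presumably the informative ones.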

Model tree for banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v21-grpo-ckpt64

Base model: Qwen/Qwen3.5-122B-A10B
Relation to base model: adapter (this model)