---
license: apache-2.0
base_model: Qwen/Qwen3.5-122B-A10B
library_name: peft
pipeline_tag: text-generation
tags: [lora, peft, grpo, swe-bench, code, banya, rlvr, dense-reward, ablation]
---

# Qwen3.5-122B-A10B-Banya-Tuned-v20-grpo-ckpt80

Checkpoint at step 80 of v20 GRPO training (intermediate snapshot).

- **init**: v5 LoRA (mix corpus, 30% Pass@1 baseline)
- **trainer**: TRL GRPOTrainer
- **rollout**: HF model.generate (k=8, T=1.0)
- **reward**: dense [0, 1.0] = parse 0.05 + grep 0.05 + file 0.10 + func 0.10 + harness 0.30/0.70
- **MoE safeguards**: output_router_logits + aux loss + explicit router freeze
- **corpus**: SWE-bench-Lite 50-task train pool (subset of 270 non-eval)
- **hyperparams**: β=0.1, ε=0.2, lr=1e-6, 80 steps (intermediate, full = 100)

**30-task smoke result**: 7/30 = 23.3% Pass@1 (same as final/step 100).

**Specialization finding**: this checkpoint and the step-100 final share only 3/7 PASS tasks. Together they cover 11/30 = 36.7% (oracle ensemble).

See companion repo [`Qwen3.5-122B-A10B-Banya-Tuned-v20-grpo`](https://huggingface.co/banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v20-grpo) for step-100 final.

## v20 training journey

See [Banya SFT method doc](https://github.com/kr-ai-dev-association/banya-framework/blob/main/agent-evaluation/docs/sft-dense-grpo.md) for full v5 → v20 → v21 pipeline + ablation context.
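
## Reward and ensemble arithmetic (illustrative sketch)

The dense reward and the oracle-ensemble figure quoted in this card can be checked with a few lines of Python. This is a minimal sketch, not the training code: the function names `dense_reward` and `oracle_coverage` are hypothetical, and reading "harness 0.30/0.70" as partial versus full harness credit is an assumption that the card itself does not confirm. Under that reading the maximum reward is 0.05 + 0.05 + 0.10 + 0.10 + 0.70 = 1.0, which matches the stated [0, 1.0] range.

```python
# Component weights copied from the "reward" line of this card.
WEIGHTS = {"parse": 0.05, "grep": 0.05, "file": 0.10, "func": 0.10}

# ASSUMPTION: "harness 0.30/0.70" means partial vs full harness credit.
HARNESS = {"none": 0.0, "partial": 0.30, "full": 0.70}


def dense_reward(parse: bool, grep: bool, file: bool, func: bool,
                 harness: str = "none") -> float:
    """Sum the shaped components that fired, plus the harness credit."""
    fired = {"parse": parse, "grep": grep, "file": file, "func": func}
    total = sum(weight for name, weight in WEIGHTS.items() if fired[name])
    total += HARNESS[harness]
    # Round to two decimals to hide floating-point noise in the sum.
    return round(total, 2)


def oracle_coverage(pass_a: int, pass_b: int, shared: int, total: int):
    """Tasks solved by at least one checkpoint (inclusion-exclusion)."""
    union = pass_a + pass_b - shared
    return union, union / total


if __name__ == "__main__":
    print(dense_reward(True, True, True, True, harness="full"))
    solved, rate = oracle_coverage(pass_a=7, pass_b=7, shared=3, total=30)
    print(f"{solved}/30 = {rate:.1%}")
```

The second function reproduces the card's numbers: each checkpoint passes 7 tasks and 3 are shared, so the union is 7 + 7 - 3 = 11 of 30 tasks. An oracle ensemble is an upper bound, since it assumes the better checkpoint is picked per task with hindsight.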