How to use banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v19-grpo with PEFT:

    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-122B-A10B")
    model = PeftModel.from_pretrained(base_model, "banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v19-grpo")

Qwen3.5-122B-A10B-Banya-Tuned-v19-grpo
Option D3 (online RLVR, i.e. reinforcement learning with verifiable rewards): GRPO trained with a real pytest reward signal.
- init: v10 LoRA (30% Pass@1 baseline)
- trainer: TRL GRPOTrainer (Group Relative Policy Optimization)
- rollout: HF model.generate (k=8 per task, T=1.0)
- reward: swebench harness pytest (0=apply_fail / 0.3=apply_ok+test_fail / 1.0=resolved)
- corpus: SWE-bench-Lite 270 train pool (no leakage with stratified-30 eval)
- hyperparams: β KL=0.05, ε clip=0.2, lr=5e-6, 100 steps, k=8
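
The three reward tiers listed above can be illustrated with a minimal sketch. The function name `patch_reward` and its two boolean inputs are hypothetical, chosen only for illustration. How the swebench harness actually reports patch application and test results is not shown in this card, so this covers only the mapping from outcome to scalar reward.

```python
def patch_reward(patch_applied: bool, tests_passed: bool) -> float:
    """Map one rollout's outcome to the tiered reward described in this card.

    Hypothetical helper: the inputs stand in for whatever the harness reports.
    """
    if not patch_applied:
        return 0.0   # apply_fail: the generated patch did not apply
    if not tests_passed:
        return 0.3   # apply_ok + test_fail: partial credit for a patch that applies
    return 1.0       # resolved: patch applies and the tests pass


# One group of k=8 rollouts for a single task (outcomes invented for illustration).
outcomes = [(False, False), (True, False), (True, True), (True, False),
            (False, False), (True, False), (True, False), (False, False)]
group_rewards = [patch_reward(applied, passed) for applied, passed in outcomes]
print(group_rewards)
```

The intermediate 0.3 tier gives the policy a gradient signal even when no rollout in a group fully resolves the task, as long as some patches apply and others do not.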
This is the first online-RL variant in the v-series. Previous attempts (v15 with 156-pair DPO, v16 with 362-pair DPO at β=0.05, v17 with D1.5 test-DPO) hit the offline ceiling: outcome convergence left too few valid pairs. GRPO bypasses the pair requirement entirely by using a group-relative advantage.
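
The group-relative advantage can be sketched as follows: each rollout's reward is standardized against the other rollouts sampled for the same task, so no chosen/rejected pairs are needed. This is a minimal illustration of the standard GRPO formulation, not the trainer's code. The use of the population standard deviation and the `eps` value are assumptions, and the reward values are invented.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Standardize rewards within one group of rollouts for the same task.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes yields positive and negative advantages.
mixed = [0.0, 0.0, 0.3, 0.3, 0.3, 1.0, 0.0, 0.3]
mixed_adv = group_relative_advantages(mixed)

# A group where every rollout has the same outcome yields no learning signal.
uniform = [0.3] * 8
uniform_adv = group_relative_advantages(uniform)

print([round(a, 2) for a in mixed_adv])
print(uniform_adv)
```

Note that a group whose rollouts all land on the same reward contributes zero advantage, so outcome convergence reduces the signal for GRPO as well; the difference is that any within-group spread is usable without constructing explicit pairs.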
See SFT method doc.
Model tree for banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v19-grpo
Base model
Qwen/Qwen3.5-122B-A10B