How to use banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v20-grpo-ckpt80 with PEFT:

    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Load the base model, then attach this LoRA adapter on top of it.
    base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-122B-A10B")
    model = PeftModel.from_pretrained(base_model, "banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v20-grpo-ckpt80")
metadata
license: apache-2.0
base_model: Qwen/Qwen3.5-122B-A10B
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- peft
- grpo
- swe-bench
- code
- banya
- rlvr
- dense-reward
- ablation
Qwen3.5-122B-A10B-Banya-Tuned-v20-grpo-ckpt80
Checkpoint at step 80 of v20 GRPO training (intermediate snapshot).
- init: v5 LoRA (mix corpus, 30% Pass@1 baseline)
- trainer: TRL GRPOTrainer
- rollout: HF model.generate (k=8, T=1.0)
- reward: dense [0, 1.0] = parse 0.05 + grep 0.05 + file 0.10 + func 0.10 + harness 0.30/0.70
- MoE safeguards: output_router_logits + aux loss + explicit router freeze
- corpus: SWE-bench-Lite 50-task train pool (subset of 270 non-eval)
- hyperparams: β=0.1 (KL coefficient), ε=0.2 (clip range), lr=1e-6; snapshot taken at step 80 of a 100-step run
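The reward and rollout bullets above can be sketched as follows. This is a minimal illustration, not the training code: the weights are copied from the reward bullet, but the field names, the `RolloutChecks` container, and the reading of "harness 0.30/0.70" as partial versus full credit are all assumptions. The advantage function is the standard group-relative normalization used by GRPO, written in plain Python rather than taken from TRL.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Weights copied from the reward bullet; the key names are illustrative.
WEIGHTS = {"parse": 0.05, "grep": 0.05, "file": 0.10, "func": 0.10}
# ASSUMPTION: "harness 0.30/0.70" is read as partial vs. full harness credit.
HARNESS_PARTIAL = 0.30
HARNESS_FULL = 0.70


@dataclass
class RolloutChecks:
    """Hypothetical per-rollout signals; not the authors' schema."""
    parse: bool = False
    grep: bool = False
    file: bool = False
    func: bool = False
    harness: str = "fail"  # one of "fail", "partial", "full"


def dense_reward(checks: RolloutChecks) -> float:
    """Sum the shaping terms, then add the harness term; result lies in [0, 1]."""
    reward = sum((w for name, w in WEIGHTS.items() if getattr(checks, name)), 0.0)
    if checks.harness == "full":
        reward += HARNESS_FULL
    elif checks.harness == "partial":
        reward += HARNESS_PARTIAL
    return round(reward, 2)


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one prompt's group of rollouts (k=8 in this run)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this reading, a rollout that passes every check scores 0.30 + 0.70 = 1.0. Note that a group in which all rollouts earn the same reward yields all-zero advantages, so partial credit gives the update something to rank even when no rollout passes the harness.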
30-task smoke result: 7/30 = 23.3% Pass@1 (the same rate as the step-100 final).
Specialization finding: this checkpoint and the step-100 final each pass 7 tasks but share only 3 of them. Together they cover 7 + 7 - 3 = 11 of 30 tasks, or 36.7% (oracle ensemble).
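The oracle-ensemble figure is a set union over the two checkpoints' passing tasks. The sketch below reproduces the arithmetic with placeholder task IDs, since the card reports only the counts and not which tasks passed; `oracle_coverage` is a hypothetical helper name.

```python
# Placeholder IDs chosen only to match the reported counts (7, 7, 3 shared).
ckpt80_pass = {"t01", "t02", "t03", "t04", "t05", "t06", "t07"}
final_pass = {"t05", "t06", "t07", "t08", "t09", "t10", "t11"}
N_TASKS = 30


def oracle_coverage(*pass_sets: set, n_tasks: int) -> tuple[int, float]:
    """Count tasks solved by at least one checkpoint and the resulting rate."""
    union = set().union(*pass_sets)
    return len(union), len(union) / n_tasks


shared = ckpt80_pass & final_pass
covered, rate = oracle_coverage(ckpt80_pass, final_pass, n_tasks=N_TASKS)
print(f"shared={len(shared)} covered={covered}/{N_TASKS} ({rate:.1%})")
# prints: shared=3 covered=11/30 (36.7%)
```

The oracle figure assumes a perfect selector that picks, per task, whichever checkpoint passes, so it is an upper bound on what combining the two checkpoints could achieve.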
See the companion repo Qwen3.5-122B-A10B-Banya-Tuned-v20-grpo for the step-100 final.
v20 training journey
See the Banya SFT method doc for the full v5 → v20 → v21 pipeline and ablation context.