Text Generation
PEFT
Safetensors
lora
grpo
swe-bench
code
banya
rlvr
dense-reward
checkpoint
conversational
How to use banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v21-grpo-ckpt64 with PEFT:
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-122B-A10B")
    model = PeftModel.from_pretrained(base_model, "banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v21-grpo-ckpt64")
Qwen3.5-122B-A10B-Banya-Tuned-v21-grpo-ckpt64
Intermediate checkpoint of the v21 training run (step 64 of 80, 80% complete), saved for capability-peak analysis.
Context
v21 = v10 (Masked SFT) initialization + dense-reward GRPO + a task pool of 270. This checkpoint captures the policy state immediately after Burst 2 of REAL PASS hits:
- step 53-61 (9 steps): 6 REAL PASS hits (peak density in run)
- step 62-65 (cooldown): no hits (model reorganizing)
- step 64 = end of cooldown, before next training phase
Hypothesis: ckpt-64 may generalize better than the final ckpt-80 if the last 16 steps overfit to the training pool. Compare it side by side with Qwen3.5-122B-A10B-Banya-Tuned-v21-grpo (the final ckpt-80).
Training stats up to step 64
- REAL PASS rate: 22/64 = 34.4% (vs final 24/80 = 30.0%)
- train_loss (smoothed): ~0.08
- KL band: 0.04-0.07 (stable)
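The two pass-rate figures above also imply how the last 16 steps performed on their own. A minimal sketch of that arithmetic, assuming the rate is simply REAL PASS hits divided by training steps (as the fractions above suggest); the variable and function names are illustrative, not from the training code:

```python
# Counts taken from the stats above. Reading them as "hits per step"
# is an assumption based on the fractions 22/64 and 24/80.
hits_at_64, steps_at_64 = 22, 64
hits_at_80, steps_at_80 = 24, 80


def pass_rate(hits: int, steps: int) -> float:
    """REAL PASS hits per training step, as a percentage."""
    return 100.0 * hits / steps


rate_64 = pass_rate(hits_at_64, steps_at_64)
rate_80 = pass_rate(hits_at_80, steps_at_80)

# Hits and steps that fall after this checkpoint (steps 65-80).
tail_hits = hits_at_80 - hits_at_64
tail_steps = steps_at_80 - steps_at_64
rate_tail = pass_rate(tail_hits, tail_steps)

print(f"up to step 64: {rate_64:.1f}%")
print(f"up to step 80: {rate_80:.1f}%")
print(f"steps 65-80:   {rate_tail:.1f}% ({tail_hits}/{tail_steps})")
```

Under that reading, the cumulative rate falls from 34.4% to 30.0% because only 2 of the last 16 steps registered a hit, which is the reason the two checkpoints are worth comparing directly.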
Use case
Run a smoke comparison of ckpt-64 against ckpt-80 to check whether the final 16 steps gained or lost capability. Reference: docs/sft-dense-grpo.md §11 (9-way matrix).
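One possible shape for such a smoke comparison is sketched below. It is not the procedure from docs/sft-dense-grpo.md. The org prefix on the ckpt-80 repo ID, the prompt set, and the `passes` check are all assumptions, and `load_both_adapters`, `make_generator`, and `smoke_compare` are hypothetical helper names. The sketch loads the base model once and attaches both LoRA adapters to it, so the two checkpoints can be swapped without reloading the base weights.

```python
from typing import Callable, Dict, List

BASE = "Qwen/Qwen3.5-122B-A10B"
CKPT64 = "banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v21-grpo-ckpt64"
# Assumption: the final checkpoint lives under the same org prefix.
CKPT80 = "banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v21-grpo"


def load_both_adapters():
    """Load the base model once and attach both LoRA adapters to it."""
    # Imported lazily so smoke_compare() can be used without these packages.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    base = AutoModelForCausalLM.from_pretrained(
        BASE, device_map="auto", torch_dtype="auto"
    )
    model = PeftModel.from_pretrained(base, CKPT64, adapter_name="ckpt64")
    model.load_adapter(CKPT80, adapter_name="ckpt80")
    return model, tokenizer


def make_generator(model, tokenizer, adapter_name: str,
                   max_new_tokens: int = 256) -> Callable[[str], str]:
    """Return a prompt -> completion function bound to one adapter."""
    def generate(prompt: str) -> str:
        model.set_adapter(adapter_name)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        # Greedy decoding keeps the two checkpoints directly comparable.
        output = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)
    return generate


def smoke_compare(prompts: List[str],
                  generators: Dict[str, Callable[[str], str]],
                  passes: Callable[[str, str], bool]) -> Dict[str, int]:
    """Count, per checkpoint, how many prompts satisfy the pass check."""
    return {
        name: sum(1 for p in prompts if passes(p, generate(p)))
        for name, generate in generators.items()
    }
```

A typical call would build `generators = {"ckpt64": make_generator(model, tokenizer, "ckpt64"), "ckpt80": make_generator(model, tokenizer, "ckpt80")}` and pass a held-out prompt list plus a task-specific `passes` function. Prompts drawn from outside the training pool are the relevant ones for the overfitting hypothesis.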
Model tree for banyaaiofficial/Qwen3.5-122B-A10B-Banya-Tuned-v21-grpo-ckpt64
Base model
Qwen/Qwen3.5-122B-A10B