GLM-5.2-504B — REAP keep-168, Router-KD recovered (NVFP4)
A 34%-expert-pruned GLM-5.2 that holds parity with the full unpruned model on a well-powered real-world eval — recovered by training only the router gates (0.016% of params), and quantized to NVFP4 for 8×B200-class serving.
This is the flagship cut of the GLM-5.2 REAP series — the largest and highest-fidelity, a strict superset of the earlier (now-retired) 481B / 469B cuts.
📄 Full technical report:
REPORT.md— the complete study (methods, the n=50→n=2000 correction, the significance stats, the GGUF DSA-indexer surgery, negative results, and the free sampler fix).
🙏 Sponsor
All pruning, distillation, and evaluation ran on 8× NVIDIA B200 generously sponsored by Lambda. Thank you, Lambda. 🙏
What it is
GLM-5.2 is a GlmMoeDsaForCausalLM MoE — 78 layers (3 dense + 75 MoE) + 1 MTP layer,
256 routed experts per layer (top-8) + 1 shared expert, DeepSeek-style MLA attention with a
DSA sparse "indexer," hidden size 6144.
This model keeps 168 of the 256 routed experts per layer (≈504B params, down from ~744–763B),
consistently across every MoE layer and the MTP layer (n_routed_experts: 168), so it loads and
serves cleanly in vLLM.
| Prune method | REAP — saliency = gate_weight × ‖expert_output‖, top-168 kept per layer |
| Recovery | Router-KD — gate-only knowledge distillation to the unpruned teacher |
| Quantization | NVFP4 (modelopt) on routed experts; BF16 router / attention / shared expert |
| Params | ~504B (34% of routed experts pruned) |
How it was recovered (Router-KD)
Pruning experts damages routing. Instead of expensive full fine-tuning, we freeze the entire network — experts, attention, embeddings — and train only the 75 router gate matrices (~0.016% of all parameters) to KL-match the unpruned GLM-5.2 teacher's next-token distribution (plain uniform position weighting, lr 5e-5). The gates re-learn to route the surviving 168 experts the way the full model would. This is cheap, fast, and — crucially — it does not touch the model's knowledge, only how it's addressed.
Evaluation — measured honestly, at scale
We evaluate the behavior that actually matters for agent use: does it terminate, or fall into
repeat / </think>-restart loops? Probes are 2000 real held-out prompts harvested from genuine
coding-agent traces (codex, opencode, cursor, claude-code), raw sampling, no max_tokens, no
timeout — loops are detected, never truncated.
| metric (n=2000, held-out, raw) | this model | unpruned teacher |
|---|---|---|
| attractor / loop rate | 0.072 | 0.036 |
| natural-EOS rate | 0.928 | 0.965 |
| output diversity (distinct-4) | 0.880 | 0.921 |
| median output length | 1267 | 1207 |
Honest accounting of the cost (n=2000)
At small n (50 probes) this looked like parity with the teacher. Scaled to 2000 probes, it isn't: the unpruned teacher loops on 3.6% of prompts, this model on 7.2% — so REAP pruning + gate-only Router-KD roughly doubles the loop rate (a statistically significant gap, z≈5, p<0.0001). Router-KD recovers the routing but cannot fully restore the termination behavior carried by the pruned experts, and we confirmed it's a ceiling — a knowledge-augmented expert set and 6× more KD data both failed to close the gap (8.6% with full-data floor KD, i.e. slightly worse). The extra loops concentrate on agentic sources (codex/opencode ~13% here vs ~6% on the teacher). About 1% of prompts loop on the teacher too (intrinsic); the other ~3.6 points are the price of pruning.
A knowledge-recovery LoRA (separate, optional adapter) is being trained to add capacity back and close this gap — see the model index. The pruned model remains a strong ~34%-smaller option; just know it trades ~3.6 points of termination robustness for the
Recommended serving — recover most of the loop gap for free
Two no-retrain knobs, measured at n=2000:
Anti-loop (recommended): a sampler guardrail. Measured at n=2000:
min_p=0.05, repetition_penalty=1.05→ loop 4.9% (gentle, safe default)min_p=0.05, repetition_penalty=1.10→ loop 2.3% — fully recovers the pruning-induced looping (raw is 7.2%; the unpruned teacher is 3.6% raw), with distinct-4 0.95.
So the "pruning ~doubles looping" cost is entirely recoverable at serving time, for free. Start at
1.05; go to1.10if you see loops — a higher repetition penalty trades a little risk of over-penalizing legitimate repetition (e.g. in code) for near-zero looping.Conciseness: a brevity system prompt — "Be concise. Think only as much as the task needs, then answer and stop." — halves median length (1267 → 507 tokens). Note it does not reduce looping (that's the sampler's job); combine the two for short, low-loop output.
GGUF builds
For llama.cpp / CPU / Metal, see 0xSero/GLM-5.2-REAP-504B-GGUF — BF16 + dynamic Q4 / Q3 / Q2 (with a custom patch to make GLM-5.2's shared-indexer attention loadable in llama.cpp).
REAP expert-pruning + gate-only Router-KD recovery. Compute sponsored by Lambda — thank you. 🙏
- Downloads last month
- 237
Model tree for 0xSero/GLM-5.2-504B-W4A16
Base model
zai-org/GLM-5.2