GLM-5.2-504B — REAP keep-168, Router-KD recovered (NVFP4)

A 34%-expert-pruned GLM-5.2 that holds parity with the full unpruned model on a well-powered real-world eval — recovered by training only the router gates (0.016% of params), and quantized to NVFP4 for 8×B200-class serving.

This is the flagship cut of the GLM-5.2 REAP series — the largest and highest-fidelity, a strict superset of the earlier (now-retired) 481B / 469B cuts.

📄 Full technical report: REPORT.md — the complete study (methods, the n=50→n=2000 correction, the significance stats, the GGUF DSA-indexer surgery, negative results, and the free sampler fix).


🙏 Sponsor

All pruning, distillation, and evaluation ran on 8× NVIDIA B200 generously sponsored by Lambda. Thank you, Lambda. 🙏


What it is

GLM-5.2 is a GlmMoeDsaForCausalLM MoE — 78 layers (3 dense + 75 MoE) + 1 MTP layer, 256 routed experts per layer (top-8) + 1 shared expert, DeepSeek-style MLA attention with a DSA sparse "indexer," hidden size 6144.

This model keeps 168 of the 256 routed experts per layer (≈504B params, down from ~744–763B), consistently across every MoE layer and the MTP layer (n_routed_experts: 168), so it loads and serves cleanly in vLLM.

Prune method REAP — saliency = gate_weight × ‖expert_output‖, top-168 kept per layer
Recovery Router-KD — gate-only knowledge distillation to the unpruned teacher
Quantization NVFP4 (modelopt) on routed experts; BF16 router / attention / shared expert
Params ~504B (34% of routed experts pruned)

How it was recovered (Router-KD)

Pruning experts damages routing. Instead of expensive full fine-tuning, we freeze the entire network — experts, attention, embeddings — and train only the 75 router gate matrices (~0.016% of all parameters) to KL-match the unpruned GLM-5.2 teacher's next-token distribution (plain uniform position weighting, lr 5e-5). The gates re-learn to route the surviving 168 experts the way the full model would. This is cheap, fast, and — crucially — it does not touch the model's knowledge, only how it's addressed.

Evaluation — measured honestly, at scale

We evaluate the behavior that actually matters for agent use: does it terminate, or fall into repeat / </think>-restart loops? Probes are 2000 real held-out prompts harvested from genuine coding-agent traces (codex, opencode, cursor, claude-code), raw sampling, no max_tokens, no timeout — loops are detected, never truncated.

metric (n=2000, held-out, raw) this model unpruned teacher
attractor / loop rate 0.072 0.036
natural-EOS rate 0.928 0.965
output diversity (distinct-4) 0.880 0.921
median output length 1267 1207

Honest accounting of the cost (n=2000)

At small n (50 probes) this looked like parity with the teacher. Scaled to 2000 probes, it isn't: the unpruned teacher loops on 3.6% of prompts, this model on 7.2% — so REAP pruning + gate-only Router-KD roughly doubles the loop rate (a statistically significant gap, z≈5, p<0.0001). Router-KD recovers the routing but cannot fully restore the termination behavior carried by the pruned experts, and we confirmed it's a ceiling — a knowledge-augmented expert set and 6× more KD data both failed to close the gap (8.6% with full-data floor KD, i.e. slightly worse). The extra loops concentrate on agentic sources (codex/opencode ~13% here vs ~6% on the teacher). About 1% of prompts loop on the teacher too (intrinsic); the other ~3.6 points are the price of pruning.

A knowledge-recovery LoRA (separate, optional adapter) is being trained to add capacity back and close this gap — see the model index. The pruned model remains a strong ~34%-smaller option; just know it trades ~3.6 points of termination robustness for the

Recommended serving — recover most of the loop gap for free

Two no-retrain knobs, measured at n=2000:

  • Anti-loop (recommended): a sampler guardrail. Measured at n=2000:

    • min_p=0.05, repetition_penalty=1.05 → loop 4.9% (gentle, safe default)
    • min_p=0.05, repetition_penalty=1.10 → loop 2.3%fully recovers the pruning-induced looping (raw is 7.2%; the unpruned teacher is 3.6% raw), with distinct-4 0.95.

    So the "pruning ~doubles looping" cost is entirely recoverable at serving time, for free. Start at 1.05; go to 1.10 if you see loops — a higher repetition penalty trades a little risk of over-penalizing legitimate repetition (e.g. in code) for near-zero looping.

  • Conciseness: a brevity system prompt"Be concise. Think only as much as the task needs, then answer and stop."halves median length (1267 → 507 tokens). Note it does not reduce looping (that's the sampler's job); combine the two for short, low-loop output.

GGUF builds

For llama.cpp / CPU / Metal, see 0xSero/GLM-5.2-REAP-504B-GGUF — BF16 + dynamic Q4 / Q3 / Q2 (with a custom patch to make GLM-5.2's shared-indexer attention loadable in llama.cpp).


REAP expert-pruning + gate-only Router-KD recovery. Compute sponsored by Lambda — thank you. 🙏

Downloads last month
237
Safetensors
Model size
68B params
Tensor type
F32
·
I32
·
BF16
·
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 0xSero/GLM-5.2-504B-W4A16

Base model

zai-org/GLM-5.2
Quantized
(74)
this model

Collection including 0xSero/GLM-5.2-504B-W4A16