Pruning GLM-5.2 to 504B with REAP + Router-KD: an honest, well-powered study of what it costs, and a free fix
A technical report on expert-pruning GLM-5.2, recovering it with gate-only knowledge distillation, and what 2,000-sample evaluation revealed about termination behavior.
Compute: 8× NVIDIA B200, sponsored by Lambda. Models: 0xSero on HuggingFace.
Abstract
We prune GLM-5.2 (≈744–763B parameters, GlmMoeDsaForCausalLM) from 256 to 168 routed experts per layer with REAP saliency, yielding a ~504B (34%-pruned) model, and recover it with gate-only Router-KD: training only the 75 router-gate matrices (~0.016% of parameters) to KL-match the unpruned teacher. We release the model in NVFP4 and GGUF.
Our central finding is methodological as much as it is about the model. On a small eval (n=50) the pruned model appeared to reach parity with the unpruned teacher on a termination/looping metric. Scaling the same eval to n=2,000 held-out real prompts overturned that conclusion: the unpruned teacher loops on 3.6% of prompts, the pruned model on 7.2%, so pruning roughly doubles the loop rate (two-proportion z≈5.0, p<0.0001). Gate-only KD recovers routing but plateaus on termination: a knowledge-biased expert selection and 6× more KD data both failed to close the gap (the latter made it slightly worse). However, a free, no-retraining sampler guardrail (min_p=0.05, repetition_penalty=1.10) drops the pruned model to 2.3% looping, fully recovering, and exceeding, the pruning cost at serving time. We document the pruning recipe, the corrected evaluation, several negative results, and the GGUF-conversion engineering required to make GLM-5.2's shared sparse-attention indexer loadable in llama.cpp.
1. Background
GLM-5.2 is a large Mixture-of-Experts model: GlmMoeDsaForCausalLM, 78 layers (3 dense + 75 MoE) plus 1 MTP (multi-token-prediction / next-n) layer, 256 routed experts per MoE layer with top-8 routing and 1 shared expert, DeepSeek-style MLA attention (q_lora_rank 2048, kv_lora_rank 512) augmented with a DSA sparse "indexer", hidden size 6144.
REAP (Router-weighted Expert Activation Pruning) scores each expert by saliency = gate_weight × ‖expert_output‖ over a calibration set, and keeps the top-K per layer. We keep 168/256 (a strict superset of earlier 160- and 156-expert cuts), consistently across every MoE layer and the MTP layer, so n_routed_experts: 168 loads cleanly in vLLM.
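The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the released pruning code: the function name `reap_keep_indices`, the record layout, and the choice to average the score over the tokens routed to each expert are all assumptions made for the example.

```python
import math
from collections import defaultdict

def reap_keep_indices(records, keep):
    """Rank experts in one layer by mean(gate_weight * ||expert_output||).

    records: iterable of (expert_id, gate_weight, expert_output_vector),
             one entry per (token, selected expert) pair seen during
             calibration. This layout is hypothetical.
    keep:    number of experts to retain in the layer.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate_weight, output in records:
        norm = math.sqrt(sum(x * x for x in output))  # L2 norm of the output
        totals[expert_id] += gate_weight * norm
        counts[expert_id] += 1
    # Assumption: average over routed tokens. Experts that were never
    # routed to have no records and are therefore never kept.
    saliency = {e: totals[e] / counts[e] for e in totals}
    ranked = sorted(saliency, key=saliency.get, reverse=True)
    return sorted(ranked[:keep]), saliency

# Toy calibration data: three experts, keep the top two.
records = [
    (0, 0.5, [3.0, 4.0]),
    (1, 0.1, [3.0, 4.0]),
    (2, 0.9, [0.0, 1.0]),
    (0, 0.3, [0.0, 0.0]),
]
kept, saliency = reap_keep_indices(records, keep=2)
print(kept, saliency)
```

In the report's setting the same ranking would be run once per MoE layer (and the MTP layer) with `keep=168`.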
The objective was not raw accuracy but agent-grade behavior: does the pruned model terminate, or fall into repeat / </think>-restart loops? Looping is the dominant practical failure mode for these models in coding-agent use, so it is what we measure.
2. Methods
2.1 Pruning + recovery
- Prune: REAP saliency → top-168 experts/layer (incl. MTP), `n_routed_experts: 168`. ~504B params.
- Quantize: NVFP4 (NVIDIA modelopt) on routed experts; BF16 router / attention / shared expert.
- Recover (Router-KD): freeze the entire network; train only the 75 router-gate matrices (~0.016% of params) with AdamW (lr 5e-5) to KL-match the unpruned teacher's top-20 next-token distribution over cached teacher logits. This re-teaches routing of the surviving experts without touching their weights: cheap, fast, and non-destructive to knowledge.
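The Router-KD objective can be sketched per token position as a KL divergence over the teacher's cached top-k support. This is a simplified stand-in under stated assumptions: the function names are hypothetical, and renormalizing both distributions over the teacher's top-k token ids is one plausible reading of "top-20 KL-matching", not a detail the report confirms.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a list of logits."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def topk_kl(teacher_ids, teacher_logits, student_logits):
    """KL(teacher || student) restricted to the teacher's cached top-k tokens.

    teacher_ids:    token ids of the teacher's top-k entries (k=20 in the report)
    teacher_logits: the teacher's logits for those ids, in the same order
    student_logits: the student's full-vocabulary logits for this position
    Assumption: both sides are renormalized over the top-k support.
    """
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax([student_logits[i] for i in teacher_ids])
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

# Toy position: vocabulary of 8 tokens, teacher top-2 cached.
student = [0.0] * 8
student[5], student[2] = 0.0, 1.0   # student prefers token 2
print(topk_kl([5, 2], [1.0, 0.0], student))  # positive: distributions disagree
```

In a training loop, this loss would be averaged over positions and backpropagated with gradients enabled only on the router-gate matrices, leaving every other parameter frozen.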
2.2 Evaluation
- Probes: real held-out prompts harvested from the author's own coding-agent traces (codex, opencode-cli, cursor, claude-code), the actual target distribution.
- Protocol: raw sampling, no `max_tokens`, no timeout. Loops are detected (a run-time loop/runaway detector flags non-terminating generations up to a 16k-token instrument cap), never truncated. Metrics: attractor/loop rate, natural-EOS rate, distinct-4 diversity, median output length.
- The key methodological choice: we scaled from n=50 to n=2,000 probes. At n=50 a "win" is 4-vs-5 failures: pure noise. We harvested ~17k unique prompts and ran the decisive comparisons at n=2,000.
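To see why 4-vs-5 failures at n=50 is noise, it helps to put an interval around each rate. The sketch below uses the Wilson score interval; the report does not say which interval method it used, and the n=2,000 counts (72 and 144) are reconstructed from the rounded 3.6% and 7.2% rates, so the printed bounds need not match the published table exactly.

```python
import math

def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% for z=1.96)."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# (failures, n): the n=50 "win", then counts implied by the rounded n=2,000 rates.
for failures, n in [(4, 50), (5, 50), (72, 2000), (144, 2000)]:
    lo, hi = wilson_interval(failures, n)
    print(f"{failures}/{n}: {failures / n:.1%}  95% CI [{lo:.1%}, {hi:.1%}]")
```

At n=50 the two intervals overlap almost entirely, so neither ordering of the models is supported; at n=2,000 the intervals separate.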
3. Results
3.1 The leaderboard (n=2,000, raw sampling)
| model / config | loop rate | 95% CI (%) | EOS rate | distinct-4 | median tokens |
|---|---|---|---|---|---|
| unpruned teacher (≈744B) | 3.6% | [2.8–4.5] | 0.965 | 0.921 | 1207 |
| floor keep-168, 3k-KD (shipped) | 7.2% | [6.1–8.4] | 0.928 | 0.880 | 1267 |
| K-cut keep-168-K, 3k-KD | 7.6% | [6.5–8.8] | 0.925 | 0.886 | 1230 |
| K-cut keep-168-K, full 18.6k-KD | 7.8% | [6.7–9.1] | 0.924 | 0.881 | 1232 |
| floor keep-168, full 18.6k-KD | 8.6% | [7.4–9.9] | 0.915 | 0.872 | 1301 |
3.2 Pruning roughly doubles the loop rate
The pruned floor (7.2%) vs the unpruned teacher (3.6%) is a statistically significant gap: two-proportion z ≈ 5.0, p < 0.0001. The earlier n=50 "parity" was an artifact of a noisy small-sample teacher estimate (a 10% loop rate at n=50 vs 3.6% at n=2,000). Honest conclusion: REAP pruning costs ~3.6 points of termination robustness. Only ~1% of prompts (19 of 2,000) loop on every variant including the teacher (truly intrinsic); the ~3.6-point gap between teacher and pruned model is pruning-induced.
The cost concentrates on the hardest sources: codex and opencode-cli loop ~13% on the pruned model vs ~6% on the teacher; cursor-composer/cursor-chat stay low (1–3%) on both.
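The significance claim can be checked with a pooled two-proportion z-test. This is a minimal sketch: the counts 144/2000 and 72/2000 are reconstructed from the rounded rates rather than taken from raw logs, and the helper name is hypothetical.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Pruned floor vs unpruned teacher at n=2,000 (counts implied by 7.2% and 3.6%).
z, p = two_proportion_z(144, 2000, 72, 2000)
print(f"n=2000: z={z:.2f}, p={p:.2g}")

# The same test on a 4-vs-5 split at n=50 has no power to separate the models.
z_small, p_small = two_proportion_z(4, 50, 5, 50)
print(f"n=50:   z={z_small:.2f}, p={p_small:.2g}")
```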
3.3 Gate-only KD plateaus
Two interventions aimed at closing the gap both failed:
- Knowledge-biased experts (the "K-cut"): replacing 8 experts/layer with the highest-saliency knowledge/reasoning experts that coding-saliency pruning drops. Result: no improvement (7.6–7.8%, within noise).
- 6× more KD data (18,599 vs 2,999 calibration traces): slightly worse (floor 7.2→8.6%, K-cut 7.6→7.8%). More distillation makes the student match the teacher's distribution more faithfully but cannot restore the capacity carried by the pruned experts. Gate-only KD has a ceiling.
3.4 The free fix: a sampler guardrail
What gate-KD could not do, a serving-time sampler knob does, for free (n=2,000, on the shipped floor):
| guardrail | loop rate |
|---|---|
| none (raw) | 7.2% |
| `min_p=0.05, rep_pen=1.05` | 4.9% (p=0.002 vs raw) |
| `min_p=0.10, rep_pen=1.10` | 3.3% |
| `min_p=0.05, rep_pen=1.10` | 2.3% |
A light repetition penalty (1.10) takes the pruned model to 2.3%, below the teacher's raw 3.6%, and with distinct-4 rising to 0.95. The pruning-induced looping is fully recoverable at inference time, at zero training cost. (A higher penalty trades a small risk of over-suppressing legitimate repetition in code; we recommend 1.05 by default, 1.10 if loops appear.) Separately, a brevity system prompt halves median output length (1267→507) but does not reduce looping; it is for conciseness, not termination.
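For readers unfamiliar with the two knobs, the sketch below shows what they do to a single step's logits, following the formulation commonly used by open-source samplers (divide positive logits and multiply negative ones for already-generated tokens; drop tokens whose probability falls below `min_p` times the top probability). It is an illustration only: the serving engine's exact semantics and ordering may differ, and the function names are hypothetical.

```python
import math

def apply_repetition_penalty(logits, generated_ids, penalty=1.10):
    """Down-weight tokens that already appeared in the output."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive logits shrink, negative logits grow more negative.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def min_p_filter(logits, min_p=0.05):
    """Zero out tokens with prob < min_p * max prob, then renormalize."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

# Toy step: token 0 was just generated and is about to be repeated.
logits = [2.0, 1.0, 0.0, -6.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0], penalty=1.10)
probs = min_p_filter(penalized, min_p=0.05)
print(penalized, probs)
```

The penalty lowers the chance of re-emitting a recent token, while `min_p` removes the low-probability tail that can push a sampled generation off course.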
4. GGUF conversion: making GLM-5.2's DSA indexer loadable
GLM-5.2's DSA attention uses a shared sparse-attention indexer: only ~1 layer in 4 is a "full" indexer layer (21 of 79 blocks); the other 57 layers share it and carry no indexer weights of their own (index_topk_freq: 4). Stock llama.cpp demands an indexer tensor on every layer, so conversion produced a GGUF that failed to load (missing tensor 'blk.3.indexer.k_norm.weight').
Fix: we filled each of the 57 shared layers' indexer tensors (285 tensors total) by duplicating them from the nearest preceding full layer. The patched GGUF loads and generates coherently. A second snag, the quantizer rejecting the MTP layer (blk.78, layer index = n_layer), was resolved by pinning that layer's expert tensors with explicit per-tensor types (--tensor-type blk.78.*=q6_K), preserving MTP (self-speculative decoding) in every quant. Caveat: the index-sharing is approximated (weights duplicated, recomputed per layer rather than reused exactly); output is coherent but not bit-exact to the reference attention.
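The duplication step amounts to building a copy plan from each shared layer to the nearest preceding full layer. The sketch below shows only that planning logic, under loud assumptions: the helper name is hypothetical, the 8-layer layout with full layers at 0 and 4 is a toy and not GLM-5.2's real layout, and `indexer.k_norm.weight` is the only tensor suffix taken from the report (from the load error); the other indexer tensor names are not listed there.

```python
def plan_indexer_copies(n_layers, full_layers, suffixes):
    """Map each missing indexer tensor name to the tensor it is copied from.

    n_layers:    number of blocks to cover
    full_layers: indices of layers that carry their own indexer weights
    suffixes:    per-layer indexer tensor suffixes to duplicate
    """
    full = set(full_layers)
    plan = {}
    source = None
    for layer in range(n_layers):
        if layer in full:
            source = layer          # this layer owns real indexer weights
            continue
        if source is None:
            raise ValueError(f"layer {layer} has no preceding full indexer layer")
        for suffix in suffixes:
            plan[f"blk.{layer}.{suffix}"] = f"blk.{source}.{suffix}"
    return plan

# Toy layout (hypothetical): 8 layers, full indexer on layers 0 and 4.
plan = plan_indexer_copies(8, full_layers=[0, 4], suffixes=["indexer.k_norm.weight"])
for dst, src in plan.items():
    print(f"{dst} <- {src}")
```

A converter would then write each destination tensor with the source tensor's data, which is why the result is an approximation of true index sharing rather than an exact reproduction.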
We release BF16 + dynamic Q4_K_XL / Q3_K_XL / Q2_K_XL (per-tensor precision: attention/shared-expert/embeddings/output kept high, routed experts pushed low). Imatrix calibration was not applied (GPU-time constraints).
5. Negative results (reported for honesty)
- Knowledge-augmented expert selection did not reduce looping.
- 6× more KD data did not help (slightly hurt).
- A knowledge-recovery LoRA (the intuitively right lever: add capacity back, distilled from the teacher) was attempted 6× and did not train: the 933 GB BF16 model needs ~170 GiB/GPU just to load (uneven MoE packing), leaving no room on GPU0 for the LoRA forward (a 15.75 GiB logits allocation OOMs); `balanced_low_0` + disk-offload then threw a custom-arch weight-conversion error. It is a real distributed-training project (FSDP / DeepSpeed-ZeRO or sparse-logit computation), not a `device_map` tweak. Left as future work.
6. Limitations
- The eval measures termination/looping on one author's agent-trace distribution, not general capability or knowledge benchmarks. Pruned-vs-teacher knowledge differences are not quantified here.
- The loop detector's 16k-token instrument cap means borderline-long generations are classified as loops; ~86% of flagged loops are variant-specific (near the cutoff), only ~1–2% are a stable intrinsic core.
- GGUF index-sharing is approximated (§4). Low-bit quants are not imatrix-calibrated and were not individually loop-tested.
- Teacher-with-guardrail was not measured (the pod was returned), so "2.3% beats the teacher" is vs the teacher's raw 3.6%.
7. Released artifacts
- `0xSero/GLM-5.2-504B`: the recommended model (keep-168, NVFP4). Serve with `min_p=0.05, repetition_penalty=1.05–1.10`.
- `0xSero/GLM-5.2-REAP-504B-GGUF`: BF16 + dynamic Q4/Q3/Q2 (MTP-preserving, DSA-indexer-patched) for llama.cpp.
- `0xSero/GLM-5.2-504B-K`: knowledge-augmented variant (full-data KD).
- `0xSero/GLM-5.2-504B-FullKD`: full-data plain-KD variant (ablation).
8. Conclusion
REAP + gate-only Router-KD produces a usable, 34%-smaller GLM-5.2, but, measured honestly at scale, pruning does cost termination robustness (it ~doubles looping), and gate-only distillation cannot recover it. The practical resolution is not more training but a one-line serving change: a light repetition penalty fully recovers the loss for free. The broader lesson is methodological: small-n behavioral evals on rare-event metrics are dangerously noisy. The difference between "beats the teacher" and "loops twice as much" was entirely a sample-size artifact.
*Pruning, distillation, evaluation, and analysis on 8× NVIDIA B200 sponsored by Lambda.*