Pruning GLM-5.2 to 504B with REAP + Router-KD: an honest, well-powered study of what it costs — and a free fix

A technical report on expert-pruning GLM-5.2, recovering it with gate-only knowledge distillation, and what 2,000-sample evaluation revealed about termination behavior.

Compute: 8× NVIDIA B200, sponsored by Lambda. Models: 0xSero on HuggingFace.

Abstract

We prune GLM-5.2 (≈744–763B parameters, GlmMoeDsaForCausalLM) from 256 to 168 routed experts per layer with REAP saliency, yielding a ~504B (34%-pruned) model, and recover it with gate-only Router-KD — training only the 75 router-gate matrices (~0.016% of parameters) to KL-match the unpruned teacher. We release the model in NVFP4 and GGUF.

Our central finding is as much methodological as it is about the model. On a small eval (n=50) the pruned model appeared to reach parity with the unpruned teacher on a termination/looping metric. Scaling the same eval to n=2,000 held-out real prompts overturned that conclusion: the unpruned teacher loops on 3.6% of prompts and the pruned model on 7.2%, so pruning roughly doubles the loop rate (two-proportion z≈5.0, p<0.0001). Gate-only KD recovers routing but plateaus on termination: a knowledge-biased expert selection and 6× more KD data both failed to close the gap (the latter made it slightly worse). However, a free, no-retraining sampler guardrail (min_p=0.05, repetition_penalty=1.10) drops the pruned model to 2.3% looping, below the teacher's raw 3.6%, which more than recovers the pruning cost at serving time. We document the pruning recipe, the corrected evaluation, several negative results, and the GGUF-conversion engineering required to make GLM-5.2's shared sparse-attention indexer loadable in llama.cpp.

1. Background

GLM-5.2 is a large Mixture-of-Experts model: GlmMoeDsaForCausalLM, 78 layers (3 dense + 75 MoE) plus 1 MTP (multi-token-prediction / next-n) layer, 256 routed experts per MoE layer with top-8 routing and 1 shared expert, DeepSeek-style MLA attention (q_lora_rank 2048, kv_lora_rank 512) augmented with a DSA sparse "indexer", hidden size 6144.

REAP (Router-weighted Expert Activation Pruning) scores each expert by saliency = gate_weight × ‖expert_output‖ over a calibration set, and keeps the top-K per layer. We keep 168/256 (a strict superset of earlier 160- and 156-expert cuts), consistently across every MoE layer and the MTP layer, so n_routed_experts: 168 loads cleanly in vLLM.
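The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for the release: the function name `reap_keep`, the record layout, and the choice to average over the tokens actually routed to each expert are assumptions made for the example.

```python
from collections import defaultdict
import math

def reap_keep(records, keep):
    """Rank the experts of one MoE layer by saliency and keep the top `keep`.

    records: iterable of (expert_id, gate_weight, expert_output_vector), one
    entry per (token, routed expert) pair seen on the calibration set.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for expert_id, gate_weight, output in records:
        norm = math.sqrt(sum(x * x for x in output))  # L2 norm of the expert output
        total[expert_id] += gate_weight * norm
        count[expert_id] += 1
    # Assumption: saliency is the mean over tokens routed to the expert.
    saliency = {e: total[e] / count[e] for e in total}
    ranked = sorted(saliency, key=saliency.get, reverse=True)
    return sorted(ranked[:keep]), saliency

# Toy layer with 4 experts; keep the 2 most salient.
records = [
    (0, 0.50, [3.0, 4.0]),
    (1, 0.10, [0.3, 0.4]),
    (2, 0.40, [6.0, 8.0]),
    (3, 0.20, [0.0, 1.0]),
    (0, 0.30, [3.0, 4.0]),
]
kept, saliency = reap_keep(records, keep=2)
print(kept, saliency)
```

In this sketch an expert that is never routed on the calibration set gets no score at all and is therefore never kept; a real implementation would need to decide how to treat that case.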

The objective was not raw accuracy but agent-grade behavior: does the pruned model terminate, or fall into repeat / </think>-restart loops? Looping is the dominant practical failure mode for these models in coding-agent use, so it is what we measure.

2. Methods

2.1 Pruning + recovery

Prune: REAP saliency → top-168 experts/layer (incl. MTP), n_routed_experts: 168. ~504B params.
Quantize: NVFP4 (NVIDIA modelopt) on routed experts; BF16 router / attention / shared expert.
Recover (Router-KD): freeze the entire network; train only the 75 router-gate matrices (~0.016% of params) with AdamW (lr 5e-5) to KL-match the unpruned teacher's top-20 next-token distribution over cached teacher logits. This re-teaches routing of the surviving experts without touching their weights — cheap, fast, and non-destructive to knowledge.
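The Router-KD objective can be illustrated with a plain-Python sketch of a top-k KL term for a single token position. The helper names and the decision to renormalize both distributions over the teacher's cached top-k token ids are assumptions for illustration; the report does not specify how the student distribution is restricted. In actual training this scalar would be averaged over positions and backpropagated into the router-gate matrices only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def topk_kl(teacher_topk, student_logits):
    """KL(teacher || student) restricted to the teacher's cached top-k tokens.

    teacher_topk: list of (token_id, teacher_logit), e.g. the cached top-20.
    student_logits: full-vocabulary logits produced by the pruned student.
    Assumption: both sides are renormalized over the teacher's top-k ids.
    """
    ids = [tok for tok, _ in teacher_topk]
    t_logp = log_softmax([logit for _, logit in teacher_topk])
    s_logp = log_softmax([student_logits[tok] for tok in ids])
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

# Toy vocabulary of 6 tokens and a cached top-3 from the teacher.
teacher_topk = [(2, 3.0), (0, 1.0), (5, 0.5)]
student_match = [1.0, -4.0, 3.0, -4.0, -4.0, 0.5]  # agrees with the teacher
student_drift = [3.0, -4.0, 1.0, -4.0, -4.0, 0.5]  # prefers token 0 instead

kl_same = topk_kl(teacher_topk, student_match)
kl_diff = topk_kl(teacher_topk, student_drift)
print(kl_same, kl_diff)
```

Caching only the top-k teacher logits keeps the distillation data small, at the cost of ignoring probability mass the teacher places outside that set.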

2.2 Evaluation

Probes: real held-out prompts harvested from the author's own coding-agent traces (codex, opencode-cli, cursor, claude-code) — the actual target distribution.
Protocol: raw sampling, with no max_tokens limit and no timeout. Runaway generations are detected rather than silently truncated: a run-time loop/runaway detector flags any generation that has not terminated by a 16k-token instrument cap and counts it as a loop. Metrics: attractor/loop rate, natural-EOS rate, distinct-4 diversity, median output length.
The key methodological choice: we scaled from n=50 → n=2,000 probes. At n=50 a "win" is 4-vs-5 failures — pure noise. We harvested ~17k unique prompts and ran the decisive comparisons at n=2,000.
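To make the sample-size argument concrete, the sketch below compares 95% intervals for a rare-event rate at n=50 and n=2,000. It uses the Wilson score interval, which is an assumption: the report does not say which interval method it used. The counts are also illustrative reconstructions (5 of 50 for the small-sample teacher estimate, and 72 of 2,000 as 3.6% of the large run).

```python
import math

def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

small = wilson_interval(5, 50)      # assumed count: 5 loops in 50 probes
large = wilson_interval(72, 2000)   # assumed count: 3.6% of 2,000 probes

for name, (lo, hi) in [("n=50", small), ("n=2000", large)]:
    print(f"{name}: [{lo:.3f}, {hi:.3f}]  width={hi - lo:.3f}")
```

The n=50 interval spans well over ten percentage points, so a one-failure difference between two models cannot separate them; at n=2,000 the interval is under two points wide.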

3. Results

3.1 The leaderboard (n=2,000, raw sampling)

model / config	loop rate	95% CI (%)	natural-EOS rate	distinct-4	median output tokens
unpruned teacher (≈744B)	3.6%	[2.8–4.5]	0.965	0.921	1207
floor keep-168, 3k-KD (shipped)	7.2%	[6.1–8.4]	0.928	0.880	1267
K-cut keep-168-K, 3k-KD	7.6%	[6.5–8.8]	0.925	0.886	1230
K-cut keep-168-K, full 18.6k-KD	7.8%	[6.7–9.1]	0.924	0.881	1232
floor keep-168, full 18.6k-KD	8.6%	[7.4–9.9]	0.915	0.872	1301
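
For readers unfamiliar with the distinct-4 column, a minimal sketch of the usual distinct-n definition follows: the number of unique n-grams divided by the total number of n-grams. Whitespace tokenization and per-output scoring are assumptions here; the report does not state how its metric was tokenized or aggregated.

```python
def distinct_n(tokens, n=4):
    """Fraction of n-grams in `tokens` that are unique (1.0 = no repetition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

varied = "the parser reads each line and emits one token per field".split()
looping = ("retry the build " * 6).split()

print(distinct_n(varied), distinct_n(looping))
```

A generation stuck in a short cycle reuses the same few 4-grams, so its score collapses toward zero, which is why the metric tracks the loop rate in the table.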

3.2 Pruning roughly doubles the loop rate

The gap between the pruned floor (7.2%) and the unpruned teacher (3.6%) is statistically significant: two-proportion z ≈ 5.0, p < 0.0001. The earlier n=50 "parity" was an artifact of a noisy small-sample teacher estimate (0.10 at n=50 vs 0.036 at n=2,000). Honest conclusion: REAP pruning costs ~3.6 points of termination robustness. Only ~1% of prompts (19 of 2,000) loop on every variant including the teacher, and are in that sense intrinsic; the ~3.6-point gap over the teacher is pruning-induced.
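
The significance figure can be checked with a pooled two-proportion z-test. The sketch below assumes raw counts of 144 and 72 loops out of 2,000 (the reported 7.2% and 3.6%); the exact counts are not given in the report, so the result is approximate.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic and two-sided normal p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return z, p_value

# Assumed counts: 7.2% and 3.6% of 2,000 probes.
z, p = two_proportion_z(144, 2000, 72, 2000)
print(f"z={z:.2f}, p={p:.1e}")
```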

The cost concentrates on the hardest sources: codex and opencode-cli loop ~13% on the pruned model vs ~6% on the teacher; cursor-composer/cursor-chat stay low (1–3%) on both.

3.3 Gate-only KD plateaus

Two interventions aimed at closing the gap both failed:

Knowledge-biased experts (the "K-cut"): replacing 8 experts/layer with the highest-saliency knowledge/reasoning experts that coding-saliency pruning drops — no improvement (7.6–7.8%, within noise).
6× more KD data (18,599 vs 2,999 calibration traces): slightly worse (floor 7.2→8.6%, K-cut 7.6→7.8%), although the 95% intervals in §3.1 overlap. Our reading is that more distillation makes the student match the teacher's distribution more faithfully but cannot restore the capacity carried by the pruned experts. Gate-only KD has a ceiling.

3.4 The free fix: a sampler guardrail

What gate-KD could not do, a serving-time sampler knob does — for free (n=2,000, on the shipped floor):

guardrail	loop rate
none (raw)	7.2%
`min_p=0.05, rep_pen=1.05`	4.9% (p=0.002 vs raw)
`min_p=0.10, rep_pen=1.10`	3.3%
`min_p=0.05, rep_pen=1.10`	2.3%

A light repetition penalty (1.10, with min_p=0.05) takes the pruned model to 2.3%, below the teacher's raw 3.6%, with distinct-4 rising to 0.95. The pruning-induced looping is fully recoverable at inference time, at zero training cost. (A higher penalty carries a small risk of over-suppressing legitimate repetition in code; we recommend 1.05 by default, 1.10 if loops appear.) Separately, a brevity system prompt cuts median output length by more than half (1267→507) but does not reduce looping: it is for conciseness, not termination.
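
To show what the two knobs do to a next-token distribution, here is a plain-Python sketch of a multiplicative repetition penalty followed by min-p filtering. It follows the commonly used definitions (divide positive logits of already-generated tokens by the penalty, multiply negative ones; drop tokens whose probability is below min_p times the top probability). The function names and the order of operations are assumptions, and serving engines may differ in details.

```python
import math

def apply_repetition_penalty(logits, generated_ids, penalty):
    """Down-weight every token id that already appears in the output."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive logits shrink; negative logits become more negative.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def min_p_filter(logits, min_p):
    """Zero out tokens with probability below min_p * max probability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

# Toy 4-token vocabulary; token 0 has already been emitted three times.
logits = [4.0, 3.9, 0.5, -2.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 0, 0], penalty=1.10)
probs = min_p_filter(penalized, min_p=0.05)
print(penalized, probs)
```

In this toy case the penalty is enough to move the repeated token out of first place, and min-p removes the low-probability tail that would otherwise let sampling wander into degenerate continuations.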

4. GGUF conversion: making GLM-5.2's DSA indexer loadable

GLM-5.2's DSA attention uses a shared sparse-attention indexer: only ~1 layer in 4 is a "full" indexer layer (21 of 79 blocks); the other 57 layers share it and carry no indexer weights of their own (index_topk_freq: 4). Stock llama.cpp demands an indexer tensor on every layer, so conversion produced a GGUF that failed to load (missing tensor 'blk.3.indexer.k_norm.weight').

Fix: we filled each of the 57 shared layers' indexer tensors (285 tensors total) by duplicating them from the nearest preceding full layer. The patched GGUF loads and generates coherently. A second snag — the quantizer rejecting the MTP layer (blk.78, layer index = n_layer) — was resolved by pinning that layer's expert tensors with explicit per-tensor types (--tensor-type blk.78.*=q6_K), preserving MTP (self-speculative decoding) in every quant. Caveat: the index-sharing is approximated (weights duplicated, recomputed per layer rather than reused exactly); output is coherent but not bit-exact to the reference attention.
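
The duplication step can be planned as a list of (source, destination) tensor names before touching any weights. The sketch below is illustrative only: the function name, the toy layer layout, and the single-suffix list are assumptions. Only `indexer.k_norm.weight` is a tensor name taken from the load error above; the other indexer tensor names are not listed in this report and are deliberately not guessed here.

```python
def plan_indexer_copies(n_layers, full_layers, suffixes):
    """Map each shared layer's indexer tensors to the nearest preceding full layer.

    Returns a list of (source_tensor_name, destination_tensor_name) pairs.
    """
    plan = []
    source = None
    for layer in range(n_layers):
        if layer in full_layers:
            source = layer  # this layer owns real indexer weights
            continue
        if source is None:
            raise ValueError(f"layer {layer} has no preceding full indexer layer")
        for suffix in suffixes:
            plan.append((f"blk.{source}.{suffix}", f"blk.{layer}.{suffix}"))
    return plan

# Hypothetical toy layout: 8 layers, full indexer layers at 0 and 4.
plan = plan_indexer_copies(
    n_layers=8,
    full_layers={0, 4},
    suffixes=["indexer.k_norm.weight"],
)
for src, dst in plan:
    print(f"{src} -> {dst}")
```

With five indexer tensors per layer, the same plan over 57 shared layers would yield the 285 copies mentioned above.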

We release BF16 + dynamic Q4_K_XL / Q3_K_XL / Q2_K_XL (per-tensor precision: attention/shared-expert/embeddings/output kept high, routed experts pushed low). Imatrix calibration was not applied (GPU-time constraints).

5. Negative results (reported for honesty)

Knowledge-augmented expert selection did not reduce looping.
6× more KD data did not help (slightly hurt).
A knowledge-recovery LoRA (the intuitively right lever — add capacity back, distilled from the teacher) was attempted 6× and did not train: the 933 GB BF16 model needs ~170 GiB/GPU just to load (uneven MoE packing), leaving no room on GPU0 for the LoRA forward (a 15.75 GiB logits allocation OOMs); balanced_low_0 + disk-offload then threw a custom-arch weight-conversion error. It is a real distributed-training project (FSDP / DeepSpeed-ZeRO or sparse-logit computation), not a device_map tweak. Left as future work.

6. Limitations

The eval measures termination/looping on one author's agent-trace distribution, not general capability or knowledge benchmarks. Pruned-vs-teacher knowledge differences are not quantified here.
The loop detector's 16k-token instrument cap means borderline-long generations are classified as loops; ~86% of flagged loops are variant-specific (near the cutoff), and only ~1–2% of prompts form a stable intrinsic core.
GGUF index-sharing is approximated (§4). Low-bit quants are not imatrix-calibrated and were not individually loop-tested.
Teacher-with-guardrail was not measured (the pod was returned), so "2.3% beats the teacher" is vs the teacher's raw 3.6%.

7. Released artifacts

0xSero/GLM-5.2-504B — the recommended model (keep-168, NVFP4). Serve with min_p=0.05, repetition_penalty=1.05–1.10.
0xSero/GLM-5.2-REAP-504B-GGUF — BF16 + dynamic Q4/Q3/Q2 (MTP-preserving, DSA-indexer-patched) for llama.cpp.
0xSero/GLM-5.2-504B-K — knowledge-augmented variant (full-data KD).
0xSero/GLM-5.2-504B-FullKD — full-data plain-KD variant (ablation).

8. Conclusion

REAP + gate-only Router-KD produces a usable, 34%-smaller GLM-5.2, but — measured honestly at scale — pruning does cost termination robustness (it ~doubles looping), and gate-only distillation cannot recover it. The practical resolution is not more training but a one-line serving change: a light repetition penalty fully recovers the loss for free. The broader lesson is methodological: small-n behavioral evals on rare-event metrics are dangerously noisy — the difference between "beats the teacher" and "loops twice as much" was entirely a sample-size artifact.

*Pruning, distillation, evaluation, and analysis on 8× NVIDIA B200 sponsored by Lambda. 🙏*