# Pruning GLM-5.2 to 504B with REAP + Router-KD: an honest, well-powered study of what it costs, and a free fix

**A technical report on expert-pruning GLM-5.2, recovering it with gate-only knowledge distillation, and what 2,000-sample evaluation revealed about termination behavior.**

*Compute: 8× NVIDIA B200, sponsored by [Lambda](https://lambda.ai). Models: [0xSero on HuggingFace](https://huggingface.co/0xSero).*

---

## Abstract

We prune **GLM-5.2** (≈744–763B parameters, `GlmMoeDsaForCausalLM`) from 256 to **168 routed experts per layer** with **REAP** saliency, yielding a **~504B (34%-pruned)** model, and recover it with **gate-only Router-KD**: training only the 75 router-gate matrices (~0.016% of parameters) to KL-match the unpruned teacher. We release the model in NVFP4 and GGUF.

Our central finding is methodological as much as it is about the model. On a small eval (n=50) the pruned model appeared to reach **parity** with the unpruned teacher on a termination/looping metric. Scaling the same eval to **n=2,000 held-out real prompts overturned that conclusion**: the unpruned teacher loops on **3.6%** of prompts, the pruned model on **7.2%**, so pruning **roughly doubles the loop rate** (two-proportion z≈5.0, p<0.0001). Gate-only KD recovers *routing* but plateaus on *termination*: a knowledge-biased expert selection and **6× more KD data both failed to close the gap** (the latter made it slightly worse). However, a **free, no-retraining sampler guardrail** (`min_p=0.05, repetition_penalty=1.10`) drops the pruned model to **2.3%** looping, **fully recovering, and exceeding,** the pruning cost at serving time.

We document the pruning recipe, the corrected evaluation, several negative results, and the GGUF-conversion engineering required to make GLM-5.2's shared sparse-attention indexer loadable in llama.cpp.

---

## 1. Background

**GLM-5.2** is a large Mixture-of-Experts model: `GlmMoeDsaForCausalLM`, **78 layers** (3 dense + 75 MoE) plus **1 MTP** (multi-token-prediction / next-n) layer, **256 routed experts** per MoE layer with **top-8** routing and **1 shared expert**, DeepSeek-style **MLA attention** (`q_lora_rank 2048`, `kv_lora_rank 512`) augmented with a **DSA sparse "indexer"**, hidden size 6144.

**REAP** (Router-weighted Expert Activation Pruning) scores each expert by saliency = `gate_weight × ‖expert_output‖` over a calibration set, and keeps the top-K per layer. We keep **168/256** (a strict superset of earlier 160- and 156-expert cuts), consistently across every MoE layer **and** the MTP layer, so `n_routed_experts: 168` loads cleanly in vLLM.

**The objective** was not raw accuracy but **agent-grade behavior**: does the pruned model *terminate*, or fall into repeat / ``-restart loops? Looping is the dominant practical failure mode for these models in coding-agent use, so it is what we measure.

---

## 2. Methods

### 2.1 Pruning + recovery

- **Prune:** REAP saliency → top-168 experts/layer (incl. MTP), `n_routed_experts: 168`. ~504B params.
- **Quantize:** NVFP4 (NVIDIA modelopt) on routed experts; BF16 router / attention / shared expert.
- **Recover (Router-KD):** freeze the entire network; train **only the 75 router-gate matrices** (~0.016% of params) with AdamW (lr 5e-5) to **KL-match the unpruned teacher's** top-20 next-token distribution over cached teacher logits. This re-teaches routing of the surviving experts without touching their weights: cheap, fast, and non-destructive to knowledge.

### 2.2 Evaluation

- **Probes:** real held-out prompts harvested from the author's own coding-agent traces (codex, opencode-cli, cursor, claude-code), which is the actual target distribution.
- **Protocol:** raw sampling, **no `max_tokens`, no timeout**.
  Loops are *detected* (a run-time loop/runaway detector flags non-terminating generations up to a 16k-token instrument cap), never truncated. Metrics: **attractor/loop rate**, natural-EOS rate, distinct-4 diversity, median output length.
- **The key methodological choice:** we scaled from **n=50 → n=2,000** probes. At n=50 a "win" is 4-vs-5 failures, which is pure noise. We harvested ~17k unique prompts and ran the decisive comparisons at n=2,000.

---

## 3. Results

### 3.1 The leaderboard (n=2,000, raw sampling)

| model / config | loop rate | 95% CI (%) | natural-EOS rate | distinct-4 | median tokens |
|---|---|---|---|---|---|
| **unpruned teacher** (≈744B) | **3.6%** | [2.8–4.5] | 0.965 | 0.921 | 1207 |
| floor keep-168, 3k-KD **(shipped)** | 7.2% | [6.1–8.4] | 0.928 | 0.880 | 1267 |
| K-cut keep-168-K, 3k-KD | 7.6% | [6.5–8.8] | 0.925 | 0.886 | 1230 |
| K-cut keep-168-K, full 18.6k-KD | 7.8% | [6.7–9.1] | 0.924 | 0.881 | 1232 |
| floor keep-168, full 18.6k-KD | 8.6% | [7.4–9.9] | 0.915 | 0.872 | 1301 |

### 3.2 Pruning roughly doubles the loop rate

The pruned floor (7.2%) vs the unpruned teacher (3.6%) is a **statistically significant** gap: two-proportion **z ≈ 5.0, p < 0.0001**. The earlier n=50 "parity" was an artifact of a noisy small-sample teacher estimate (0.10 at n=50 vs its true 0.036).

**Honest conclusion: REAP pruning costs ~3.6 points of termination robustness.** Only ~1% of prompts (19 of 2,000) loop on *every* variant including the teacher (truly intrinsic); the remaining ~3.6 points are **pruning-induced**. The cost concentrates on the hardest sources: **codex and opencode-cli loop ~13%** on the pruned model vs **~6%** on the teacher; cursor-composer/cursor-chat stay low (1–3%) on both.

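
The significance claim in Section 3.2, and the contrast with the n=50 comparison, can be checked with a pooled two-proportion z-test. The following is a minimal sketch using only the standard library; the function name `two_proportion_z` is illustrative, and the failure counts (144 and 72) are reconstructed from the reported 7.2% and 3.6% rates at n=2,000 rather than taken from raw logs, so treat them as an assumption.

```python
import math

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability under the standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumption: counts reconstructed from the reported rates (7.2% and 3.6% of 2,000).
z_large, p_large = two_proportion_z(144, 2000, 72, 2000)

# The small-sample situation described in Section 2.2: 4 vs 5 failures at n=50.
z_small, p_small = two_proportion_z(4, 50, 5, 50)

print(f"n=2000: z={z_large:.2f}, p={p_large:.1e}")
print(f"n=50:   z={z_small:.2f}, p={p_small:.2f}")
```

With these reconstructed counts the first comparison gives a z close to the reported 5.0, while the 4-vs-5 comparison at n=50 is nowhere near significance, which is the sample-size effect the section describes.
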
### 3.3 Gate-only KD plateaus

Two interventions aimed at closing the gap both **failed**:

- **Knowledge-biased experts (the "K-cut"):** replacing 8 experts/layer with the highest-saliency *knowledge/reasoning* experts that coding-saliency pruning drops gave **no improvement** (7.6–7.8%, within noise).
- **6× more KD data** (18,599 vs 2,999 calibration traces) was **slightly worse** (floor 7.2→8.6%, K-cut 7.6→7.8%).

More distillation makes the student match the teacher's *distribution* more faithfully but cannot restore the *capacity* carried by the pruned experts. Gate-only KD has a ceiling.

### 3.4 The free fix: a sampler guardrail

What gate-KD could not do, a serving-time sampler knob does, for free (n=2,000, on the shipped floor):

| guardrail | loop rate |
|---|---|
| none (raw) | 7.2% |
| `min_p=0.05, rep_pen=1.05` | 4.9% (p=0.002 vs raw) |
| `min_p=0.10, rep_pen=1.10` | 3.3% |
| **`min_p=0.05, rep_pen=1.10`** | **2.3%** |

A light repetition penalty (1.10) takes the pruned model to **2.3%**, below the teacher's *raw* 3.6%, and with distinct-4 rising to 0.95. **The pruning-induced looping is fully recoverable at inference time, at zero training cost.** (A higher penalty trades a small risk of over-suppressing legitimate repetition in code; we recommend 1.05 by default, 1.10 if loops appear.)

Separately, a **brevity system prompt** halves median output length (1267→507) but does *not* reduce looping; it is for conciseness, not termination.

---

## 4. GGUF conversion: making GLM-5.2's DSA indexer loadable

GLM-5.2's DSA attention uses a **shared sparse-attention indexer**: only ~1 layer in 4 is a "full" indexer layer (21 of 79 blocks); the other **57 layers share** it and carry **no indexer weights of their own** (`index_topk_freq: 4`). Stock llama.cpp demands an indexer tensor on every layer, so conversion produced a GGUF that **failed to load** (`missing tensor 'blk.3.indexer.k_norm.weight'`).

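
The load failure above comes down to some blocks having no `blk.N.indexer.*` tensors at all. A minimal sketch of how one might list those blocks from a plain list of tensor names follows. The function name, the toy tensor list, and the 8-block layout are all hypothetical and only illustrate the check; they are not the real GLM-5.2 layout or the conversion script used here. Only the `blk.N.indexer.` naming pattern is taken from the error message.

```python
import re

# Matches names shaped like the one in the loader error: blk.<N>.indexer.<rest>
INDEXER_NAME = re.compile(r"^blk\.(\d+)\.indexer\.")

def blocks_missing_indexer(tensor_names, n_blocks):
    """Return sorted block indices that have no `blk.N.indexer.*` tensor."""
    have = set()
    for name in tensor_names:
        m = INDEXER_NAME.match(name)
        if m:
            have.add(int(m.group(1)))
    return [i for i in range(n_blocks) if i not in have]

# Toy input (hypothetical names): 8 blocks, only blocks 0 and 4 carry an indexer.
names = [
    "blk.0.indexer.k_norm.weight",
    "blk.4.indexer.k_norm.weight",
    "blk.3.attn_norm.weight",  # non-indexer tensor, ignored by the check
]
missing = blocks_missing_indexer(names, n_blocks=8)
print(missing)  # [1, 2, 3, 5, 6, 7]
```

Run against the tensor names of a converted file, a check like this would show up front which blocks a loader that expects per-layer indexer tensors will reject.
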
**Fix:** we filled each of the 57 shared layers' indexer tensors (285 tensors total) by duplicating them from the nearest preceding full layer. The patched GGUF loads and generates coherently.

A second snag, the quantizer rejecting the **MTP layer** (`blk.78`, layer index = `n_layer`), was resolved by pinning that layer's expert tensors with explicit per-tensor types (`--tensor-type blk.78.*=q6_K`), **preserving MTP** (self-speculative decoding) in every quant.

*Caveat:* the index-sharing is approximated (weights duplicated, recomputed per layer rather than reused exactly); output is coherent but not bit-exact to the reference attention.

We release **BF16 + dynamic Q4_K_XL / Q3_K_XL / Q2_K_XL** (per-tensor precision: attention/shared-expert/embeddings/output kept high, routed experts pushed low). Imatrix calibration was not applied (GPU-time constraints).

---

## 5. Negative results (reported for honesty)

- **Knowledge-augmented expert selection** did not reduce looping.
- **6× more KD data** did not help (slightly hurt).
- **A knowledge-recovery LoRA** (the intuitively right lever: add capacity back, distilled from the teacher) was attempted 6× and **did not train**: the 933 GB BF16 model needs ~170 GiB/GPU just to load (uneven MoE packing), leaving no room on GPU0 for the LoRA forward (a 15.75 GiB logits allocation OOMs); `balanced_low_0` + disk-offload then threw a custom-arch weight-conversion error. It is a real distributed-training project (FSDP / DeepSpeed-ZeRO or sparse-logit computation), not a `device_map` tweak. Left as future work.

---

## 6. Limitations

- The eval measures **termination/looping on one author's agent-trace distribution**, not general capability or knowledge benchmarks. Pruned-vs-teacher *knowledge* differences are not quantified here.
- The loop detector's 16k-token instrument cap means borderline-long generations are classified as loops; ~86% of flagged loops are variant-specific (near the cutoff), only ~1–2% are a stable intrinsic core.
- GGUF index-sharing is approximated (§4). Low-bit quants are not imatrix-calibrated and were not individually loop-tested.
- Teacher-with-guardrail was not measured (the pod was returned), so "2.3% beats the teacher" is vs the teacher's *raw* 3.6%.

---

## 7. Released artifacts

- **[`0xSero/GLM-5.2-504B`](https://huggingface.co/0xSero/GLM-5.2-504B)**: the recommended model (keep-168, NVFP4). Serve with `min_p=0.05, repetition_penalty=1.05–1.10`.
- **[`0xSero/GLM-5.2-REAP-504B-GGUF`](https://huggingface.co/0xSero/GLM-5.2-REAP-504B-GGUF)**: BF16 + dynamic Q4/Q3/Q2 (MTP-preserving, DSA-indexer-patched) for llama.cpp.
- **[`0xSero/GLM-5.2-504B-K`](https://huggingface.co/0xSero/GLM-5.2-504B-K)**: knowledge-augmented variant (full-data KD).
- **[`0xSero/GLM-5.2-504B-FullKD`](https://huggingface.co/0xSero/GLM-5.2-504B-FullKD)**: full-data plain-KD variant (ablation).

## 8. Conclusion

REAP + gate-only Router-KD produces a usable, 34%-smaller GLM-5.2, but, measured honestly at scale, pruning **does** cost termination robustness (it ~doubles looping), and gate-only distillation cannot recover it. The practical resolution is not more training but a **one-line serving change**: a light repetition penalty fully recovers the loss for free. The broader lesson is methodological: **small-n behavioral evals on rare-event metrics are dangerously noisy**. The difference between "beats the teacher" and "loops twice as much" was entirely a sample-size artifact.

---

*Pruning, distillation, evaluation, and analysis on 8× NVIDIA B200 sponsored by **[Lambda](https://lambda.ai)**. 🙏*