
Pruning GLM-5.2 to 504B with REAP + Router-KD: an honest, well-powered study of what it costs, and a free fix

A technical report on expert-pruning GLM-5.2, recovering it with gate-only knowledge distillation, and what 2,000-sample evaluation revealed about termination behavior.

Compute: 8× NVIDIA B200, sponsored by Lambda. Models: 0xSero on HuggingFace.


Abstract

We prune GLM-5.2 (≈744–763B parameters, GlmMoeDsaForCausalLM) from 256 to 168 routed experts per layer with REAP saliency, yielding a ~504B (34%-pruned) model, and recover it with gate-only Router-KD: we train only the 75 router-gate matrices (~0.016% of parameters) to KL-match the unpruned teacher. We release the model in NVFP4 and GGUF.

Our central finding is methodological as much as it is about the model. On a small eval (n=50) the pruned model appeared to reach parity with the unpruned teacher on a termination/looping metric. Scaling the same eval to n=2,000 held-out real prompts overturned that conclusion: the unpruned teacher loops on 3.6% of prompts, the pruned model on 7.2%, so pruning roughly doubles the loop rate (two-proportion z≈5.0, p<0.0001). Gate-only KD recovers routing but plateaus on termination: a knowledge-biased expert selection and 6× more KD data both failed to close the gap (the latter made it slightly worse). However, a free, no-retraining sampler guardrail (min_p=0.05, repetition_penalty=1.10) drops the pruned model to 2.3% looping, which more than recovers the pruning cost at serving time relative to the teacher's raw rate. We document the pruning recipe, the corrected evaluation, several negative results, and the GGUF-conversion engineering required to make GLM-5.2's shared sparse-attention indexer loadable in llama.cpp.


1. Background

GLM-5.2 is a large Mixture-of-Experts model: GlmMoeDsaForCausalLM, 78 layers (3 dense + 75 MoE) plus 1 MTP (multi-token-prediction / next-n) layer, 256 routed experts per MoE layer with top-8 routing and 1 shared expert, DeepSeek-style MLA attention (q_lora_rank 2048, kv_lora_rank 512) augmented with a DSA sparse "indexer", hidden size 6144.

REAP (Router-weighted Expert Activation Pruning) scores each expert by saliency = gate_weight × ‖expert_output‖ over a calibration set, and keeps the top-K per layer. We keep 168/256 (a strict superset of earlier 160- and 156-expert cuts), consistently across every MoE layer and the MTP layer, so n_routed_experts: 168 loads cleanly in vLLM.
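The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the release: the function names, the `records` layout, and the choice to average the score over the tokens routed to each expert are assumptions, since the report only states the per-token product and "over a calibration set".

```python
from collections import defaultdict

def reap_saliency(records):
    """Mean of gate_weight * ||expert_output|| per expert.

    `records` is an iterable of (expert_id, gate_weight, output_norm) tuples,
    one per (token, selected expert) pair seen on the calibration set.
    Averaging over routed tokens is an assumption of this sketch.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for expert_id, gate_weight, output_norm in records:
        total[expert_id] += gate_weight * output_norm
        count[expert_id] += 1
    return {e: total[e] / count[e] for e in total}

def keep_top_k(saliency, k):
    """Return the ids of the k most salient experts, sorted by id."""
    ranked = sorted(saliency, key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:k])

# Toy calibration records for a 3-expert layer (made-up numbers).
records = [(0, 0.6, 2.0), (1, 0.4, 1.0), (0, 0.5, 2.0), (2, 0.1, 0.5), (1, 0.3, 3.0)]
scores = reap_saliency(records)
kept = keep_top_k(scores, k=2)
```

In the report's setting the same selection would run once per MoE layer (and the MTP layer) with k=168 out of 256.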

The objective was not raw accuracy but agent-grade behavior: does the pruned model terminate, or fall into repeat / </think>-restart loops? Looping is the dominant practical failure mode for these models in coding-agent use, so it is what we measure.


2. Methods

2.1 Pruning + recovery

  • Prune: REAP saliency → top-168 experts/layer (incl. MTP), n_routed_experts: 168. ~504B params.
  • Quantize: NVFP4 (NVIDIA modelopt) on routed experts; BF16 router / attention / shared expert.
  • Recover (Router-KD): freeze the entire network; train only the 75 router-gate matrices (~0.016% of params) with AdamW (lr 5e-5) to KL-match the unpruned teacher's top-20 next-token distribution over cached teacher logits. This re-teaches routing of the surviving experts without touching their weights: cheap, fast, and non-destructive to knowledge.
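A per-position version of the Router-KD objective can be sketched as follows. The report says only that the student is KL-matched to the teacher's cached top-20 distribution; restricting both distributions to the teacher's top-k token ids and renormalising them over that support is an assumption of this sketch, as are the function names.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def topk_kl(teacher_topk, student_logits):
    """KL(teacher || student) on the teacher's cached top-k tokens.

    teacher_topk: dict token_id -> cached teacher logit (k entries).
    student_logits: full-vocabulary list of student logits.
    Both sides are renormalised over the top-k support (assumption).
    """
    ids = sorted(teacher_topk)
    t_logp = log_softmax([teacher_topk[i] for i in ids])
    s_logp = log_softmax([student_logits[i] for i in ids])
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

# Toy example with k=2 and a 4-token vocabulary.
teacher = {0: 2.0, 3: 1.0}
kl_same = topk_kl(teacher, [2.0, -5.0, 0.0, 1.0])  # student agrees on the support
kl_diff = topk_kl(teacher, [0.0, 0.0, 0.0, 3.0])   # student prefers token 3
```

In training, this loss would be averaged over positions and backpropagated into the router-gate matrices only, with every other parameter frozen.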

2.2 Evaluation

  • Probes: real held-out prompts harvested from the author's own coding-agent traces (codex, opencode-cli, cursor, claude-code), i.e. the actual target distribution.
  • Protocol: raw sampling, no max_tokens, no timeout. Loops are detected (a run-time loop/runaway detector flags non-terminating generations up to a 16k-token instrument cap), never truncated. Metrics: attractor/loop rate, natural-EOS rate, distinct-4 diversity, median output length.
  • The key methodological choice: we scaled from n=50 → n=2,000 probes. At n=50 a "win" is 4-vs-5 failures, which is pure noise. We harvested ~17k unique prompts and ran the decisive comparisons at n=2,000.
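To make the sample-size point concrete, here is a small sketch that compares the uncertainty of a loop-rate estimate at n=50 and at n=2,000. It uses the Wilson score interval, which is a choice made for this illustration; the report does not say which interval method it used. The count 72 is simply 3.6% of 2,000.

```python
import math

def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = failures / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

small = wilson_interval(5, 50)     # 5 loops observed in 50 probes
large = wilson_interval(72, 2000)  # 3.6% of 2,000 probes

print(f"n=50:   {small[0]:.3f} to {small[1]:.3f}")
print(f"n=2000: {large[0]:.3f} to {large[1]:.3f}")
```

At n=50 the interval spans roughly 4% to 21%, so a 4-vs-5 failure difference cannot separate two models; at n=2,000 it narrows to under two percentage points.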

3. Results

3.1 The leaderboard (n=2,000, raw sampling)

| model / config | loop rate | 95% CI | EOS | distinct-4 | median tok |
|---|---|---|---|---|---|
| unpruned teacher (≈744B) | 3.6% | [2.8–4.5] | 0.965 | 0.921 | 1207 |
| floor keep-168, 3k-KD (shipped) | 7.2% | [6.1–8.4] | 0.928 | 0.880 | 1267 |
| K-cut keep-168-K, 3k-KD | 7.6% | [6.5–8.8] | 0.925 | 0.886 | 1230 |
| K-cut keep-168-K, full 18.6k-KD | 7.8% | [6.7–9.1] | 0.924 | 0.881 | 1232 |
| floor keep-168, full 18.6k-KD | 8.6% | [7.4–9.9] | 0.915 | 0.872 | 1301 |

3.2 Pruning roughly doubles the loop rate

The pruned floor (7.2%) vs the unpruned teacher (3.6%) is a statistically significant gap: two-proportion z ≈ 5.0, p < 0.0001. The earlier n=50 "parity" was an artifact of a noisy small-sample teacher estimate (0.10 at n=50 vs 0.036 at n=2,000). Honest conclusion: REAP pruning costs ~3.6 points of termination robustness. Only ~1% of prompts (19 of 2,000) loop on every variant including the teacher (truly intrinsic); the ~3.6-point gap above the teacher's rate is pruning-induced.
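The significance figure can be reproduced with a standard pooled two-proportion z-test. The sketch below assumes the counts 144 and 72, which are back-computed from the reported 7.2% and 3.6% at n=2,000 rather than taken from raw data, and assumes the pooled-variance form of the test.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Pruned floor vs unpruned teacher, counts derived from the reported rates.
z, p = two_proportion_z(144, 2000, 72, 2000)
print(f"z = {z:.2f}, p = {p:.1e}")
```

With these counts z comes out close to 5.0 and the p-value is well below 0.0001, consistent with the figures quoted above.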

The cost concentrates on the hardest sources: codex and opencode-cli loop ~13% on the pruned model vs ~6% on the teacher; cursor-composer/cursor-chat stay low (1–3%) on both.

3.3 Gate-only KD plateaus

Two interventions aimed at closing the gap both failed:

  • Knowledge-biased experts (the "K-cut"): replacing 8 experts/layer with the highest-saliency knowledge/reasoning experts that coding-saliency pruning drops. Result: no improvement (7.6–7.8%, within noise).
  • 6× more KD data (18,599 vs 2,999 calibration traces): slightly worse (floor 7.2→8.6%, K-cut 7.6→7.8%). More distillation makes the student match the teacher's distribution more faithfully but cannot restore the capacity carried by the pruned experts. Gate-only KD has a ceiling.

3.4 The free fix: a sampler guardrail

What gate-KD could not do, a serving-time sampler knob does, for free (n=2,000, on the shipped floor):

| guardrail | loop rate |
|---|---|
| none (raw) | 7.2% |
| min_p=0.05, rep_pen=1.05 | 4.9% (p=0.002 vs raw) |
| min_p=0.10, rep_pen=1.10 | 3.3% |
| min_p=0.05, rep_pen=1.10 | 2.3% |

A light repetition penalty (1.10, combined with min_p=0.05) takes the pruned model to 2.3%, below the teacher's raw 3.6%, with distinct-4 rising to 0.95. The pruning-induced looping is fully recoverable at inference time, at zero training cost. (A higher penalty carries a small risk of over-suppressing legitimate repetition in code; we recommend 1.05 by default, 1.10 if loops appear.) Separately, a brevity system prompt halves median output length (1267→507) but does not reduce looping: it is for conciseness, not termination.
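For readers unfamiliar with the two knobs, the following sketch shows what they do to a single next-token distribution. It follows the commonly used definitions (a multiplicative penalty on the logits of already-generated tokens, then a min_p cutoff relative to the most likely token); the serving engine's exact implementation may differ, for example in whether prompt tokens are also penalised. The function names and toy logits are illustrative.

```python
import math

def apply_repetition_penalty(logits, generated_ids, penalty):
    """Push down the logits of tokens that were already generated."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive logits are divided, negative ones multiplied, so the
        # penalised token always becomes less likely.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def min_p_filter(logits, min_p):
    """Zero out tokens whose probability is below min_p * max probability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

logits = [4.0, 3.9, 0.5, -2.0]  # token 0 is the raw favourite
penalised = apply_repetition_penalty(logits, generated_ids=[0, 0, 3], penalty=1.10)
probs = min_p_filter(penalised, min_p=0.05)
```

In this toy case the already-emitted token 0 drops below token 1 after the penalty, and the two low-probability tail tokens are removed by the min_p cutoff before sampling.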


4. GGUF conversion: making GLM-5.2's DSA indexer loadable

GLM-5.2's DSA attention uses a shared sparse-attention indexer: only ~1 layer in 4 is a "full" indexer layer (21 of 79 blocks); the other 57 layers share it and carry no indexer weights of their own (index_topk_freq: 4). Stock llama.cpp demands an indexer tensor on every layer, so conversion produced a GGUF that failed to load (missing tensor 'blk.3.indexer.k_norm.weight').

Fix: we filled each of the 57 shared layers' indexer tensors (285 tensors total) by duplicating them from the nearest preceding full layer. The patched GGUF loads and generates coherently. A second snag, the quantizer rejecting the MTP layer (blk.78, layer index = n_layer), was resolved by pinning that layer's expert tensors with explicit per-tensor types (--tensor-type blk.78.*=q6_K), preserving MTP (self-speculative decoding) in every quant. Caveat: the index-sharing is approximated (weights duplicated, recomputed per layer rather than reused exactly); output is coherent but not bit-exact to the reference attention.
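The duplication step can be illustrated on a plain dictionary of tensor names. This is a hypothetical sketch, not the conversion script: `fill_shared_indexer` and the toy tensor dictionary are made up for illustration, and a real patch would operate on the GGUF writer's tensor list and copy array data rather than alias it. Only the `blk.N.indexer.` naming pattern is taken from the error message quoted above.

```python
def fill_shared_indexer(tensors, n_blocks, marker=".indexer."):
    """Give every block indexer tensors by copying from the nearest
    preceding block that owns them. Returns the names that were added.

    tensors: dict mapping tensor name -> data (placeholder values here).
    """
    added = []
    source = None  # most recent block that has its own indexer tensors
    for blk in range(n_blocks):
        prefix = f"blk.{blk}."
        own = [n for n in tensors if n.startswith(prefix) and marker in n]
        if own:
            source = blk
            continue
        if source is None:
            continue  # nothing earlier to copy from
        src_prefix = f"blk.{source}."
        src_names = [n for n in tensors if n.startswith(src_prefix) and marker in n]
        for name in src_names:
            new_name = prefix + name[len(src_prefix):]
            tensors[new_name] = tensors[name]
            added.append(new_name)
    return added

# Toy model: blocks 0 and 3 are "full", blocks 1 and 2 share block 0's indexer.
tensors = {
    "blk.0.indexer.k_norm.weight": "w0",
    "blk.0.attn_q.weight": "q0",  # placeholder non-indexer tensor
    "blk.1.attn_q.weight": "q1",
    "blk.2.attn_q.weight": "q2",
    "blk.3.indexer.k_norm.weight": "w3",
}
added = fill_shared_indexer(tensors, n_blocks=4)
```

After the call, every block carries an indexer tensor, which is what the stock loader expects; the shared blocks hold copies of the preceding full block's weights.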

We release BF16 + dynamic Q4_K_XL / Q3_K_XL / Q2_K_XL (per-tensor precision: attention/shared-expert/embeddings/output kept high, routed experts pushed low). Imatrix calibration was not applied (GPU-time constraints).


5. Negative results (reported for honesty)

  • Knowledge-augmented expert selection did not reduce looping.
  • 6× more KD data did not help (slightly hurt).
  • A knowledge-recovery LoRA (the intuitively right lever: add capacity back, distilled from the teacher) was attempted six times and did not train: the 933 GB BF16 model needs ~170 GiB/GPU just to load (uneven MoE packing), leaving no room on GPU0 for the LoRA forward (a 15.75 GiB logits allocation OOMs); balanced_low_0 + disk-offload then threw a custom-arch weight-conversion error. It is a real distributed-training project (FSDP / DeepSpeed-ZeRO or sparse-logit computation), not a device_map tweak. Left as future work.

6. Limitations

  • The eval measures termination/looping on one author's agent-trace distribution, not general capability or knowledge benchmarks. Pruned-vs-teacher knowledge differences are not quantified here.
  • The loop detector's 16k-token instrument cap means borderline-long generations are classified as loops; ~86% of flagged loops are variant-specific (near the cutoff), only ~1–2% are a stable intrinsic core.
  • GGUF index-sharing is approximated (§4). Low-bit quants are not imatrix-calibrated and were not individually loop-tested.
  • Teacher-with-guardrail was not measured (the pod was returned), so "2.3% beats the teacher" is vs the teacher's raw 3.6%.

7. Released artifacts

8. Conclusion

REAP + gate-only Router-KD produces a usable, 34%-smaller GLM-5.2, but, measured honestly at scale, pruning does cost termination robustness (it roughly doubles looping), and gate-only distillation cannot recover it. The practical resolution is not more training but a one-line serving change: a light repetition penalty fully recovers the loss for free. The broader lesson is methodological: small-n behavioral evals on rare-event metrics are dangerously noisy. The difference between "beats the teacher" and "loops twice as much" was entirely a sample-size artifact.


*Pruning, distillation, evaluation, and analysis on 8× NVIDIA B200 sponsored by Lambda. 🙏*