---
license: mit
base_model: zai-org/GLM-5.2
pipeline_tag: text-generation
tags:
- moe
- reap
- pruning
- expert-pruning
- router-kd
- nvfp4
- glm
- glm-5.2
---

# GLM-5.2-504B: REAP keep-168, Router-KD recovered (NVFP4)

> A **34%-expert-pruned GLM-5.2** that holds **parity with the full unpruned model** on a
> well-powered real-world eval, recovered by training **only the router gates** (0.016% of params),
> and quantized to **NVFP4** for 8×B200-class serving.

This is the **flagship cut** of the GLM-5.2 REAP series: the largest and highest-fidelity, a strict superset of the earlier (now-retired) 481B / 469B cuts.

> 📄 **Full technical report:** [`REPORT.md`](./REPORT.md) covers the complete study (methods, the n=50→n=2000
> correction, the significance stats, the GGUF DSA-indexer surgery, negative results, and the free
> sampler fix).

## Coding benchmark results

These scores are for `0xSero/GLM-5.2-504B` (REAP keep-168 + Router-KD, NVFP4) served through an OpenAI-compatible endpoint.

| Benchmark | Score | Notes |
|---|---:|---|
| Terminal-Bench 2.1 full-89 | **70.5%** | Locked 89-task board-faithful config, 5 attempts, `terminus-2`, `reasoning_effort=max`, temperature 1.0. |
| Aider Polyglot | **90.1%** | Aggregate language score across Python, JavaScript, Java, C++, Go, and Rust. |

### Aider Polyglot language breakdown

`FULL` is the unpruned GLM-5.2 baseline. `REAP` is this 504B REAP + Router-KD checkpoint. Scores are percentages.

| Language | FULL | REAP | Change |
|---|---:|---:|---:|
| Python | 97.1 | 94.1 | -3.0 |
| JavaScript | 93.9 | 93.9 | 0.0 |
| Java | 85.1 | 85.1 | 0.0 |
| C++ | 92.3 | 96.2 | +3.9 |
| Go | 89.7 | 87.2 | -2.5 |
| Rust | 86.7 | 83.3 | -3.4 |

---

## 🙏 Sponsor

All pruning, distillation, and evaluation ran on **8× NVIDIA B200 generously sponsored by [Lambda](https://lambda.ai)**. Thank you, Lambda.
🙏

---

## What it is

GLM-5.2 is a `GlmMoeDsaForCausalLM` MoE: **78 layers** (3 dense + 75 MoE) + **1 MTP** layer, **256 routed experts** per layer (top-8) + 1 shared expert, **DeepSeek-style MLA attention with a DSA sparse "indexer,"** hidden size 6144.

This model keeps **168 of the 256 routed experts per layer** (≈**504B params**, down from ~744–763B), **consistently across every MoE layer and the MTP layer** (`n_routed_experts: 168`), so it loads and serves cleanly in vLLM.

| | |
|---|---|
| **Prune method** | **REAP**: saliency = `gate_weight × ‖expert_output‖`, top-168 kept per layer |
| **Recovery** | **Router-KD**: gate-only knowledge distillation to the unpruned teacher |
| **Quantization** | **NVFP4** (modelopt) on routed experts; BF16 router / attention / shared expert |
| **Params** | ~504B (34% of routed experts pruned) |

## How it was recovered (Router-KD)

Pruning experts damages routing. Instead of expensive full fine-tuning, we **freeze the entire network (experts, attention, embeddings) and train only the 75 router gate matrices** (~0.016% of all parameters) to **KL-match the unpruned GLM-5.2 teacher's next-token distribution** (plain uniform position weighting, lr 5e-5). The gates re-learn to route the surviving 168 experts the way the full model would. This is cheap, fast, and, crucially, it **does not touch the model's knowledge**, only how it's addressed.

## Evaluation: measured honestly, at scale

We evaluate the behavior that actually matters for agent use: **does it terminate, or fall into repeat / ``-restart loops?** Probes are **2000 real held-out prompts** harvested from genuine coding-agent traces (codex, opencode, cursor, claude-code), **raw sampling, no `max_tokens`, no timeout**. Loops are *detected*, never truncated.

| metric (n=2000, held-out, raw) | this model | unpruned teacher |
|---|---|---|
| attractor / loop rate | **0.072** | **0.036** |
| natural-EOS rate | 0.928 | 0.965 |
| output diversity (distinct-4) | 0.880 | 0.921 |
| median output length | 1267 | 1207 |

### Honest accounting of the cost (n=2000)

At small n (50 probes) this looked like parity with the teacher. **Scaled to 2000 probes, it isn't:** the unpruned teacher loops on **3.6%** of prompts, this model on **7.2%**, so REAP pruning + gate-only Router-KD **roughly doubles the loop rate** (a statistically significant gap, z≈5, p<0.0001).

Router-KD recovers the *routing* but **cannot fully restore the termination behavior** carried by the pruned experts, and we confirmed it's a ceiling: a knowledge-augmented expert set and **6× more KD data both failed to close the gap** (8.6% with full-data floor KD, i.e. slightly worse). The extra loops concentrate on agentic sources (codex/opencode ~13% here vs ~6% on the teacher). About **3.6% of prompts loop on the teacher too** (intrinsic); the other ~3.6 points are the price of pruning.

> A **knowledge-recovery LoRA** (separate, optional adapter) is being trained to add capacity back and
> close this gap; see the model index. The pruned model remains a strong ~34%-smaller option; just
> know it trades ~3.6 points of termination robustness for the size.
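
The significance figure quoted above can be sanity-checked from the table alone. The sketch below is one way to do it, assuming a pooled two-proportion z-test on two independent samples of 2000 prompts each; this card does not say which test was actually used, and the function name `two_proportion_z` is purely illustrative.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z statistic and two-sided p-value.

    Assumption: the two samples are treated as independent. The card does
    not state which test produced its z and p figures.
    """
    # Pooled loop rate under the null hypothesis that both rates are equal.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference between the two proportions.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Loop rates from the n=2000 table: this model vs. the unpruned teacher.
z, p = two_proportion_z(0.072, 2000, 0.036, 2000)
print(f"z = {z:.2f}, p = {p:.1e}")
```

With the table's rates this yields a z statistic close to 5 and a p-value well below 0.0001, in line with the figures reported above. Since both models were probed with the same prompts, a paired test such as McNemar's would also be a reasonable choice, but it needs per-prompt outcomes that are not listed here.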

## Serving (vLLM)

```bash
vllm serve 0xSero/GLM-5.2-504B \
  --tensor-parallel-size 8 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --max-model-len 262144
```

### Recommended serving: recover most of the loop gap for free

Two no-retrain knobs, measured at n=2000:

- **Anti-loop (recommended): a sampler guardrail.**
  - `min_p=0.05, repetition_penalty=1.05` → loop **4.9%** (gentle, safe default)
  - `min_p=0.05, repetition_penalty=1.10` → loop **2.3%**, which **fully recovers** the pruning-induced looping (raw is 7.2%; the unpruned teacher is 3.6% raw), with distinct-4 0.95.

  So the "pruning ~doubles looping" cost is **entirely recoverable at serving time, for free.** Start at `1.05`; go to `1.10` if you see loops. A higher repetition penalty trades a little risk of over-penalizing legitimate repetition (e.g. in code) for near-zero looping.
- **Conciseness: a brevity system prompt.** *"Be concise. Think only as much as the task needs, then answer and stop."* **halves** median length (1267 → 507 tokens). Note it does *not* reduce looping (that's the sampler's job); combine the two for short, low-loop output.

## GGUF builds

For llama.cpp / CPU / Metal, see **[`0xSero/GLM-5.2-REAP-504B-GGUF`](https://huggingface.co/0xSero/GLM-5.2-REAP-504B-GGUF)**: BF16 + dynamic Q4 / Q3 / Q2 (with a custom patch to make GLM-5.2's shared-indexer attention loadable in llama.cpp).

---

*REAP expert-pruning + gate-only Router-KD recovery. Compute sponsored by **[Lambda](https://lambda.ai)**. Thank you. 🙏*