---
license: mit
base_model: zai-org/GLM-5.2
pipeline_tag: text-generation
tags:
  - moe
  - reap
  - pruning
  - expert-pruning
  - router-kd
  - nvfp4
  - glm
  - glm-5.2
---

# GLM-5.2-504B — REAP keep-168, Router-KD recovered (NVFP4)

> A **34%-expert-pruned GLM-5.2** that holds **parity with the full unpruned model** on a
> well-powered real-world eval — recovered by training **only the router gates** (0.016% of params),
> and quantized to **NVFP4** for 8×B200-class serving.

This is the **flagship cut** of the GLM-5.2 REAP series — the largest and highest-fidelity, a strict
superset of the earlier (now-retired) 481B / 469B cuts.

> 📄 **Full technical report:** [`REPORT.md`](./REPORT.md) — the complete study (methods, the n=50→n=2000
> correction, the significance stats, the GGUF DSA-indexer surgery, negative results, and the free
> sampler fix).
size.

## Coding benchmark results

These scores are for `0xSero/GLM-5.2-504B` (REAP keep-168 + Router-KD, NVFP4) served through an OpenAI-compatible endpoint.

| Benchmark | Score | Notes |
|---|---:|---|
| Terminal-Bench 2.1 full-89 | **70.5%** | Locked 89-task board-faithful config, 5 attempts, `terminus-2`, `reasoning_effort=max`, temperature 1.0. |
| Aider Polyglot | **90.1%** | Aggregate language score across Python, JavaScript, Java, C++, Go, and Rust. |

### Aider Polyglot language breakdown

`FULL` is the unpruned GLM-5.2 baseline. `REAP` is this 504B REAP + Router-KD checkpoint. Scores are percentages.

| Language | FULL | REAP | Change |
|---|---:|---:|---:|
| Python | 97.1 | 94.1 | -3.0 |
| JavaScript | 93.9 | 93.9 | 0.0 |
| Java | 85.1 | 85.1 | 0.0 |
| C++ | 92.3 | 96.2 | +3.9 |
| Go | 89.7 | 87.2 | -2.5 |
| Rust | 86.7 | 83.3 | -3.4 |

---

## 🙏 Sponsor
All pruning, distillation, and evaluation ran on **8× NVIDIA B200 generously sponsored by [Lambda](https://lambda.ai)**. Thank you, Lambda. 🙏

---

## What it is

GLM-5.2 is a `GlmMoeDsaForCausalLM` MoE — **78 layers** (3 dense + 75 MoE) + **1 MTP** layer,
**256 routed experts** per layer (top-8) + 1 shared expert, **DeepSeek-style MLA attention with a
DSA sparse "indexer,"** hidden size 6144.

This model keeps **168 of the 256 routed experts per layer** (≈**504B params**, down from ~744–763B),
**consistently across every MoE layer and the MTP layer** (`n_routed_experts: 168`), so it loads and
serves cleanly in vLLM.

| | |
|---|---|
| **Prune method** | **REAP** — saliency = `gate_weight × ‖expert_output‖`, top-168 kept per layer |
| **Recovery** | **Router-KD** — gate-only knowledge distillation to the unpruned teacher |
| **Quantization** | **NVFP4** (modelopt) on routed experts; BF16 router / attention / shared expert |
| **Params** | ~504B (34% of routed experts pruned) |

## How it was recovered (Router-KD)

Pruning experts damages routing. Instead of expensive full fine-tuning, we **freeze the entire
network — experts, attention, embeddings — and train only the 75 router gate matrices** (~0.016% of
all parameters) to **KL-match the unpruned GLM-5.2 teacher's next-token distribution** (plain uniform
position weighting, lr 5e-5). The gates re-learn to route the surviving 168 experts the way the full
model would. This is cheap, fast, and — crucially — it **does not touch the model's knowledge**, only
how it's addressed.

## Evaluation — measured honestly, at scale

We evaluate the behavior that actually matters for agent use: **does it terminate, or fall into
repeat / `</think>`-restart loops?** Probes are **2000 real held-out prompts** harvested from genuine
coding-agent traces (codex, opencode, cursor, claude-code), **raw sampling, no `max_tokens`, no
timeout** — loops are *detected*, never truncated.

| metric (n=2000, held-out, raw) | this model | unpruned teacher |
|---|---|---|
| attractor / loop rate | **0.072** | **0.036** |
| natural-EOS rate | 0.928 | 0.965 |
| output diversity (distinct-4) | 0.880 | 0.921 |
| median output length | 1267 | 1207 |

### Honest accounting of the cost (n=2000)
At small n (50 probes) this looked like parity with the teacher. **Scaled to 2000 probes, it isn't:**
the unpruned teacher loops on **3.6%** of prompts, this model on **7.2%** — so REAP pruning + gate-only
Router-KD **roughly doubles the loop rate** (a statistically significant gap, z≈5, p<0.0001). Router-KD
recovers the *routing* but **cannot fully restore the termination behavior** carried by the pruned
experts, and we confirmed it's a ceiling — a knowledge-augmented expert set and **6× more KD data both
failed to close the gap** (8.6% with full-data floor KD, i.e. slightly worse). The extra loops
concentrate on agentic sources (codex/opencode ~13% here vs ~6% on the teacher). About **1% of prompts
loop on the teacher too** (intrinsic); the other ~3.6 points are the price of pruning.

> A **knowledge-recovery LoRA** (separate, optional adapter) is being trained to add capacity back and
> close this gap — see the model index. The pruned model remains a strong ~34%-smaller option; just
> know it trades ~3.6 points of termination robustness for the size.

## Serving (vLLM)

```bash
vllm serve 0xSero/GLM-5.2-504B \
  --tensor-parallel-size 8 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --max-model-len 262144
```

### Recommended serving — recover most of the loop gap for free

Two no-retrain knobs, measured at n=2000:

- **Anti-loop (recommended): a sampler guardrail.** Measured at n=2000:
  - `min_p=0.05, repetition_penalty=1.05` → loop **4.9%** (gentle, safe default)
  - `min_p=0.05, repetition_penalty=1.10` → loop **2.3%** — **fully recovers** the pruning-induced
    looping (raw is 7.2%; the unpruned teacher is 3.6% raw), with distinct-4 0.95.

  So the "pruning ~doubles looping" cost is **entirely recoverable at serving time, for free.** Start at
  `1.05`; go to `1.10` if you see loops — a higher repetition penalty trades a little risk of
  over-penalizing legitimate repetition (e.g. in code) for near-zero looping.
- **Conciseness: a brevity system prompt** — *"Be concise. Think only as much as the task needs, then
  answer and stop."* — **halves** median length (1267 → 507 tokens). Note it does *not* reduce looping
  (that's the sampler's job); combine the two for short, low-loop output.

## GGUF builds

For llama.cpp / CPU / Metal, see **[`0xSero/GLM-5.2-REAP-504B-GGUF`](https://huggingface.co/0xSero/GLM-5.2-REAP-504B-GGUF)** — BF16 + dynamic Q4 / Q3 / Q2 (with a custom patch to make GLM-5.2's shared-indexer attention loadable in llama.cpp).

---
*REAP expert-pruning + gate-only Router-KD recovery. Compute sponsored by **[Lambda](https://lambda.ai)** — thank you. 🙏*