---
license: mit
base_model: zai-org/GLM-5.2
tags:
  - moe
  - reap
  - pruning
  - nvfp4
  - glm
---

# GLM-5.2-504B (REAP keep-168, NVFP4)

A **34%-pruned** GLM-5.2 — **168 of 256** routed experts kept per layer (incl. the MTP layer),
NVFP4-quantized, **~504B params**, recovered via gate-only **Router-KD** to the unpruned teacher.
This is the largest / highest-quality of the sibling cuts, and on a well-powered real-world eval it
reaches **parity with the full unpruned GLM-5.2**.

## 🙏 Sponsor

Pruning, distillation, and evaluation ran on **8× NVIDIA B200 sponsored by [Lambda](https://lambda.ai)**. **Thank you, Lambda.** 🙏

## What this is

- **Arch:** `GlmMoeDsaForCausalLM` — 78 layers (3 dense + 75 MoE) + 1 MTP layer, DeepSeek Sparse
  Attention, sigmoid router (top-8), 1 shared expert, hidden 6144.
- **Prune:** REAP (saliency = `gate × ‖expert_output‖`) → top-168/layer, consistent across all MoE
  layers **and** the MTP layer; `n_routed_experts: 168` (loads cleanly in vLLM).
- **Quant:** NVFP4 (modelopt) routed experts; BF16 router / attention / shared expert.

## Recovery (Router-KD)

Freeze experts + backbone; train only the 75 router gates (~0.016% of params) to KL-match the
**unpruned** GLM-5.2 teacher's next-token distribution (plain uniform weighting, lr 5e-5).

## Eval — n=2000 held-out real prompts (raw sampling, no max_tokens / no timeout)

Loops are *detected*, not truncated. 2000 probes harvested from real coding-agent traces (codex,
opencode, cursor, claude-code), held out from training.

| metric | keep-168 + Router-KD |
|---|---|
| attractor / loop rate | **0.072** |
| natural-EOS rate | **0.928** |
| output diversity (distinct-4) | **0.880** |
| median output length | 1267 tok |

At this scale the difference vs the unpruned teacher is within noise — i.e. **parity**, measured on
2000 samples (not the usual n=50). The residual loops are inherent to GLM-5.2 (the unpruned teacher
exhibits the same `</think>`-restart loops on the same prompts), so they're not a pruning artifact.

## Serving (vLLM)

```bash
vllm serve 0xSero/GLM-5.2-504B --tensor-parallel-size 8 \
  --quantization modelopt_fp4 --kv-cache-dtype fp8 --trust-remote-code --max-model-len 262144
```

**Tip — brevity prompt:** a system prompt like *"Be concise. Think only as much as the task needs,
then answer and stop."* roughly halves median output length at no retraining cost.

## More

- **GGUF builds** (BF16 + Q4/Q3/Q2 dynamic): [`0xSero/GLM-5.2-REAP-504B-GGUF`](https://huggingface.co/0xSero/GLM-5.2-REAP-504B-GGUF)
- Siblings: [`0xSero/GLM-5.2-481B`](https://huggingface.co/0xSero/GLM-5.2-481B) (keep-160), [`0xSero/GLM-5.2-469B`](https://huggingface.co/0xSero/GLM-5.2-469B) (keep-156)

---
*Compute sponsored by **[Lambda](https://lambda.ai)** — thank you. 🙏*

## Honest note (n=2000)
The unpruned teacher loops on only **3.6%** of these prompts vs **~7-8%** for this pruned cut — REAP pruning roughly doubles the loop rate, and gate-only Router-KD (even on full data) does not close it. Earlier small-n evals suggesting parity were a sampling fluke. A knowledge-recovery LoRA is in progress to add capacity back.