---
license: mit
base_model: zai-org/GLM-5.2
tags:
- moe
- reap
- pruning
- nvfp4
- glm
---

# GLM-5.2-504B (REAP keep-168, NVFP4)

A **34%-pruned** GLM-5.2: **168 of 256** routed experts kept per layer (incl. the MTP layer), NVFP4-quantized, **~504B params**, recovered via gate-only **Router-KD** to the unpruned teacher. This is the largest / highest-quality of the sibling cuts, and on a well-powered real-world eval it reaches **parity with the full unpruned GLM-5.2**.

## 🙏 Sponsor

Pruning, distillation, and evaluation ran on **8× NVIDIA B200 sponsored by [Lambda](https://lambda.ai)**. **Thank you, Lambda.** 🙏

## What this is

- **Arch:** `GlmMoeDsaForCausalLM`, 78 layers (3 dense + 75 MoE) + 1 MTP layer, DeepSeek Sparse Attention, sigmoid router (top-8), 1 shared expert, hidden 6144.
- **Prune:** REAP (saliency = `gate × ‖expert_output‖`) → top-168/layer, consistent across all MoE layers **and** the MTP layer; `n_routed_experts: 168` (loads cleanly in vLLM).
- **Quant:** NVFP4 (modelopt) routed experts; BF16 router / attention / shared expert.

## Recovery (Router-KD)

Freeze experts + backbone; train only the 75 router gates (~0.016% of params) to KL-match the **unpruned** GLM-5.2 teacher's next-token distribution (plain uniform weighting, lr 5e-5).

## Eval: n=2000 held-out real prompts (raw sampling, no max_tokens / no timeout)

Loops are *detected*, not truncated. 2000 probes harvested from real coding-agent traces (codex, opencode, cursor, claude-code), held out from training.

| metric | keep-168 + Router-KD |
|---|---|
| attractor / loop rate | **0.072** |
| natural-EOS rate | **0.928** |
| output diversity (distinct-4) | **0.880** |
| median output length | 1267 tok |

At this scale the difference vs the unpruned teacher is within noise, i.e. **parity**, measured on 2000 samples (not the usual n=50).
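
For readers unfamiliar with the output-diversity row in the table above, distinct-n is commonly computed as the share of unique n-grams among all n-grams in an output. The sketch below is one plausible reading, not the evaluation code behind these numbers: the function name `distinct_n`, the whitespace tokenization, and per-output (rather than corpus-level) scoring are all assumptions.

```python
def distinct_n(tokens, n=4):
    """Fraction of n-grams in `tokens` that are unique (0.0 if too short).

    Hypothetical helper: the card does not say how distinct-4 was computed.
    """
    if len(tokens) < n:
        return 0.0
    # Slide a window of width n over the token sequence.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


# Toy inputs (whitespace tokens stand in for model tokens, an assumption).
varied = "the router picks eight experts for every token it sees".split()
looping = ("let me try again " * 5).split()

print("varied :", distinct_n(varied))
print("looping:", distinct_n(looping))
```

A looping output keeps cycling through the same few 4-grams, so its score drops well below that of a varied output, which is why the metric is a useful companion to the loop rate.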
The residual loops are inherent to GLM-5.2 (the unpruned teacher exhibits the same ``-restart loops on the same prompts), so they're not a pruning artifact.

## Serving (vLLM)

```bash
vllm serve 0xSero/GLM-5.2-504B --tensor-parallel-size 8 \
  --quantization modelopt_fp4 --kv-cache-dtype fp8 --trust-remote-code --max-model-len 262144
```

**Tip (brevity prompt):** a system prompt like *"Be concise. Think only as much as the task needs, then answer and stop."* roughly halves median output length at no retraining cost.

## More

- **GGUF builds** (BF16 + Q4/Q3/Q2 dynamic): [`0xSero/GLM-5.2-REAP-504B-GGUF`](https://huggingface.co/0xSero/GLM-5.2-REAP-504B-GGUF)
- Siblings: [`0xSero/GLM-5.2-481B`](https://huggingface.co/0xSero/GLM-5.2-481B) (keep-160), [`0xSero/GLM-5.2-469B`](https://huggingface.co/0xSero/GLM-5.2-469B) (keep-156)

---

*Compute sponsored by **[Lambda](https://lambda.ai)**. Thank you. 🙏*

## Honest note (n=2000)

The unpruned teacher loops on only **3.6%** of these prompts vs **~7-8%** for this pruned cut: REAP pruning roughly doubles the loop rate, and gate-only Router-KD (even on full data) does not close it. Earlier small-n evals suggesting parity were a sampling fluke. A knowledge-recovery LoRA is in progress to add capacity back.
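
To see why sample size matters for the comparison in the note above, a rough interval estimate helps. The sketch below computes 95% Wilson score intervals for the two loop rates. It assumes both rates were measured on the same 2000 prompts, takes 0.072 from the eval table as the pruned rate, and uses hypothetical helpers `wilson_interval` and `overlaps`. It illustrates the statistics only and is not the analysis run for this card.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def overlaps(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


TEACHER_RATE = 0.036  # unpruned loop rate quoted in the note
PRUNED_RATE = 0.072   # keep-168 loop rate from the eval table

for n in (50, 2000):
    teacher = wilson_interval(TEACHER_RATE, n)
    pruned = wilson_interval(PRUNED_RATE, n)
    verdict = "overlap" if overlaps(teacher, pruned) else "separated"
    print(f"n={n}: teacher={teacher}, pruned={pruned} -> {verdict}")
```

Under these assumptions the two intervals overlap at n=50 and separate at n=2000, which is consistent with the small-n parity result being a sampling artifact.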