---
license: mit
base_model: zai-org/GLM-5.2
base_model_relation: quantized
pipeline_tag: text-generation
library_name: mlx
tags:
- mlx
- moe
- glm
- reap
- pruned
- text-generation
---

# GLM-5.2-REAP25-MLX-4bit

**REAP expert-pruned + 4-bit MLX** conversion of [zai-org/GLM-5.2](https://huggingface.co/zai-org/GLM-5.2). Keeps the **192 most-salient experts per layer** (of 256) → **~572B** params, smaller/faster than the full model.

## What is this

Pruned with **[REAP](https://github.com/CerebrasResearch/reap)** (Router-weighted Expert Activation Pruning, Cerebras / ICLR 2026): per MoE layer, experts are scored by `mean(router_gate_weight × ‖expert_output‖)` over a calibration set; the lowest-saliency experts are dropped and the router is sliced to the survivors. No retraining. `n_routed_experts` reduced 256→192.

## Quality (held-out perplexity, Frankenstein — not in calibration)

| Variant | Experts | ~Params | Held-out PPL | vs full |
|---|---|---|---|---|
| full GLM-5.2 ([4-bit](https://huggingface.co/pipenetwork/GLM-5.2-MLX-4bit)) | 256 | ~750B | 1.447 | — |
| **REAP25** (this repo) | 192 | ~572B | 1.481 | +2.3% |
| [REAP37](https://huggingface.co/pipenetwork/GLM-5.2-REAP37-MLX-4bit) | 160 | ~480B | 1.553 | +7.3% |
| [REAP50](https://huggingface.co/pipenetwork/GLM-5.2-REAP50-MLX-4bit) | 128 | ~394B | 1.990 | +37.5% |

This variant: **PPL 1.481 (+2.3% vs full)** — near-lossless. (Absolute PPL is low because the eval text is highly predictable; treat the numbers as *relative* degradation.)

## Methodology
Calibrated on the 4-bit GLM-5.2 (192 seqs × 1024 tok, prose + code); pruned during MLX conversion (no intermediate bf16). Requires the `glm_moe_dsa` / `deepseek_v32` MLX path with per-layer indexer handling.

## Use with mlx-lm
```bash
pip install mlx-lm
python -m mlx_lm generate --model pipenetwork/GLM-5.2-REAP25-MLX-4bit --prompt "Hello" -m 256
```

## License
MIT (inherited from GLM-5.2). Quantization: `{"group_size": 64, "bits": 4, "mode": "affine"}`.