---
license: mit
base_model: zai-org/GLM-5.2
base_model_relation: quantized
pipeline_tag: text-generation
library_name: mlx
tags:
- mlx
- moe
- glm
- reap
- pruned
- text-generation
---

# GLM-5.2-REAP25-MLX-4bit

**REAP expert-pruned + 4-bit MLX** conversion of [zai-org/GLM-5.2](https://huggingface.co/zai-org/GLM-5.2). Keeps the **192 most-salient experts per layer** (of 256), giving **~572B** params: smaller and faster than the full model.

## What is this

Pruned with **[REAP](https://github.com/CerebrasResearch/reap)** (Router-weighted Expert Activation Pruning, Cerebras / ICLR 2026). For each MoE layer, experts are scored by `mean(router_gate_weight × ‖expert_output‖)` over a calibration set; the lowest-saliency experts are dropped and the router is sliced to the survivors. No retraining. `n_routed_experts` is reduced from 256 to 192.

## Quality (held-out perplexity on Frankenstein, not in calibration)

| Variant | Experts | ~Params | Held-out PPL | vs full |
|---|---|---|---|---|
| full GLM-5.2 ([4-bit](https://huggingface.co/pipenetwork/GLM-5.2-MLX-4bit)) | 256 | ~750B | 1.447 | baseline |
| **REAP25** (this repo) | 192 | ~572B | 1.481 | +2.3% |
| [REAP37](https://huggingface.co/pipenetwork/GLM-5.2-REAP37-MLX-4bit) | 160 | ~480B | 1.553 | +7.3% |
| [REAP50](https://huggingface.co/pipenetwork/GLM-5.2-REAP50-MLX-4bit) | 128 | ~394B | 1.990 | +37.5% |

This variant: **PPL 1.481 (+2.3% vs full)**, which is near-lossless. (Absolute PPL is low because the eval text is highly predictable; treat the numbers as *relative* degradation.)

## Methodology

Calibrated on the 4-bit GLM-5.2 (192 sequences × 1024 tokens, prose + code); pruned during MLX conversion (no intermediate bf16). Requires the `glm_moe_dsa` / `deepseek_v32` MLX path with per-layer indexer handling.

## Use with mlx-lm

```bash
pip install mlx-lm
python -m mlx_lm generate --model pipenetwork/GLM-5.2-REAP25-MLX-4bit --prompt "Hello" --max-tokens 256
```

## License

MIT (inherited from GLM-5.2).

Quantization: `{"group_size": 64, "bits": 4, "mode": "affine"}`.
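
## Sketch: REAP saliency scoring

The scoring rule described under "What is this" can be illustrated with a small, dependency-free sketch. This is not the REAP repository's code or API: the function names (`saliency_scores`, `select_survivors`, `slice_router`) and the toy data are hypothetical, and averaging each expert's score over only the tokens routed to it is an assumption about how the mean is taken.

```python
import math

def saliency_scores(activations, n_experts):
    """Mean of gate_weight * L2-norm(expert_output) per expert.

    `activations` is an iterable of (expert_id, gate_weight, output_vector)
    records, one per (token, activated expert) pair in the calibration set.
    Assumption: the mean runs over tokens routed to that expert.
    """
    totals = [0.0] * n_experts
    counts = [0] * n_experts
    for expert_id, gate, output in activations:
        norm = math.sqrt(sum(x * x for x in output))
        totals[expert_id] += gate * norm
        counts[expert_id] += 1
    # Experts that never fired on the calibration set get a score of 0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

def select_survivors(scores, n_keep):
    """Indices of the n_keep highest-saliency experts, in original order."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:n_keep])

def slice_router(router_rows, survivors):
    """Keep only the router rows (one per expert) of surviving experts."""
    return [router_rows[e] for e in survivors]

# Toy layer: 4 experts, keep 3 (this repo keeps 192 of 256 per layer).
activations = [
    (0, 0.5, [3.0, 4.0]),
    (1, 0.5, [0.0, 1.0]),
    (2, 0.9, [6.0, 8.0]),
    (3, 0.1, [0.3, 0.4]),
    (0, 0.7, [0.0, 5.0]),
]
router_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

scores = saliency_scores(activations, n_experts=4)
survivors = select_survivors(scores, n_keep=3)
pruned_router = slice_router(router_rows, survivors)
print(survivors, len(pruned_router))
```

In this toy layer expert 3 has the lowest score and is dropped, so the router shrinks from four rows to three. No weights are updated at any point, which matches the "no retraining" property stated above.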