danielhanchen commited on
Commit
0f78119
ยท
verified ยท
1 Parent(s): 1c8c130

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +154 -0
README.md ADDED
@@ -0,0 +1,154 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - en
4
+ library_name: transformers
5
+ tags:
6
+ - glm
7
+ - MOE
8
+ - pruning
9
+ - compression
10
+ - unsloth
11
+ license: mit
12
+ license_link: https://huggingface.co/zai-org/GLM-4.7/blob/main/LICENSE
13
+ pipeline_tag: text-generation
14
+ base_model:
15
+ - cerebras/GLM-4.7-REAP-218B-A32B
16
+ ---
17
+ > [!NOTE]
18
+ > Includes Unsloth **chat template fixes**! <br> For `llama.cpp`, use `--jinja`
19
+ >
20
+
21
+ <div>
22
+ <p style="margin-top: 0;margin-bottom: 0;">
23
+ <em><a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0</a> achieves superior accuracy & outperforms other leading quants.</em>
24
+ </p>
25
+ <div style="display: flex; gap: 5px; align-items: center; ">
26
+ <a href="https://github.com/unslothai/unsloth/">
27
+ <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
28
+ </a>
29
+ <a href="https://discord.gg/unsloth">
30
+ <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
31
+ </a>
32
+ <a href="https://docs.unsloth.ai/">
33
+ <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
34
+ </a>
35
+ </div>
36
+ </div>
37
+
38
+ <p align="center">
39
+ <em>๐“Œณ <strong>REAP</strong>๐“Œณ the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
40
+ <img src="https://i.imgur.com/rmzG3gg.png" alt="REAP" width="75%">
41
+ </p>
42
+
43
+ # GLM-4.7-REAP-218B-A32B
44
+
45
+ ## โœจ Highlights
46
+
47
+ Introducing **GLM-4.7-REAP-218B-A32B**, a **memory-efficient compressed variant** of GLM-4.7 that maintains near-identical performance while being **40% lighter**.
48
+
49
+ This model was created using **REAP (Router-weighted Expert Activation Pruning)**, a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over remaining experts. Key features include:
50
+
51
+ - **Near-Lossless Performance**: Maintains almost identical accuracy on code generation, agentic coding, and function calling tasks compared to the full 355B model
52
+ - **40% Memory Reduction**: Compressed from 355B to 218B parameters, significantly lowering deployment costs and memory requirements
53
+ - **Preserved Capabilities**: Retains all core functionalities including code generation, agentic workflows, repository-scale understanding, and function calling
54
+ - **Drop-in Compatibility**: Works with vanilla vLLM - no source modifications or custom patches required
55
+ - **Optimized for Real-World Use**: Particularly effective for resource-constrained environments, local deployments, and academic research
56
+
57
+ **For downstream low-bit quantization, we suggest using the [BF16 variant](https://huggingface.co/cerebras/GLM-4.7-REAP-218B-A32B).**
58
+
59
+ ---
60
+ ## ๐Ÿ“‹ Model Overview
61
+
62
+ **GLM-4.7-REAP-218B-A32B** has the following specifications:
63
+
64
+ - **Base Model**: GLM-4.7
65
+ - **Compression Method**: REAP (Router-weighted Expert Activation Pruning)
66
+ - **Compression Ratio**: 40% expert pruning
67
+ - **Type**: Sparse Mixture-of-Experts (SMoE) Causal Language Model
68
+ - **Number of Parameters**: 218B total, 32B activated per token
69
+ - **Number of Layers**: 92
70
+ - **Number of Attention Heads (GQA)**: 96 for Q and 8 for KV
71
+ - **Number of Experts**: 96 (uniformly pruned from 160)
72
+ - **Number of Activated Experts**: 8 per token
73
+ - **Context Length**: 202,752 tokens
74
+ - **License**: MIT
75
+
76
+ ---
77
+
78
+ ## ๐Ÿ“Š Evaluations
79
+
80
+ TBD for BF16 model. [Evalulation results available for the FP8 variant](https://huggingface.co/cerebras/GLM-4.7-REAP-218B-A32B-FP8#%F0%9F%93%8A-evaluations).
81
+
82
+ For more details on the evaluation setup, refer to the [REAP arXiv preprint](https://arxiv.org/abs/2510.13999).
83
+
84
+ ---
85
+
86
+ ## ๐Ÿš€ Deployment
87
+
88
+ You can deploy the model directly using the **latest vLLM** (v0.11.0), no source modifications or custom patches required.
89
+
90
+ ```bash
91
+ vllm serve cerebras/GLM-4.7-REAP-218B-A32B \
92
+ --tensor-parallel-size 8 \
93
+ --reasoning-parser glm45 \
94
+ --tool-call-parser glm47 \
95
+ --enable-auto-tool-choice \
96
+ --enable-expert-parallel
97
+ ```
98
+
99
+ If you encounter insufficient memory when running this model, you might need to set a lower value for `--max-num-seqs` flag (e.g. set to 64).
100
+
101
+
102
+ ## ๐Ÿงฉ Model Creation
103
+
104
+ This checkpoint was created by applying the **REAP (Router-weighted Expert Activation Pruning)** method uniformly across all Mixture-of-Experts (MoE) blocks of **GLM-4.7**, with a **40% pruning rate**.
105
+
106
+ ### How REAP Works
107
+
108
+ REAP selects experts to prune based on a novel **saliency criterion** that considers both:
109
+ - **Router gate values**: How frequently and strongly the router activates each expert
110
+ - **Expert activation norms**: The magnitude of each expert's output contributions
111
+
112
+ This dual consideration ensures that experts contributing minimally to the layer's output are pruned, while preserving those that play critical roles in the model's computations.
113
+
114
+ ### Key Advantages
115
+
116
+ - **One-Shot Compression**: No fine-tuning required after pruning - the model is immediately ready for deployment
117
+ - **Preserved Router Control**: Unlike expert merging methods, REAP maintains the router's independent, input-dependent control over remaining experts, avoiding "functional subspace collapse"
118
+ - **Generative Task Superiority**: REAP significantly outperforms expert merging approaches on generative benchmarks (code generation, creative writing, mathematical reasoning) while maintaining competitive performance on discriminative tasks
119
+
120
+ ### Calibration
121
+
122
+ The model was calibrated using a diverse mixture of domain-specific datasets including:
123
+ - Code generation samples ([evol-codealpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1))
124
+ - Function calling examples ([xlam-function-calling](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k))
125
+ - Agentic multi-turn trajectories ([SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories))
126
+
127
+ ๐Ÿ“š For more details, refer to the following resources:
128
+
129
+ - [๐Ÿงพ arXiv Preprint](https://arxiv.org/abs/2510.13999)
130
+ - [๐Ÿงพ REAP Blog](https://www.cerebras.ai/blog/reap)
131
+ - [๐Ÿ’ป REAP Codebase (GitHub)](https://github.com/CerebrasResearch/reap)
132
+
133
+ ---
134
+
135
+ ## โš–๏ธ License
136
+
137
+ This model is derived from
138
+ **[`zai-org/GLM-4.7`](https://huggingface.co/zai-org/GLM-4.7)**
139
+ and distributed under the **MIT license**.
140
+
141
+ ---
142
+
143
+ ## ๐Ÿงพ Citation
144
+
145
+ If you use this checkpoint, please cite the REAP paper:
146
+
147
+ ```bibtex
148
+ @article{lasby-reap,
149
+ title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
150
+ author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
151
+ journal={arXiv preprint arXiv:2510.13999},
152
+ year={2025}
153
+ }
154
+ ```