---
license: other
license_name: modified-mit
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: finetune
tags:
- minimax
- moe
- reap
- pruning
- text-generation
library_name: transformers
pipeline_tag: text-generation
---

# m51Lab-MiniMax-M2.7-REAP-139B-A10B

**First publicly available REAP-40% pruned variant of MiniMax-M2.7**, released by m51Lab on 2026-04-15.

---

MiniMax-M2.7 is a 229B-parameter Mixture-of-Experts LLM released 2026-04-12. This variant uses [REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999) to prune 40 % of experts per MoE block, reducing total parameters to ~139 B while preserving ~10 B active parameters per token. The result runs comfortably on systems the full model cannot reach (notably 96 GB Apple Silicon via GGUF).

## Architecture

| Property | Value |
|----------|-------|
| Base model | [`MiniMaxAI/MiniMax-M2.7`](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) |
| Transformer layers | 62 |
| Hidden size | 3 072 |
| Intermediate (expert) | 1 536 |
| MoE experts per block | **154** (256 − 40 %) |
| Top-k routing | 8 |
| Active parameters / token | ~10 B |
| Total parameters | ~139 B |
| Max position embeddings | 196 608 |
| Vocabulary size | 200 064 |
| License | Modified MIT (inherited) |

## Pruning parameters

- **Method**: REAP (Lasby et al.
2025, arXiv:2510.13999)
- **Pruning rate**: 40 % of experts per MoE block (256 → 154)
- **Seed**: 42
- **Router renormalization**: enabled
- **Calibration sequence length**: 2 048 tokens
- **Effective samples**: 6 144 packed across the three datasets below
- **Distance measure**: cosine
- **Singleton super/outlier experts**: disabled

### Calibration dataset mix

| Dataset | Samples | Purpose |
|---------|---------|---------|
| [`theblackcat102/evol-codealpaca-v1`](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 2 048 | General coding |
| [`open-r1/Mixture-of-Thoughts[math]`](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) | 2 048 | Math / science / code reasoning |
| [`Salesforce/xlam-function-calling-60k`](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 2 048 | Single-turn tool calling |

This mix mirrors Cerebras's public MiniMax-M2 / M2.1 / M2.5 REAP releases.

## Evaluation

**HumanEval pass@1 (on completed): 83.3 %** (90 / 108)

For problems where the model completed its `<think>` reasoning within a 32 K-token generation budget, this variant (REAP-40 % pruned + Q4_K_M) solved 90 of 108 correctly, a strong quality signal for a 4-bit quantized, structurally pruned MoE.

**Strict pass@1 (all 164 problems, cap-outs counted as fails): 54.9 %**

56 of 164 problems exhausted the 32 K reasoning budget mid-`<think>` and are counted as fails under strict academic scoring. This is the production-deployment score if you constrain generation to 32 K tokens; a budget of **≥64 K tokens is recommended to approach the 83 % ceiling**.

**Methodology**: 2 × H100 80 GB, llama.cpp `/v1/chat/completions`, native `<think>` enabled, `temperature=0.2`, `top_p=0.95`, `max_tokens=32000`. No post-processing beyond HumanEval's canonical grading.

*For continuity with prior quant comparisons*: an earlier evaluation using raw `/v1/completions` + chat-prose stripping (non-canonical for reasoning models, bypasses `<think>`) reported 65.2 % (107 / 164).
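
The percentages in this section all follow from the reported counts. A minimal sketch of the arithmetic, using only the figures stated here (the variable names and the `pass_rate` helper are illustrative and not part of any evaluation harness):

```python
# Counts reported in the Evaluation section.
total_problems = 164  # full HumanEval set
completed = 108       # finished reasoning within the 32 K-token budget
capped_out = 56       # exhausted the budget; scored as fails under strict grading
solved = 90           # correct answers among the completed problems


def pass_rate(passed: int, attempted: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    return round(100.0 * passed / attempted, 1)


# Completed and capped-out problems together cover the whole benchmark.
assert completed + capped_out == total_problems

pass_on_completed = pass_rate(solved, completed)   # denominator: completed only
strict_pass = pass_rate(solved, total_problems)    # denominator: all problems
earlier_eval = pass_rate(107, total_problems)      # raw /v1/completions run

print(f"pass@1 on completed: {pass_on_completed} %")
print(f"strict pass@1:       {strict_pass} %")
print(f"earlier evaluation:  {earlier_eval} %")
```

The only difference between the two headline scores is the denominator: the 56 capped-out problems are excluded from the first and counted as failures in the second.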
The headline pass@1 figures in this section use the canonical chat-completion path.

### Smoke test (pre-publish, 5 diverse prompts)

| # | Prompt type | Verdict |
|---|---|---|
| 1 | Trivial arithmetic | PASS |
| 2 | Python Fibonacci | PASS |
| 3 | Norwegian response | PASS |
| 4 | MoE semantic explanation | PASS |
| 5 | JSON tool-call echo | PASS |

5 / 5 PASS. This is a basic sanity check of out-of-box inference, not a benchmark.

## Known minor imperfection

During the integrity audit of the 62-layer bias-correction tensor fix, one layer (`layer 0`) had expert keep-indices that differed slightly from the REAP-retained set (86 of 154 positions). The magnitude of the resulting bias mismatch is bounded by the natural variance of the layer-0 bias (`max |Δ| = 0.75` on values in `[8.06, 8.88]`), so the impact on routing is negligible, which is consistent with the 5/5 smoke test above. The other 61 layers are bit-perfect. Full analysis in the [reproducibility log](https://github.com/m51ai/m51Lab-MiniMax-M2.7-REAP/blob/main/docs/research_log.md).

## Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository ID on the Hugging Face Hub
model_id = "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B"

# Load BF16 weights and shard them across the available devices
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
```

**Recommended generation parameters**: `temperature=1.0, top_p=0.95, top_k=40`.

For consumer hardware (96 GB Apple Silicon, multi-GPU rigs), use the [GGUF quantizations](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF).

## Reproducibility

- REAP pin: `CerebrasResearch/reap@2b114e71` with patches for `MiniMaxM2ForCausalLM` registration (`src/reap/model_util.py` + `src/reap/observer.py`).
- llama.cpp: post-PR #16831 (MiniMaxM2 arch merged). Built with CUDA, sm_90.
- transformers: pinned to `4.55.0`. Do not upgrade to 5.x (import reorganization breaks REAP).
- Stage timings (8×H200 SXM):
  - Dequant FP8 → BF16: 14 min
  - REAP forward+save (Stage 1): 9 h (4 097 samples @ 41 s/sample effective)
  - GGUF convert: 20 min
  - imatrix calibration: 3 h 03 min (488 chunks × 2 048 tokens)
  - Quantization per variant: 15-45 min (parallel-3)

## Citation

```bibtex
@article{lasby2025reap,
  title   = {REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author  = {Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal = {arXiv preprint arXiv:2510.13999},
  year    = {2025}
}

@misc{minimax_m2_7,
  title  = {MiniMax-M2.7},
  author = {MiniMax AI},
  year   = {2026},
  url    = {https://huggingface.co/MiniMaxAI/MiniMax-M2.7}
}
```

## Acknowledgements

- **Cerebras Research** for the [REAP repository](https://github.com/CerebrasResearch/reap) and prior MiniMax M2/M2.1/M2.5 REAP releases that informed this work.
- **MiniMax AI** for the base MiniMax-M2.7 model.
- **ubergarm** and **Unsloth** for MiniMax-M2.7 GGUF conventions and per-tensor recipes that informed our MoE-aware quant variant.

## License

Inherits the [Modified MIT License](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) from MiniMaxAI/MiniMax-M2.7.

---

_Published by [m51Lab](https://m51.ai): open-source LLM contributions from the M51 AI OS group._