---
license: other
license_name: modified-mit
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
base_model: dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
base_model_relation: quantized
tags:
- minimax
- moe
- reap
- nvfp4
- fp4
- blackwell
- compressed-tensors
- vllm
- text-generation
library_name: transformers
pipeline_tag: text-generation
---

# m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4

## `v1.1.1`: router-gate quantization fix (2026-04-16)

**What happened:** The initial upload (2026-04-15) used `ignore=["lm_head"]` in the llm-compressor recipe, which meant the 62 MoE routers (`block_sparse_moe.gate`) got quantized along with the expert weights. vLLM's MiniMax-M2 loader expects an unquantized `ReplicatedLinear` router and fails at engine-init with:

```
KeyError: 'layers.0.block_sparse_moe.gate.weight_scale'        # FP8
KeyError: 'layers.0.block_sparse_moe.gate.input_global_scale'  # NVFP4
```

This is a hard load failure: the engine never initializes, so no tokens are generated. (The earlier "degraded output" framing understated the severity.)

**Root cause:** Missing MoE-aware entries in the llm-compressor ignore list. The correct pattern (per `saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10`):

```python
ignore = [
    "lm_head",
    "model.embed_tokens",
    r"re:.*block_sparse_moe\.gate$",
]
```

**Fix:** This variant was re-rolled 2026-04-16 with the corrected recipe. `quantization_config.ignore` now lists all 62 per-layer router gates alongside `lm_head`.

**Verification:** `config.json` on this repo now contains 62 `model.layers.N.block_sparse_moe.gate` entries in the ignore list. Loaders should open the model without the KeyError above.

**Credit:** Thanks to the community user who reported this first on the NVFP4-GB10 DGX Spark load. The saricles reference repo was invaluable for confirming the exact pattern.

**Unaffected variants** (no re-roll needed): BF16 safetensors, all GGUF quantizations.

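The verification step described above can also be checked locally. The following is a minimal sketch, assuming `config.json` has been downloaded from this repo and that the ignore list lives under `quantization_config.ignore` with explicit per-layer names, as stated above. The helper names `router_gates_ignored` and `check_config_file` are illustrative and not part of any library.

```python
import json
import re

# Per-layer router entry, using the naming given in this card.
GATE_RE = re.compile(r"^model\.layers\.(\d+)\.block_sparse_moe\.gate$")


def router_gates_ignored(config: dict, num_layers: int = 62) -> bool:
    """Return True if every layer's router gate appears in the ignore list.

    Hypothetical helper: it only recognizes explicit per-layer names,
    not regex-style entries such as "re:.*block_sparse_moe\\.gate$".
    """
    ignore = config.get("quantization_config", {}).get("ignore", [])
    layers = set()
    for name in ignore:
        match = GATE_RE.match(name)
        if match:
            layers.add(int(match.group(1)))
    # Every layer index from 0 to num_layers - 1 must be present.
    return layers == set(range(num_layers))


def check_config_file(path: str) -> bool:
    """Load a local config.json (the path is an assumption) and run the check."""
    with open(path, encoding="utf-8") as f:
        return router_gates_ignored(json.load(f))
```

Calling `check_config_file("config.json")` on a copy of the original 2026-04-15 upload would be expected to return `False`, since its ignore list contained only `lm_head`.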

---

**NVFP4** quantization of [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B), the first publicly available REAP-40 % pruned variant of MiniMax-M2.7, targeting NVIDIA Blackwell (B100 / B200) for native FP4 tensor-core acceleration.

| Aspect | Value |
|---|---|
| Base model | `dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B` (BF16) |
| Quantization | NVFP4A16 (4-bit microscaled floating-point weights, 16-bit activations) |
| Format | `compressed-tensors` (vLLM-native) |
| Tool | [`llmcompressor`](https://github.com/vllm-project/llm-compressor) |
| File size | ~75 GB across ~25 safetensors shards |
| Ignored layers | `lm_head` and the 62 `block_sparse_moe.gate` routers (kept in BF16) |

## What is NVFP4?

NVFP4 is NVIDIA's 4-bit floating-point microscaling format introduced with the Blackwell architecture. It attaches a scale factor to each small block of weights (16 values per block) to maintain quality at extreme compression, and benefits from dedicated FP4 tensor cores on B100/B200 hardware.

Compared to INT4 / AWQ quantization, NVFP4 typically preserves quality better at the same weight budget, particularly on reasoning-heavy workloads. Our REAP-pruned base model is a natural candidate: the structural pruning has already reduced the parameter count, and NVFP4 then packs each remaining weight into 4 bits.

## Hardware & deployment

**Native FP4 tensor-core acceleration requires Blackwell (B100 / B200)**. The quantized weights also load and run on Hopper (H100 / H200) and Ampere (A100) via FP4-to-higher-precision upcasting, which is functional but does not reach Blackwell speed.

Memory footprint: ~75 GB weights + KV cache.
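As a rough cross-check of the weight footprint, the sketch below estimates the packed size from the parameter count. It rests on assumptions not stated in this card: one 8-bit scale per block of 16 four-bit values, and all ~139 B parameters quantized (in practice `lm_head`, the embeddings and the router gates stay at higher precision, and per-tensor scales add a little more). The function name and defaults are illustrative.

```python
def nvfp4_weight_gb(num_params: float, block_size: int = 16, scale_bits: int = 8) -> float:
    """Estimate packed NVFP4 weight size in decimal gigabytes.

    Each weight costs 4 bits plus its share of the per-block scale.
    """
    bits_per_weight = 4 + scale_bits / block_size  # 4.5 bits with the defaults
    return num_params * bits_per_weight / 8 / 1e9


# Rounded total parameter count taken from this card (~139 B).
estimate = nvfp4_weight_gb(139e9)
print(f"Estimated weight size: {estimate:.1f} GB")
```

Under these assumptions the estimate comes out at about 78 GB, in the same range as the ~75 GB listed above. Differences of a few GB are expected from the rounded parameter count and from GB versus GiB reporting.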

Recommended hardware:
- 1× B100 / B200 (native NVFP4, best performance)
- 2× H100 80 GB or 1× H200 141 GB (functional, no native FP4 cores)
- Memory-constrained: combine with KV cache quantization (see vLLM docs)

## Inference

### vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4",
    tensor_parallel_size=1,  # single Blackwell or H200; set to 2 for 2× H100
    trust_remote_code=True,
    max_model_len=32768,
)

params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, max_tokens=2048)
out = llm.generate(["Explain REAP pruning briefly."], params)
print(out[0].outputs[0].text)
```

### TensorRT-LLM

Supported via the `compressed-tensors` loader in TensorRT-LLM 0.14+ with NVFP4 scheme. Consult NVIDIA's deployment guide for Blackwell-specific kernels.

## Quality

Inference quality was validated on the BF16 parent via a 5 / 5 pre-publish smoke test and full HumanEval evaluation (see [parent safetensors card](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B)). NVFP4A16 is expected to track FP8 / BF16 quality closely: microscaling limits weight quantization error, and activations stay in 16-bit precision, so only the weights are compressed. This expectation has not been measured yet. Systematic NVFP4-on-REAP evaluation is pending; we will update this card if there is community demand.

## Base model summary

| Property | Value |
|---|---|
| Architecture | MoE, 62 layers, 154 experts (pruned from 256), top-8 routing |
| Active parameters / token | ~10 B |
| Total parameters | ~139 B |
| Max position embeddings | 196,608 |
| Vocabulary size | 200,064 |
| Pruning | REAP 40 %, seed 42, calibration on 3 × 2,048 samples (code / math / tool) |

See the [parent safetensors card](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B) for full architecture, pruning details, evaluation numbers, and the known minor layer-0 bias imperfection.
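The expert counts in the table are consistent with the stated 40 % pruning ratio, as a few lines of arithmetic confirm. This is only a sanity check on the numbers given in this card; the variable names are illustrative.

```python
experts_before = 256  # experts per layer before pruning (from the table above)
experts_after = 154   # experts per layer kept after REAP pruning

# Fraction of experts removed per layer.
pruned_fraction = 1 - experts_after / experts_before

# Keeping 60 % of 256 experts gives 153.6, which rounds to 154.
expected_kept = round(experts_before * (1 - 0.40))

print(f"pruned: {pruned_fraction:.1%}, expected kept at 40 %: {expected_kept}")
```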

## Recommended generation parameters

- `temperature`: 1.0
- `top_p`: 0.95
- `top_k`: 40
- `repeat_penalty`: 1.05 (llama.cpp name; vLLM's `SamplingParams` calls this `repetition_penalty`)

## Companion repos

- **Parent safetensors (BF16)**: [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B)
- **GGUF** (Mac / llama.cpp / Ollama / LM Studio): [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF)
- **FP8** (Hopper-native): [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8)
- **AWQ-4bit** (vLLM / HF Transformers INT4): coming soon

## Citation

See the [safetensors repo](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B#citation) for full citations. Core references:

- Lasby et al., **REAP the Experts** (arXiv:2510.13999)
- MiniMax AI, [**MiniMax-M2.7**](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)

## License

Inherits the [Modified MIT License](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) from MiniMaxAI/MiniMax-M2.7.

---

_Published by [m51Lab](https://m51.ai): open-source LLM contributions from the M51 AI OS group._