---
license: other
license_name: modified-mit
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
base_model: dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
base_model_relation: quantized
tags:
  - minimax
  - moe
  - reap
  - nvfp4
  - fp4
  - blackwell
  - compressed-tensors
  - vllm
  - text-generation
library_name: transformers
pipeline_tag: text-generation
---

# m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4

## `v1.1.1` — router-gate quantization fix (2026-04-16)

**What happened:** The initial upload (2026-04-15) used `ignore=["lm_head"]` in the llm-compressor recipe, which meant the 62 MoE routers (`block_sparse_moe.gate`) got quantized along with the expert weights. vLLM's MiniMax-M2 loader expects an unquantized `ReplicatedLinear` router and fails at engine-init with:

```
KeyError: 'layers.0.block_sparse_moe.gate.weight_scale'       # FP8
KeyError: 'layers.0.block_sparse_moe.gate.input_global_scale' # NVFP4
```

This is a hard load failure — the engine never initializes, so no tokens are generated. (The earlier "degraded output" framing understated the severity.)

**Root cause:** Missing MoE-aware entries in the llm-compressor ignore list. The correct pattern (per `saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10`):

```python
ignore = [
    "lm_head",
    "model.embed_tokens",
    r"re:.*block_sparse_moe\.gate$",
]
```
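
As a quick sanity check, the sketch below applies the router pattern to a few module names. It assumes the `re:` prefix marks a regular expression that is matched against the module name from its start; the module names themselves are hypothetical, modeled on the keys in the error messages above.

```python
import re

# Pattern copied from the ignore list above, without the "re:" prefix.
GATE_PATTERN = re.compile(r".*block_sparse_moe\.gate$")


def is_ignored_router(module_name: str) -> bool:
    """Return True if the module name matches the router-gate pattern."""
    return GATE_PATTERN.match(module_name) is not None


# Hypothetical module names, for illustration only.
names = [
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.61.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.0.w1",
    "model.layers.0.self_attn.q_proj",
]

ignored = [name for name in names if is_ignored_router(name)]
print(ignored)
```

The trailing `$` is what keeps the pattern from also catching expert projections whose names merely start with `gate`.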

**Fix:** This variant was re-rolled 2026-04-16 with the corrected recipe. `quantization_config.ignore` now lists all 62 per-layer router gates alongside `lm_head`.

**Verification:** `config.json` on this repo now contains 62 `model.layers.N.block_sparse_moe.gate` entries in the ignore list. Loaders should open the model without the KeyError above.
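
The check described above can be scripted. This is a minimal sketch that assumes the ignore list lives under `quantization_config.ignore` in `config.json`, as stated in the fix note; the helper names are illustrative, and the demo uses a synthetic config rather than the real file.

```python
import json
import re

# Matches entries of the form model.layers.N.block_sparse_moe.gate
GATE_ENTRY = re.compile(r"^model\.layers\.(\d+)\.block_sparse_moe\.gate$")


def load_config(path: str) -> dict:
    """Read a downloaded config.json from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_ignored_gates(config: dict) -> int:
    """Count the distinct layers whose router gate is in the ignore list."""
    ignore = config.get("quantization_config", {}).get("ignore", [])
    layers = set()
    for entry in ignore:
        match = GATE_ENTRY.match(entry)
        if match:
            layers.add(int(match.group(1)))
    return len(layers)


# Synthetic stand-in for the real config.json (illustration only).
demo_config = {
    "quantization_config": {
        "ignore": ["lm_head"]
        + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(62)]
    }
}
print(count_ignored_gates(demo_config))
```

For a local copy of the repo, `count_ignored_gates(load_config("config.json"))` should return 62 if the fixed recipe was used.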

**Credit:** Thanks to the community user who reported this first on the NVFP4-GB10 DGX Spark load. The saricles reference repo was invaluable for confirming the exact pattern.

**Unaffected variants** (no re-roll needed): BF16 safetensors, all GGUF quantizations.

---

**NVFP4** quantization of [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B) — the first publicly available REAP-40 % pruned variant of MiniMax-M2.7 — targeting NVIDIA Blackwell (B100 / B200) for native FP4 tensor-core acceleration.

| Aspect | Value |
|---|---|
| Base model | `dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B` (BF16) |
| Quantization | NVFP4A16 (4-bit microscaled floating point weights, FP16 activations) |
| Format | `compressed-tensors` (vLLM-native) |
| Tool | [`llmcompressor`](https://github.com/vllm-project/llm-compressor) |
| File size | ~75 GB across ~25 safetensors shards |
| Ignored layers | `lm_head` and the 62 `block_sparse_moe.gate` routers (kept in BF16) |

## What is NVFP4?

NVFP4 is NVIDIA's 4-bit floating-point microscaling format introduced with the Blackwell architecture. It uses small block-wise scale factors to maintain quality at extreme compression, and benefits from dedicated FP4 tensor cores on B100/B200 hardware.
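
To make the block-wise scaling concrete, here is a simplified pure-Python sketch of the idea. It assumes 16-element blocks and the E2M1 value grid, and it keeps the scale as an ordinary Python float; the real format stores block scales in FP8 alongside a per-tensor scale, which this sketch does not model.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is a separate bit).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed block size for this sketch


def quantize_block(block):
    """Map one block of weights to FP4 grid values plus a single scale."""
    amax = max(abs(x) for x in block)
    # Scale so the largest magnitude lands on the top of the FP4 grid.
    scale = amax / FP4_VALUES[-1] if amax > 0 else 1.0
    quantized = []
    for x in block:
        target = abs(x) / scale
        magnitude = min(FP4_VALUES, key=lambda v: abs(target - v))
        quantized.append(-magnitude if x < 0 else magnitude)
    return quantized, scale


def dequantize_block(quantized, scale):
    """Recover approximate weights from grid values and the block scale."""
    return [value * scale for value in quantized]


# Toy block of 16 weights in [-0.08, 0.07].
weights = [0.01 * i - 0.08 for i in range(BLOCK_SIZE)]
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Because each block gets its own scale, one outlier weight only coarsens the grid for its 16 neighbours rather than for the whole tensor.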

Compared to INT4 / AWQ quantization, NVFP4 typically preserves quality better at the same weight budget, particularly on reasoning-heavy workloads. Our REAP-pruned base model is an ideal candidate — the structural pruning has already reduced parameter count, and NVFP4 then packs each remaining weight into 4 bits.

## Hardware & deployment

**Native FP4 tensor-core acceleration requires Blackwell (B100 / B200)**. The quantized weights also load and run on Hopper (H100 / H200) and Ampere (A100) via FP4-to-higher-precision upcasting — functional but not at Blackwell speed.

Memory footprint: ~75 GB weights + KV cache. Recommended:
- 1× B100 / B200 (native NVFP4, best performance)
- 2× H100 80 GB or 1× H200 141 GB (functional, no native FP4 cores)
- Memory-constrained: combine with KV cache quantization (see vLLM docs)
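
A back-of-the-envelope estimate of the weight footprint, as a rough sketch only. It assumes every parameter is stored as 4 bits plus one 8-bit scale per 16-weight block, and it ignores the layers kept in BF16 and any per-tensor metadata; the function name and defaults are illustrative.

```python
def estimate_weight_gb(
    n_params: float,
    weight_bits: int = 4,
    scale_bits: int = 8,
    block_size: int = 16,
) -> float:
    """Estimate weight storage in GB (10^9 bytes) for block-scaled weights."""
    # Each weight carries its own bits plus a share of the block scale.
    bits_per_weight = weight_bits + scale_bits / block_size
    return n_params * bits_per_weight / 8 / 1e9


estimate = estimate_weight_gb(139e9)
print(f"{estimate:.1f} GB")
```

With the ~139 B total from this card, the estimate comes out at roughly 78 GB, in the same range as the ~75 GB listed above. The exact figure depends on the true parameter count and on which layers stay unquantized, and KV cache comes on top of it.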

## Inference

### vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4",
    tensor_parallel_size=1,       # 1 for a single Blackwell GPU; set to 2 for 2× H100
    trust_remote_code=True,
    max_model_len=32768,
)

params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, max_tokens=2048)
out = llm.generate(["Explain REAP pruning briefly."], params)
print(out[0].outputs[0].text)
```

### TensorRT-LLM

Supported via the `compressed-tensors` loader in TensorRT-LLM 0.14+ with NVFP4 scheme. Consult NVIDIA's deployment guide for Blackwell-specific kernels.

## Quality

Quality was validated on the BF16 parent, not on this quantization: the parent passed a 5 / 5 pre-publish smoke test and a full HumanEval evaluation (see [parent safetensors card](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B)). NVFP4A16 is expected to track FP8 / BF16 quality closely, because block-wise microscaling limits weight error and activations remain in FP16, so only the weights are compressed.

Systematic NVFP4-on-REAP evaluation is pending; we will update this card if there is community demand.

## Base model summary

| Property | Value |
|---|---|
| Architecture | MoE, 62 layers, 154 experts (pruned from 256), top-8 routing |
| Active parameters / token | ~10 B |
| Total parameters | ~139 B |
| Max position embeddings | 196,608 |
| Vocabulary size | 200,064 |
| Pruning | REAP 40 %, seed 42, calibration on 3 × 2,048 samples (code / math / tool) |

See the [parent safetensors card](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B) for full architecture, pruning details, evaluation numbers, and the known minor layer-0 bias imperfection.
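
The expert counts in the table follow from the pruning ratio. The short sketch below reproduces the arithmetic; rounding to the nearest integer is an assumption on our part, since this card states only the resulting count.

```python
ORIGINAL_EXPERTS = 256  # experts per layer before pruning
PRUNE_RATIO = 0.40      # REAP 40 %
TOP_K = 8               # experts routed per token

# Assumption: the kept count is rounded to the nearest integer.
kept_experts = round(ORIGINAL_EXPERTS * (1 - PRUNE_RATIO))
routed_fraction = TOP_K / kept_experts

print(kept_experts)
print(f"{routed_fraction:.1%} of experts are routed per token")
```

So each token touches about 5 % of the remaining experts in a layer, which is why the active parameter count (~10 B) is so much smaller than the total (~139 B).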

## Recommended generation parameters

- `temperature`: 1.0
- `top_p`: 0.95
- `top_k`: 40
- `repeat_penalty`: 1.05 (llama.cpp naming; the equivalent vLLM `SamplingParams` field is `repetition_penalty`)
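
The names above follow llama.cpp conventions. As a small sketch, the helper below renames them for vLLM's `SamplingParams`, which calls the last setting `repetition_penalty`; the helper name and dictionary are illustrative and not part of any library.

```python
# Recommended values from this card, using the names as listed above.
RECOMMENDED = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "repeat_penalty": 1.05,
}

# Only the penalty needs renaming for vLLM; the other keys pass through.
VLLM_RENAMES = {"repeat_penalty": "repetition_penalty"}


def to_vllm_kwargs(params: dict) -> dict:
    """Return a copy of params with keys renamed for vLLM SamplingParams."""
    return {VLLM_RENAMES.get(key, key): value for key, value in params.items()}


vllm_kwargs = to_vllm_kwargs(RECOMMENDED)
print(vllm_kwargs)
```

The result can be unpacked into the constructor, for example `SamplingParams(max_tokens=2048, **vllm_kwargs)`.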

## Companion repos

- **Parent safetensors (BF16)**: [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B)
- **GGUF** (Mac / llama.cpp / Ollama / LM Studio): [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF)
- **FP8** (Hopper-native): [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8)
- **AWQ-4bit** (vLLM / HF Transformers INT4): coming soon

## Citation

See the [safetensors repo](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B#citation) for full citations. Core references:

- Lasby et al., **REAP the Experts** (arXiv:2510.13999)
- MiniMax AI, [**MiniMax-M2.7**](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)

## License

Inherits the [Modified MIT License](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) from MiniMaxAI/MiniMax-M2.7.

---

_Published by [m51Lab](https://m51.ai) — open-source LLM contributions from the M51 AI OS group._