Commit 22de1a9 (verified), committed by dervig · 1 Parent(s): e876277

Initial NVFP4 model card

Files changed (1)
  1. README.md +118 -0
README.md ADDED
@@ -0,0 +1,118 @@
+ ---
+ license: other
+ license_name: modified-mit
+ license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
+ base_model: dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
+ base_model_relation: quantized
+ tags:
+ - minimax
+ - moe
+ - reap
+ - nvfp4
+ - fp4
+ - blackwell
+ - compressed-tensors
+ - vllm
+ - text-generation
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4
+
+ **NVFP4** quantization of [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B), the first publicly available REAP-40 % pruned variant of MiniMax-M2.7. It targets NVIDIA Blackwell (B100 / B200) for native FP4 tensor-core acceleration.
+
+ | Aspect | Value |
+ |---|---|
+ | Base model | `dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B` (BF16) |
+ | Quantization | NVFP4A16 (4-bit microscaled floating-point weights, 16-bit activations left unquantized) |
+ | Format | `compressed-tensors` (vLLM-native) |
+ | Tool | [`llmcompressor`](https://github.com/vllm-project/llm-compressor) |
+ | File size | ~75 GB across ~25 safetensors shards |
+ | Ignored layers | `lm_head` (kept in BF16) |
+
+ ## What is NVFP4?
+
+ NVFP4 is NVIDIA's 4-bit floating-point microscaling format introduced with the Blackwell architecture. Each small block of weights shares its own scale factor, which maintains quality at extreme compression, and B100/B200 hardware provides dedicated FP4 tensor cores for the format.
+
+ Compared to INT4 / AWQ quantization, NVFP4 typically preserves quality better at the same weight budget, particularly on reasoning-heavy workloads. The REAP-pruned base model is a good candidate: structural pruning has already reduced the parameter count, and NVFP4 then packs each remaining weight into 4 bits.
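To make the block-wise scaling idea concrete, here is a minimal pure-Python sketch of microscaled 4-bit quantization. It is an illustration under stated assumptions, not the `llmcompressor` implementation: it assumes 16-weight blocks and the E2M1 value grid, and it keeps each block scale as an ordinary Python float, whereas real NVFP4 stores block scales in FP8 alongside a per-tensor scale. The function names are hypothetical.

```python
# Simplified sketch of NVFP4-style block-wise quantization (pure Python).
# Assumptions: 16-weight blocks; scales are plain floats rather than FP8.

# Magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes) such that block[i] is approximately scale * codes[i]."""
    amax = max(abs(w) for w in block)
    # Map the largest magnitude in the block onto the largest E2M1 value.
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for w in block:
        target = abs(w) / scale
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        codes.append(-nearest if w < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from one block's scale and codes."""
    return [scale * c for c in codes]


def quantize(weights):
    """Split a flat list of weights into blocks and quantize each one independently."""
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


if __name__ == "__main__":
    weights = [0.01 * i - 0.12 for i in range(32)]
    restored = [
        w
        for scale, codes in quantize(weights)
        for w in dequantize_block(scale, codes)
    ]
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max abs reconstruction error: {worst:.4f}")
```

Because every block gets its own scale, one outlier only coarsens the grid for the 16 weights it shares a block with, which is the intuition behind the quality claim above.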
+
+ ## Hardware & deployment
+
+ **Native FP4 tensor-core acceleration requires Blackwell (B100 / B200)**. The quantized weights also load and run on Hopper (H100 / H200) and Ampere (A100) by upcasting FP4 to a higher precision, which is functional but slower than native FP4 on Blackwell.
+
+ Memory footprint: ~75 GB weights + KV cache. Recommended:
+ - 1× B100 / B200 (native NVFP4, best performance)
+ - 2× H100 80 GB or 1× H200 141 GB (functional, no native FP4 cores)
+ - Memory-constrained: combine with KV cache quantization (see vLLM docs)
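The footprint figure can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is a rough estimate, not a measurement: it assumes every parameter is quantized to 4 bits with one 8-bit scale per 16-weight block, and it ignores the BF16 `lm_head` and any per-tensor metadata. The helper names are hypothetical.

```python
# Rough weight-footprint estimate for a block-scaled 4-bit checkpoint.
# Assumptions (not stated in the card): 16-weight blocks, one 8-bit scale
# per block, all parameters quantized (the BF16 lm_head is ignored).

TOTAL_PARAMS = 139e9   # "~139 B" total parameters from the base model summary
WEIGHT_BITS = 4
SCALE_BITS = 8         # assumed FP8 block scale
BLOCK_SIZE = 16        # assumed block size


def weight_footprint_gb(params, weight_bits=WEIGHT_BITS,
                        scale_bits=SCALE_BITS, block_size=BLOCK_SIZE):
    """Estimated weight storage in decimal gigabytes."""
    bits_per_weight = weight_bits + scale_bits / block_size
    return params * bits_per_weight / 8 / 1e9


def kv_cache_headroom_gb(gpu_memory_gb, num_gpus, weights_gb):
    """Memory left for KV cache and activations after loading the weights."""
    return gpu_memory_gb * num_gpus - weights_gb


if __name__ == "__main__":
    estimate = weight_footprint_gb(TOTAL_PARAMS)
    print(f"estimated weights: {estimate:.1f} GB")
    # Headroom using the card's own ~75 GB figure.
    print(f"2x H100 80 GB headroom: {kv_cache_headroom_gb(80, 2, 75):.0f} GB")
    print(f"1x H200 141 GB headroom: {kv_cache_headroom_gb(141, 1, 75):.0f} GB")
```

Under these assumptions the estimate comes to roughly 78 GB (about 73 GiB), the same ballpark as the ~75 GB reported above. The exact number depends on which layers are actually quantized.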
+
+ ## Inference
+
+ ### vLLM
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ # Load the quantized checkpoint; vLLM reads the compressed-tensors config from the repo.
+ llm = LLM(
+     model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4",
+     tensor_parallel_size=1,  # one B100 / B200 or H200; use 2 for 2× H100 80 GB
+     trust_remote_code=True,
+     max_model_len=32768,
+ )
+
+ # Sampling settings from "Recommended generation parameters" below.
+ params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, max_tokens=2048)
+ out = llm.generate(["Explain REAP pruning briefly."], params)
+ print(out[0].outputs[0].text)
+ ```
+
+ ### TensorRT-LLM
+
+ Supported via the `compressed-tensors` loader in TensorRT-LLM 0.14+ with the NVFP4 scheme. Consult NVIDIA's deployment guide for Blackwell-specific kernels.
+
+ ## Quality
+
+ Inference quality was validated on the BF16 parent, not on this quantized checkpoint, via a 5 / 5 pre-publish smoke test and a full HumanEval evaluation (see [parent safetensors card](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B)). NVFP4A16 is expected to track FP8 / BF16 quality closely thanks to microscaling: activations stay in 16-bit precision, so only the weights are compressed.
+
+ Systematic NVFP4-on-REAP evaluation is pending; we will update this card if there is community demand.
+
+ ## Base model summary
+
+ | Property | Value |
+ |---|---|
+ | Architecture | MoE, 62 layers, 154 experts (pruned from 256), top-8 routing |
+ | Active parameters / token | ~10 B |
+ | Total parameters | ~139 B |
+ | Max position embeddings | 196,608 |
+ | Vocabulary size | 200,064 |
+ | Pruning | REAP 40 %, seed 42, calibration on 3 × 2,048 samples (code / math / tool) |
+
+ See the [parent safetensors card](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B) for full architecture, pruning details, evaluation numbers, and the known minor layer-0 bias imperfection.
+
+ ## Recommended generation parameters
+
+ - `temperature`: 1.0
+ - `top_p`: 0.95
+ - `top_k`: 40
+ - `repeat_penalty`: 1.05 (llama.cpp naming; vLLM and Transformers call it `repetition_penalty`)
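Since the parameter names above follow llama.cpp conventions, a small translation step can help when passing them to vLLM. The following is a sketch only: `to_vllm_kwargs` and the rename table are hypothetical helpers, and the only assumption about vLLM itself is that `SamplingParams` accepts `temperature`, `top_p`, `top_k`, `repetition_penalty`, and `max_tokens`.

```python
# Sketch: translate the card's recommended settings into vLLM keyword names.
# The helper and rename table are illustrative, not part of any library.

RECOMMENDED = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "repeat_penalty": 1.05,
}

# Only the repetition penalty is named differently between the two runtimes.
VLLM_RENAMES = {"repeat_penalty": "repetition_penalty"}


def to_vllm_kwargs(settings, max_tokens=2048):
    """Return keyword arguments suitable for vllm.SamplingParams(**kwargs)."""
    kwargs = {VLLM_RENAMES.get(name, name): value for name, value in settings.items()}
    kwargs["max_tokens"] = max_tokens
    return kwargs


if __name__ == "__main__":
    print(to_vllm_kwargs(RECOMMENDED))
    # With vLLM installed:
    #   from vllm import SamplingParams
    #   params = SamplingParams(**to_vllm_kwargs(RECOMMENDED))
```

Keeping the recommended values in one dictionary avoids the two runtimes drifting apart if the card is updated later.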
+
+ ## Companion repos
+
+ - **Parent safetensors (BF16)**: [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B)
+ - **GGUF** (Mac / llama.cpp / Ollama / LM Studio): [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF)
+ - **FP8** (Hopper-native): [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8)
+ - **AWQ-4bit** (vLLM / HF Transformers INT4): coming soon
+
+ ## Citation
+
+ See the [safetensors repo](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B#citation) for full citations. Core references:
+
+ - Lasby et al., **REAP the Experts** (arXiv:2510.13999)
+ - MiniMax AI, [**MiniMax-M2.7**](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)
+
+ ## License
+
+ Inherits the [Modified MIT License](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) from MiniMaxAI/MiniMax-M2.7.
+
+ ---
+
+ _Published by [m51Lab](https://m51.ai): open-source LLM contributions from the M51 AI OS group._