dervig committed on Apr 15

Commit

22de1a9

verified

1 Parent(s): e876277

Initial NVFP4 model card

Files changed (1)

README.md +118 -0

README.md ADDED

	@@ -0,0 +1,118 @@

+---
+license: other
+license_name: modified-mit
+license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
+base_model: dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
+base_model_relation: quantized
+tags:
+  - minimax
+  - moe
+  - reap
+  - nvfp4
+  - fp4
+  - blackwell
+  - compressed-tensors
+  - vllm
+  - text-generation
+library_name: transformers
+pipeline_tag: text-generation
+---
+# m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4
+**NVFP4** quantization of [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B) — the first publicly available REAP-40 % pruned variant of MiniMax-M2.7 — targeting NVIDIA Blackwell (B100 / B200) for native FP4 tensor-core acceleration.
+| Aspect | Value |
+|---|---|
+| Base model | `dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B` (BF16) |
+| Quantization | NVFP4A16 (4-bit microscaled floating point weights, FP16 activations) |
+| Format | `compressed-tensors` (vLLM-native) |
+| Tool | [`llmcompressor`](https://github.com/vllm-project/llm-compressor) |
+| File size | ~75 GB across ~25 safetensors shards |
+| Ignored layers | `lm_head` (kept in BF16) |
+## What is NVFP4?
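+NVFP4 is NVIDIA's 4-bit floating-point microscaling format, introduced with the Blackwell architecture. Each weight is stored as a 4-bit E2M1 value, and every block of 16 weights shares an FP8 (E4M3) scale factor on top of a per-tensor FP32 scale. These fine-grained scales help maintain quality at extreme compression, and B100/B200 hardware provides dedicated FP4 tensor cores for the format.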
+Compared to INT4 / AWQ quantization, NVFP4 typically preserves quality better at the same weight budget, particularly on reasoning-heavy workloads. It also stacks with REAP: structural pruning has already reduced the parameter count of the base model, and NVFP4 then packs each remaining weight into 4 bits.
+## Hardware & deployment
+**Native FP4 tensor-core acceleration requires Blackwell (B100 / B200)**. The quantized weights also load and run on Hopper (H100 / H200) and Ampere (A100) via FP4-to-higher-precision upcasting — functional but not at Blackwell speed.
+Memory footprint: ~75 GB weights + KV cache. Recommended:
+- 1× B100 / B200 (native NVFP4, best performance)
+- 2× H100 80 GB or 1× H200 141 GB (functional, no native FP4 cores)
+- Memory-constrained: combine with KV cache quantization (see vLLM docs)
+## Inference
+### vLLM
+```python
+from vllm import LLM, SamplingParams
+llm = LLM(
+    model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4",
+    tensor_parallel_size=1,       # 1 for a single B100/B200 or H200; set to 2 for 2× H100
+    trust_remote_code=True,
+    max_model_len=32768,
+)
+params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, max_tokens=2048)
+out = llm.generate(["Explain REAP pruning briefly."], params)
+print(out[0].outputs[0].text)
+```
+### TensorRT-LLM
+Supported via the `compressed-tensors` loader in TensorRT-LLM 0.14+ with the NVFP4 scheme. Consult NVIDIA's deployment guide for Blackwell-specific kernels.
+## Quality
+Inference quality was validated on the BF16 parent, not on this quantized checkpoint: the parent passed a 5 / 5 pre-publish smoke test and a full HumanEval evaluation (see [parent safetensors card](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B)). NVFP4A16 is expected to track FP8 / BF16 quality very closely, because microscaling limits weight error and activations remain in FP16, so only the weights are compressed.
+Systematic NVFP4-on-REAP evaluation is pending; we will update this card if there is community demand.
+## Base model summary
+| Property | Value |
+|---|---|
+| Architecture | MoE, 62 layers, 154 experts (pruned from 256), top-8 routing |
+| Active parameters / token | ~10 B |
+| Total parameters | ~139 B |
+| Max position embeddings | 196,608 |
+| Vocabulary size | 200,064 |
+| Pruning | REAP 40 %, seed 42, calibration on 3 × 2,048 samples (code / math / tool) |
+See the [parent safetensors card](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B) for full architecture, pruning details, evaluation numbers, and the known minor layer-0 bias imperfection.
+## Recommended generation parameters
+- `temperature`: 1.0
+- `top_p`: 0.95
+- `top_k`: 40
+- `repeat_penalty`: 1.05 (llama.cpp naming; the equivalent vLLM `SamplingParams` field is `repetition_penalty`)
+## Companion repos
+- **Parent safetensors (BF16)**: [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B)
+- **GGUF** (Mac / llama.cpp / Ollama / LM Studio): [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF)
+- **FP8** (Hopper-native): [`dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8`](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8)
+- **AWQ-4bit** (vLLM / HF Transformers INT4): coming soon
+## Citation
+See the [safetensors repo](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B#citation) for full citations. Core references:
+- Lasby et al., **REAP the Experts** (arXiv:2510.13999)
+- MiniMax AI, [**MiniMax-M2.7**](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)
+## License
+Inherits the [Modified MIT License](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) from MiniMaxAI/MiniMax-M2.7.
+---
+_Published by [m51Lab](https://m51.ai) — open-source LLM contributions from the M51 AI OS group._
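
The memory figures in the README's "Hardware & deployment" section can be sanity-checked with a rough estimate. The sketch below assumes every one of the ~139 B parameters is stored as a 4-bit value and that each block of 16 weights shares one 8-bit scale factor; it ignores the per-tensor scales, the BF16 `lm_head`, and any other unquantized tensors. The helper names `weight_bytes` and `kv_budget_gb` are hypothetical and exist only for this illustration.

```python
# Rough weight-memory estimate for a block-scaled 4-bit checkpoint.
# Assumptions (not taken from the model card): all parameters are quantized,
# and each block of 16 weights carries one 8-bit scale factor.

TOTAL_PARAMS = 139e9   # "~139 B" total parameters, from the card
BITS_PER_WEIGHT = 4    # FP4 payload per weight
SCALE_BITS = 8         # one FP8 scale per block (assumed layout)
BLOCK_SIZE = 16        # weights sharing one scale (assumed layout)


def weight_bytes(params, bits=BITS_PER_WEIGHT, scale_bits=SCALE_BITS, block=BLOCK_SIZE):
    """Approximate size in bytes of block-scaled low-bit weights."""
    effective_bits = bits + scale_bits / block  # 4 + 8/16 = 4.5 bits per weight
    return params * effective_bits / 8


def kv_budget_gb(total_memory_gb, weights_gb):
    """Memory left for KV cache and activations once the weights are loaded."""
    return total_memory_gb - weights_gb


weights_gb = weight_bytes(TOTAL_PARAMS) / 1e9
print(f"estimated weights: {weights_gb:.1f} GB")  # estimated weights: 78.2 GB

# Hopper configurations listed in the card: (label, total device memory in GB)
for label, mem_gb in [("2x H100 80 GB", 160), ("1x H200 141 GB", 141)]:
    left = kv_budget_gb(mem_gb, weights_gb)
    print(f"{label}: roughly {left:.0f} GB left for KV cache and activations")
```

The estimate lands in the same ballpark as the ~75 GB the card reports. It also suggests why a single 80 GB H100 is not among the recommended configurations: the weights alone would nearly fill it, before any KV cache.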