Instructions to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker

```shell
docker model run hf.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
```
- SGLang
How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

- Docker Model Runner
How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B with Docker Model Runner:
```shell
docker model run hf.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
```
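The curl calls shown for vLLM and SGLang both target an OpenAI-compatible `/v1/chat/completions` endpoint, so the same request can be issued from Python with only the standard library. The following is a minimal sketch rather than an official client: the helper names `build_chat_request` and `chat` are hypothetical, the base URL and port depend on which server you started, and the response is assumed to follow the usual OpenAI shape (`choices[0].message.content`).

```python
import json
import urllib.request

MODEL_ID = "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B"


def build_chat_request(base_url, prompt, model=MODEL_ID, **sampling):
    """Build (but do not send) a POST request for /v1/chat/completions.

    Hypothetical helper: extra keyword arguments such as temperature,
    top_p or max_tokens are copied into the JSON payload unchanged.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    payload.update(sampling)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, prompt, timeout=600, **sampling):
    """Send the request and return the first choice's message text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    request = build_chat_request(base_url, prompt, **sampling)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call against a local vLLM server (port 8000 in the commands above):
# print(chat("http://localhost:8000", "What is the capital of France?"))
```

Keeping request construction separate from sending makes the payload easy to inspect before any tokens are generated, which is useful when the model is expensive to serve.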
v1.1 final eval: 83.3% on-completed / 54.9% strict
README.md CHANGED
```diff
@@ -18,10 +18,6 @@ pipeline_tag: text-generation
 
 **First publicly available REAP-40% pruned variant of MiniMax-M2.7**, released by m51Lab on 2026-04-15.
 
-> ### 🔄 Benchmark evaluation refresh in progress
->
-> Inference quality is validated by a 5 / 5 pre-publish smoke test. Final HumanEval and sanity numbers will be added once the current evaluation run completes.
-
 ---
 
 MiniMax-M2.7 is a 229B-parameter Mixture-of-Experts LLM released 2026-04-12. This variant uses [REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999) to prune 40 % of experts per MoE block, reducing total parameters to ~139 B while preserving ~10 B active parameters per token. The result runs comfortably on systems the full model cannot reach (notably 96 GB Apple Silicon via GGUF).
@@ -65,7 +61,17 @@ This mix mirrors Cerebras's public MiniMax-M2 / M2.1 / M2.5 REAP releases.
 
 ## Evaluation
 
-
+**HumanEval pass@1 (on completed): 83.3 %** (90 / 108)
+
+For problems where the model completed its `<think>` reasoning within a 32 K-token generation budget, this variant (REAP-40 % pruned + Q4_K_M) solved 90 of 108 correctly — a strong quality signal for a 4-bit quantized, structurally pruned MoE.
+
+**Strict pass@1 (all 164 problems, cap-outs counted as fails): 54.9 %**
+
+56 of 164 problems exhausted the 32 K reasoning budget mid-`<think>` and are counted as fails under strict academic scoring. This is the production-deployment score if you constrain generation to 32 K tokens; allocate **≥64 K tokens to approach the 83 % ceiling**.
+
+**Methodology**: 2 × H100 80 GB, llama.cpp `/v1/chat/completions`, native `<think>` enabled, `temperature=0.2`, `top_p=0.95`, `max_tokens=32000`. No post-processing beyond HumanEval's canonical grading.
+
+*For continuity with prior quant comparisons*: an earlier evaluation using raw `/v1/completions` + chat-prose stripping (non-canonical for reasoning models, bypasses `<think>`) reported 65.2 % (107 / 164). The numbers above use the canonical chat-completion path.
 
 ### Smoke test (pre-publish, 5 diverse prompts)
 
@@ -77,11 +83,7 @@ Final HumanEval and sanity numbers will be added when the current benchmark run
 | 4 | MoE semantic explanation | PASS |
 | 5 | JSON tool-call echo | PASS |
 
-
-
-### Deploying on 96 GB Apple Silicon
-
-The GGUF variants in the [companion repo](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF) are the practical choice for 96 GB Mac Studio / M4 Max. That card contains an explicit memory & context sizing guide — **note that at long context, KV cache quantization (`--cache-type-k q8_0`) is essential for this architecture** (~0.25 GB of FP16 KV cache per 1K tokens across 62 layers).
+5 / 5 PASS. Confirms out-of-box inference quality.
 
 ## Known minor imperfection
 
```
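The two headline figures in the Evaluation hunk differ only in their denominator, and the relationship is easy to check from the counts the README gives (90 solved, 56 cap-outs, 164 problems). The sketch below is a worked arithmetic check, not the evaluation harness; the function and variable names are illustrative.

```python
def pass_at_1(solved, attempted):
    """Single-sample pass rate as a percentage."""
    return 100.0 * solved / attempted


TOTAL_PROBLEMS = 164   # full HumanEval set
CAPPED_OUT = 56        # exhausted the 32 K budget mid-<think>
SOLVED = 90            # correct among the completed problems

# Problems that finished reasoning inside the budget.
completed = TOTAL_PROBLEMS - CAPPED_OUT

# "On completed" ignores cap-outs; "strict" counts them as failures.
on_completed = pass_at_1(SOLVED, completed)
strict = pass_at_1(SOLVED, TOTAL_PROBLEMS)

# Earlier non-canonical run quoted in the README (107 / 164).
earlier = pass_at_1(107, TOTAL_PROBLEMS)

print(f"completed problems: {completed}")
print(f"pass@1 on completed: {on_completed:.1f} %")
print(f"strict pass@1:       {strict:.1f} %")
print(f"earlier /v1/completions run: {earlier:.1f} %")
```

The gap between the two scores comes entirely from the 56 cap-outs, which is why the README suggests a larger generation budget rather than pointing at the pruning or quantization.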
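The "Deploying on 96 GB Apple Silicon" paragraph removed in this commit quotes roughly 0.25 GB of FP16 KV cache per 1K tokens and recommends `--cache-type-k q8_0`. A rough sizing sketch under stated assumptions follows: "1K" is read as 1000 tokens, K and V are assumed to each take half of the FP16 figure, and q8_0 is taken to store 32 values in 34 bytes (32 int8 values plus one FP16 scale). The function name and the `v_ratio` what-if are illustrative and do not come from the card.

```python
FP16_GB_PER_1K_TOKENS = 0.25   # figure quoted in the removed paragraph

# Assumed q8_0 block layout: 32 int8 values + one FP16 scale = 34 bytes,
# versus 64 bytes for the same 32 values in FP16.
Q8_0_RATIO = 34 / 64


def kv_cache_gb(context_tokens, k_ratio=1.0, v_ratio=1.0):
    """Estimate KV cache size in GB for a given context length.

    k_ratio / v_ratio scale the K and V halves relative to FP16
    (1.0 means FP16, Q8_0_RATIO means q8_0). Assumes K and V are
    the same size, which is an assumption of this sketch.
    """
    fp16_total = FP16_GB_PER_1K_TOKENS * context_tokens / 1000
    return fp16_total * (k_ratio + v_ratio) / 2


for ctx in (8_000, 32_000, 128_000):
    fp16 = kv_cache_gb(ctx)
    k_q8 = kv_cache_gb(ctx, k_ratio=Q8_0_RATIO)
    print(f"{ctx:>7} tokens: FP16 {fp16:6.2f} GB, K as q8_0 {k_q8:6.2f} GB")
```

Under these assumptions a 128 K-token context needs about 32 GB for an FP16 cache alone, which illustrates why the removed paragraph treats cache quantization as essential on a 96 GB machine.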