dervig commited on
Commit
2840308
·
verified ·
1 Parent(s): be72489

v1.1 final eval: 83.3% on-completed / 54.9% strict

Browse files
Files changed (1) hide show
  1. README.md +12 -10
README.md CHANGED
@@ -18,10 +18,6 @@ pipeline_tag: text-generation
18
 
19
  **First publicly available REAP-40% pruned variant of MiniMax-M2.7**, released by m51Lab on 2026-04-15.
20
 
21
- > ### 🔄 Benchmark evaluation refresh in progress
22
- >
23
- > Inference quality is validated by a 5 / 5 pre-publish smoke test. Final HumanEval and sanity numbers will be added once the current evaluation run completes.
24
-
25
  ---
26
 
27
  MiniMax-M2.7 is a 229B-parameter Mixture-of-Experts LLM released 2026-04-12. This variant uses [REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999) to prune 40 % of experts per MoE block, reducing total parameters to ~139 B while preserving ~10 B active parameters per token. The result runs comfortably on systems the full model cannot reach (notably 96 GB Apple Silicon via GGUF).
@@ -65,7 +61,17 @@ This mix mirrors Cerebras's public MiniMax-M2 / M2.1 / M2.5 REAP releases.
65
 
66
  ## Evaluation
67
 
68
- Final HumanEval and sanity numbers will be added when the current benchmark run completes.
 
 
 
 
 
 
 
 
 
 
69
 
70
  ### Smoke test (pre-publish, 5 diverse prompts)
71
 
@@ -77,11 +83,7 @@ Final HumanEval and sanity numbers will be added when the current benchmark run
77
  | 4 | MoE semantic explanation | PASS |
78
  | 5 | JSON tool-call echo | PASS |
79
 
80
- **5 / 5 pass**. The model is fully usable in production.
81
-
82
- ### Deploying on 96 GB Apple Silicon
83
-
84
- The GGUF variants in the [companion repo](https://huggingface.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF) are the practical choice for 96 GB Mac Studio / M4 Max. That card contains an explicit memory & context sizing guide — **note that at long context, KV cache quantization (`--cache-type-k q8_0`) is essential for this architecture** (~0.25 GB of FP16 KV cache per 1K tokens across 62 layers).
85
 
86
  ## Known minor imperfection
87
 
 
18
 
19
  **First publicly available REAP-40% pruned variant of MiniMax-M2.7**, released by m51Lab on 2026-04-15.
20
 
 
 
 
 
21
  ---
22
 
23
  MiniMax-M2.7 is a 229B-parameter Mixture-of-Experts LLM released 2026-04-12. This variant uses [REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999) to prune 40 % of experts per MoE block, reducing total parameters to ~139 B while preserving ~10 B active parameters per token. The result runs comfortably on systems the full model cannot reach (notably 96 GB Apple Silicon via GGUF).
 
61
 
62
  ## Evaluation
63
 
64
+ **HumanEval pass@1 (on completed): 83.3 %** (90 / 108)
65
+
66
+ For problems where the model completed its `<think>` reasoning within a 32 K-token generation budget, this variant (REAP-40 % pruned + Q4_K_M) solved 90 of 108 correctly — a strong quality signal for a 4-bit quantized, structurally pruned MoE.
67
+
68
+ **Strict pass@1 (all 164 problems, cap-outs counted as fails): 54.9 %**
69
+
70
+ 56 of 164 problems exhausted the 32 K reasoning budget mid-`<think>` and are counted as fails under strict academic scoring. This is the production-deployment score if you constrain generation to 32 K tokens; allocate **≥64 K tokens to approach the 83 % ceiling**.
71
+
72
+ **Methodology**: 2 × H100 80 GB, llama.cpp `/v1/chat/completions`, native `<think>` enabled, `temperature=0.2`, `top_p=0.95`, `max_tokens=32000`. No post-processing beyond HumanEval's canonical grading.
73
+
74
+ *For continuity with prior quant comparisons*: an earlier evaluation using raw `/v1/completions` + chat-prose stripping (non-canonical for reasoning models, bypasses `<think>`) reported 65.2 % (107 / 164). The numbers above use the canonical chat-completion path.
75
 
76
  ### Smoke test (pre-publish, 5 diverse prompts)
77
 
 
83
  | 4 | MoE semantic explanation | PASS |
84
  | 5 | JSON tool-call echo | PASS |
85
 
86
+ 5 / 5 PASS. Confirms out-of-box inference quality.
 
 
 
 
87
 
88
  ## Known minor imperfection
89