---
library_name: transformers
license: apache-2.0
language:
  - en
base_model: MiniMaxAI/MiniMax-M2.5
pipeline_tag: text-generation
tags:
  - eagle3
  - speculative-decoding
  - sglang
  - draft-model
  - moe
  - mixture-of-experts
---

# EAGLE3 Draft Head — MiniMax-M2.5

A lightweight EAGLE3 draft head for [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) (229B MoE, ~10B active parameters). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.

**Blog post**: [2x Faster on a 229B MoE: EAGLE3 Speculative Decoding for MiniMax-M2.5](https://huggingface.co/blog/lujangusface/tw-eagle3-minimax)

## Usage

### SGLang (GPU)

Requires our [SGLang fork](https://github.com/tails-mpt/sglang), which adds Eagle3 support for MiniMax-M2.5 and the FP8 dtype fixes it needs.

**B=1 server** (wide tree — optimal for single-user, real-time requests):

```bash
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2.5 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 8 \
    --speculative-eagle-topk 4 \
    --quantization fp8 \
    --tp 4 \
    --port 30000
```

**B=32 server** (narrow tree — optimal for batch workloads):

```bash
python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2.5 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
    --speculative-num-steps 5 \
    --speculative-num-draft-tokens 6 \
    --speculative-eagle-topk 1 \
    --quantization fp8 \
    --tp 4 \
    --port 30002
```

**Important**: Use different speculative configs for B=1 vs B=32. A wider tree (topk=4) exploits idle GPU compute at low batch; a narrow tree (topk=1) minimizes MoE expert dispatch overhead at high batch.
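
The two launch commands differ only in three speculative flags and the port. As a minimal sketch, the choice can be made programmatically. The helper name `build_launch_command`, the profile labels, and the `batch_size <= 1` cutoff are assumptions for illustration (this card only reports configs for B=1 and B=32); the flag values are copied from the commands above.

```python
import shlex

# Speculative-decoding settings copied from the two launch commands above.
SPEC_CONFIGS = {
    "wide": {"num_steps": 3, "num_draft_tokens": 8, "eagle_topk": 4},    # B=1
    "narrow": {"num_steps": 5, "num_draft_tokens": 6, "eagle_topk": 1},  # B=32
}


def build_launch_command(batch_size, port=30000):
    """Return the sglang.launch_server argv for the given expected batch size.

    Assumption: anything above batch size 1 uses the narrow (B=32) profile.
    Intermediate batch sizes were not benchmarked in this card.
    """
    profile = "wide" if batch_size <= 1 else "narrow"
    spec = SPEC_CONFIGS[profile]
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", "MiniMaxAI/MiniMax-M2.5",
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", "thoughtworks/MiniMax-M2.5-Eagle3",
        "--speculative-num-steps", str(spec["num_steps"]),
        "--speculative-num-draft-tokens", str(spec["num_draft_tokens"]),
        "--speculative-eagle-topk", str(spec["eagle_topk"]),
        "--quantization", "fp8",
        "--tp", "4",
        "--port", str(port),
    ]


# Print a copy-pasteable shell command for the single-user profile.
print(shlex.join(build_launch_command(batch_size=1)))
```

Passing the returned list to `subprocess.run` would start the server; printing it first makes it easy to confirm the flags match the commands above.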

### Python Client

```python
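import requests

# Port 30000 is the B=1 server launched above; use 30002 for the B=32 server.
response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,  # greedy decoding, where the draft head performs best
    },
    timeout=300,
)
response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
print(response.json()["choices"][0]["message"]["content"])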
```
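
To get a rough client-side throughput number for a single request, the same endpoint can be timed. This is a sketch using only the standard library; it assumes the server's OpenAI-compatible response includes a `usage.completion_tokens` field, and the helper names (`build_payload`, `tokens_per_second`, `measure`) are hypothetical. The result includes prefill and HTTP overhead, so it will not match server-side metrics exactly.

```python
import json
import time
import urllib.request


def build_payload(prompt, max_tokens=512):
    """Build the same chat-completions payload as the client example above."""
    return {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def tokens_per_second(response_json, elapsed_s):
    """Generated tokens divided by wall-clock time for one request."""
    # Assumption: the response carries an OpenAI-style "usage" block.
    return response_json["usage"]["completion_tokens"] / elapsed_s


def measure(prompt, url="http://localhost:30000/v1/chat/completions"):
    """Send one request and return client-side tokens per second."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=300) as resp:
        response_json = json.load(resp)
    elapsed_s = time.perf_counter() - start
    return tokens_per_second(response_json, elapsed_s)
```

With a server running, `measure("Write a Python function to merge two sorted lists.")` returns the measured rate for that request.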

## Training Details

| Parameter | Value |
|-----------|-------|
| Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 141GB (TP=4, DP=2) |
| Dataset | 20K regenerated samples (target-model responses at temp=0.8) |
| Pre-training | 9 epochs on 54K mixed data (ShareGPT 45% / UltraChat 35% / PerfectBlend 20%) |
| Fine-tuning | 6 epochs on 20K regenerated data |
| Learning rate | 2e-5 (final stage) |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 2048 |
| TTT length (training-time test steps) | 7 |
| Precision | bfloat16 |

### Training Method

EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 1, 30, 58 — early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.

## Performance

### Training Accuracy (base checkpoint, before regenerated data fine-tuning)

| Position | Accuracy |
|----------|----------|
| acc_0 | 0.820 |
| acc_1 | 0.809 |
| acc_2 | 0.781 |
| acc_3 | 0.789 |
| acc_4 | 0.777 |
| acc_5 | 0.761 |
| acc_6 | 0.730 |

*The released model was fine-tuned for 6 additional epochs on 20K regenerated samples from the target model. The fine-tuned accuracy is expected to be equal to or higher than these base values.*
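
As a back-of-the-envelope reading of the table above, the per-position accuracies can be turned into a rough expected number of accepted draft tokens for a single chain (the `topk=1` case). This sketch assumes a draft token is accepted only if all earlier positions were accepted, and that positions are independent with the listed accuracies. Both are simplifying assumptions, so the result is an estimate, not a measured acceptance length.

```python
from itertools import accumulate
from operator import mul

# Training accuracies acc_0 .. acc_6 from the table above (base checkpoint).
ACC = [0.820, 0.809, 0.781, 0.789, 0.777, 0.761, 0.730]


def expected_accepted_tokens(acc, num_steps):
    """Estimate accepted draft tokens for a chain of `num_steps` drafts.

    The probability that position k is accepted is modeled as the product
    acc[0] * ... * acc[k]; the expectation is the sum of those products.
    """
    survival = accumulate(acc[:num_steps], mul)
    return sum(survival)


for steps in (3, 5, 7):
    estimate = expected_accepted_tokens(ACC, steps)
    print(f"steps={steps}: ~{estimate:.2f} accepted draft tokens per round")
```

The diminishing gain from each extra step is one reason longer chains do not help indefinitely: every additional draft position is accepted less often than the one before it.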

### Inference Benchmarks (B=1, temp=0, TP=4)

**With draft_tokens=8 (best B=1 config)**:

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| HumanEval | 109.3 | 230.6 | **2.11x** |
| MT-Bench | 109.9 | 195.6 | **1.78x** |
| SWEBench-Verified | 109.6 | 191.8 | **1.75x** |
| Aider | 109.9 | 186.8 | **1.70x** |

*Config: steps=3, topk=4, draft_tokens=8. 8x H200 (TP=4).*

**With draft_tokens=6 (verified 2026-04-12)**:

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| HumanEval | 109.6 | 177.0 | **1.61x** |
| Terminal-Bench | 108.9 | 160.8 | **1.48x** |
| MT-Bench | 109.0 | 146.8 | **1.35x** |
| SWEBench-Verified | 109.1 | 123.1 | **1.13x** |

*Config: steps=3, topk=4, draft_tokens=6. 4x H200 (TP=4). Server-side Prometheus metrics.*
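
The speedup column is the EAGLE3 throughput divided by the baseline throughput. A short sketch that recomputes it from the numbers in the two tables above (the dictionary names are just labels for this example):

```python
# (baseline tok/s, EAGLE3 tok/s) pairs copied from the tables above.
RESULTS_DRAFT_TOKENS_8 = {
    "HumanEval": (109.3, 230.6),
    "MT-Bench": (109.9, 195.6),
    "SWEBench-Verified": (109.6, 191.8),
    "Aider": (109.9, 186.8),
}
RESULTS_DRAFT_TOKENS_6 = {
    "HumanEval": (109.6, 177.0),
    "Terminal-Bench": (108.9, 160.8),
    "MT-Bench": (109.0, 146.8),
    "SWEBench-Verified": (109.1, 123.1),
}


def speedup(baseline_tok_s, eagle_tok_s):
    """Throughput ratio of speculative decoding over the baseline."""
    return eagle_tok_s / baseline_tok_s


def speedups(results):
    """Map each dataset to its speedup, rounded to two decimals."""
    return {name: round(speedup(base, eagle), 2) for name, (base, eagle) in results.items()}


for label, table in (("draft_tokens=8", RESULTS_DRAFT_TOKENS_8),
                     ("draft_tokens=6", RESULTS_DRAFT_TOKENS_6)):
    print(label, speedups(table))
```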

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 3072 |
| Num hidden layers | 1 |
| Num attention heads | 24 (8 KV heads) |
| Intermediate size | 8192 |
| Auxiliary layers | [1, 30, 58] |
| Vocab size | 200064 (target) / 32000 (draft) |
| Checkpoint size | ~464 MB |
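
As a sanity check on the checkpoint size, the table values can be turned into a rough parameter count. The layer shapes below are assumptions based on the usual EAGLE-3 Llama-style layout and are not stated in this card: the three auxiliary hidden states are fused by one linear layer, attention takes the token embedding concatenated with the fused state, `head_dim = hidden_size / num_heads`, the MLP has gate, up, and down projections, and the token embedding is shared with the target rather than stored. Norm weights and vocabulary-mapping buffers are ignored.

```python
# Values from the architecture table above.
HIDDEN = 3072
NUM_HEADS = 24
NUM_KV_HEADS = 8
INTERMEDIATE = 8192
DRAFT_VOCAB = 32000
NUM_AUX_LAYERS = 3
BYTES_PER_PARAM = 2  # bfloat16

head_dim = HIDDEN // NUM_HEADS       # assumption: head_dim = hidden / heads
kv_dim = NUM_KV_HEADS * head_dim     # grouped-query attention: 8 KV heads


def estimate_params():
    """Rough weight count under the assumed layout (see lead-in)."""
    fuse = NUM_AUX_LAYERS * HIDDEN * HIDDEN   # 3 aux hidden states -> hidden
    attn_in = 2 * HIDDEN                      # assumed: embedding + fused state
    q_proj = attn_in * HIDDEN
    k_proj = attn_in * kv_dim
    v_proj = attn_in * kv_dim
    o_proj = HIDDEN * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE           # gate, up, down projections
    lm_head = HIDDEN * DRAFT_VOCAB            # draft vocabulary only
    return fuse + q_proj + k_proj + v_proj + o_proj + mlp + lm_head


params = estimate_params()
size_mib = params * BYTES_PER_PARAM / 2**20
print(f"head_dim={head_dim}, queries per KV head={NUM_HEADS // NUM_KV_HEADS}")
print(f"~{params / 1e6:.0f}M parameters, ~{size_mib:.0f} MiB in bfloat16")
```

Under these assumptions the estimate is roughly 243M parameters, which is in line with the listed checkpoint size if that figure is read as MiB. The draft vocabulary head and the MLP account for most of the weights.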

## Limitations

- **TP=4 only.** TP=8 fails because of an FP8 block-size constraint: `intermediate_size / 8 = 192` is not divisible by `block_n=128`, whereas TP=4 gives `384 = 3 × 128`.
- **Temperature sensitivity.** Best performance at temp=0 (greedy). At temp=0.7, B=1 speedup drops to 1.27-1.80x and some B=32 datasets regress below baseline.
- **Coding-focused benchmarks.** Most benchmarks use coding-oriented datasets (HumanEval, SWEBench-Verified, Aider, Terminal-Bench); MT-Bench is the only conversational one. Conversational workloads may show different patterns.
- **SPEC_V2 incompatible.** The overlap scheduler (`SGLANG_ENABLE_SPEC_V2=true`) is not supported — standard (non-overlapped) speculation only.
- **Requires SGLang fork.** Upstream SGLang does not yet include the FP8 dtype patches needed for Eagle3 on this model.
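
The TP constraint in the first bullet is plain divisibility arithmetic, sketched below. The value `1536` is inferred from the `intermediate_size / 8 = 192` figure quoted above, and the function name is hypothetical; this illustrates the arithmetic only and is not the actual check performed inside SGLang.

```python
BLOCK_N = 128                 # FP8 block size quoted in the limitation above
INTERMEDIATE_SIZE = 192 * 8   # inferred from "intermediate_size / 8 = 192"


def tp_is_fp8_compatible(tp, intermediate_size=INTERMEDIATE_SIZE, block_n=BLOCK_N):
    """True if each tensor-parallel shard is a whole number of FP8 blocks."""
    if intermediate_size % tp != 0:
        return False
    shard = intermediate_size // tp
    return shard % block_n == 0


for tp in (4, 8):
    shard = INTERMEDIATE_SIZE // tp
    status = "ok" if tp_is_fp8_compatible(tp) else "not divisible by block_n"
    print(f"TP={tp}: shard size {shard} -> {status}")
```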

## License

This draft head is released under Apache 2.0, matching the [MiniMax-M2.5 license](https://huggingface.co/MiniMaxAI/MiniMax-M2.5).

## Citation

```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```