---
library_name: transformers
license: apache-2.0
language:
- en
base_model: MiniMaxAI/MiniMax-M2.5
pipeline_tag: text-generation
tags:
- eagle3
- speculative-decoding
- sglang
- draft-model
- moe
- mixture-of-experts
---

# EAGLE3 Draft Head: MiniMax-M2.5

A lightweight EAGLE3 draft head for [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) (229B MoE, ~10B active parameters). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.

**Blog post**: [2x Faster on a 229B MoE: EAGLE3 Speculative Decoding for MiniMax-M2.5](https://huggingface.co/blog/lujangusface/tw-eagle3-minimax)

## Usage

### SGLang (GPU)

Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for MiniMax-M2.5 Eagle3 support + FP8 dtype fixes.

**B=1 server** (wide tree, optimal for single-user, real-time requests):

```bash
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2.5 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 8 \
    --speculative-eagle-topk 4 \
    --quantization fp8 \
    --tp 4 \
    --port 30000
```

**B=32 server** (narrow tree, optimal for batch workloads):

```bash
python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2.5 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
    --speculative-num-steps 5 \
    --speculative-num-draft-tokens 6 \
    --speculative-eagle-topk 1 \
    --quantization fp8 \
    --tp 4 \
    --port 30002
```

**Important**: Use different speculative configs for B=1 vs B=32. A wider tree (topk=4) exploits idle GPU compute at low batch; a narrow tree (topk=1) minimizes MoE expert dispatch overhead at high batch.

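The two configurations above differ only in three speculative flags and the port, so a deployment script can keep them in one table and pick a profile by workload. The following is a minimal sketch, not part of SGLang or this repository: the profile names `low_batch` and `high_batch`, the `SPEC_PROFILES` table, and the `build_launch_command` helper are hypothetical, while the flag names and values are copied from the launch commands above.

```python
import shlex

# Speculative settings copied from the two launch commands above.
# The profile names are hypothetical labels chosen for this sketch.
SPEC_PROFILES = {
    "low_batch": {"num_steps": 3, "num_draft_tokens": 8, "eagle_topk": 4},   # B=1
    "high_batch": {"num_steps": 5, "num_draft_tokens": 6, "eagle_topk": 1},  # B=32
}


def build_launch_command(profile: str, port: int = 30000, tp: int = 4) -> list[str]:
    """Return the sglang.launch_server argv for a named profile (hypothetical helper)."""
    spec = SPEC_PROFILES[profile]  # raises KeyError for an unknown profile
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", "MiniMaxAI/MiniMax-M2.5",
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", "thoughtworks/MiniMax-M2.5-Eagle3",
        "--speculative-num-steps", str(spec["num_steps"]),
        "--speculative-num-draft-tokens", str(spec["num_draft_tokens"]),
        "--speculative-eagle-topk", str(spec["eagle_topk"]),
        "--quantization", "fp8",
        "--tp", str(tp),
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command for the batch-oriented server.
    print(shlex.join(build_launch_command("high_batch", port=30002)))
```

The returned list can be passed directly to `subprocess.Popen`; keeping the values in one table makes it harder to mix, for example, the B=1 tree width with the B=32 step count.
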
### Python Client

```python
import requests

# Query the B=1 server (port 30000) through its chat completions endpoint.
response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,  # greedy decoding gives the best speculative speedup
    },
    timeout=300,
)
response.raise_for_status()  # surface HTTP errors instead of failing on a missing key
print(response.json()["choices"][0]["message"]["content"])
```

## Training Details

| Parameter | Value |
|-----------|-------|
| Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 144GB (TP=4, DP=2) |
| Dataset | 20K regenerated samples (target-model responses at temp=0.8) |
| Pre-training | 9 epochs on 54K mixed data (ShareGPT 45% / UltraChat 35% / PerfectBlend 20%) |
| Fine-tuning | 6 epochs on 20K regenerated data |
| Learning rate | 2e-5 (final stage) |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 2048 |
| TTT length (training-time test steps) | 7 |
| Precision | bfloat16 |

### Training Method

EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 1, 30, and 58: early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.

## Performance

### Training Accuracy (base checkpoint, before regenerated data fine-tuning)

| Position | Accuracy |
|----------|----------|
| acc_0 | 0.820 |
| acc_1 | 0.809 |
| acc_2 | 0.781 |
| acc_3 | 0.789 |
| acc_4 | 0.777 |
| acc_5 | 0.761 |
| acc_6 | 0.730 |

*The released model was fine-tuned for 6 additional epochs on 20K regenerated samples from the target model.
The fine-tuned accuracy is expected to be equal to or higher than these base values.*

### Inference Benchmarks (B=1, temp=0, TP=4)

**With draft_tokens=8 (best B=1 config)**:

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| HumanEval | 109.3 | 230.6 | **2.11x** |
| MT-Bench | 109.9 | 195.6 | **1.78x** |
| SWEBench-Verified | 109.6 | 191.8 | **1.75x** |
| Aider | 109.9 | 186.8 | **1.70x** |

*Config: steps=3, topk=4, draft_tokens=8. 8x H200 (TP=4).*

**With draft_tokens=6 (verified 2026-04-12)**:

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| HumanEval | 109.6 | 177.0 | **1.61x** |
| Terminal-Bench | 108.9 | 160.8 | **1.48x** |
| MT-Bench | 109.0 | 146.8 | **1.35x** |
| SWEBench-Verified | 109.1 | 123.1 | **1.13x** |

*Config: steps=3, topk=4, draft_tokens=6. 4x H200 (TP=4). Server-side Prometheus metrics.*

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 3072 |
| Num hidden layers | 1 |
| Num attention heads | 24 (8 KV heads) |
| Intermediate size | 8192 |
| Auxiliary layers | [1, 30, 58] |
| Vocab size | 200064 (target) / 32000 (draft) |
| Checkpoint size | ~464 MB |

## Limitations

- **TP=4 only.** TP=8 fails due to an FP8 block size constraint (`intermediate_size / 8 = 192`, not divisible by `block_n=128`).
- **Temperature sensitivity.** Best performance at temp=0 (greedy). At temp=0.7, B=1 speedup drops to 1.27-1.80x and some B=32 datasets regress below baseline.
- **Coding-focused benchmarks.** Most benchmarks use coding-oriented datasets (HumanEval, SWEBench-Verified, Aider); only MT-Bench is conversational. Conversational workloads may show different patterns.
- **SPEC_V2 incompatible.** The overlap scheduler (`SGLANG_ENABLE_SPEC_V2=true`) is not supported; only standard (non-overlapped) speculation works.
- **Requires SGLang fork.** Upstream SGLang does not yet include the FP8 dtype patches needed for Eagle3 on this model.

## License

This draft head is released under Apache 2.0, matching the [MiniMax-M2.5 license](https://huggingface.co/MiniMaxAI/MiniMax-M2.5).

## Citation

```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```