File size: 1,700 Bytes
9a9281d 4b2419b 9a9281d 286254c 9a9281d 014c9e9 e477c01 ebe946f 014c9e9 4b2419b e477c01 1577e3d 9a9281d 4f3f2bd 9a9281d 4f3f2bd 4b2419b 74c01d7 4b2419b 05579e6 4b2419b | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | ---
license: mit
base_model:
- Qwen/Qwen3-8B
---
## Introduce
We adapted the official speculative sampling training method, Eagle3, for training on Qwen3-30B-A3B
After implementing Eagle3, the inference performance of Qwen3-30B-Moe using the SGLang framework on 8*H200 GPU improved from 183 tokens/s to 325 tokens/s.
The TPS (tokens per second) improvement reached nearly 70%.
On a single RTX 5090, the TPS (transactions per second) of Qwen3-8B-Eagle3 increased from 164 to 268.
| model | gpu | tps |
|---------|---------|---------|
| qwen3-30b_moe | h200 | 147 |
| qwen3-30b-moe_eagle3 | h200 | 231 |
| qwen3-30b_moe | 8*h200 | 183 |
| qwen3-30b_moe-eagle3 | 8*h200 | 325 |
| qwen3-30b_moe | 8*5090 | 164 |
| qwen3-30b_moe-eagle3 | 8*5090 | 268 |
## How to use
To use Eagle3 with SGLang, first replace the qwen3_moe.py file in SGLang’s directory (sglang/python/sglang/srt/models/) with the qwen3_moe.py file from this project.
The launch command for using Eagle3 with SGLang is:
```python3
python3 -m sglang.launch_server --model Qwen/Qwen3-30B-A3B --speculative-algorithm EAGLE3 --speculative-draft-model-path Tengyunw/qwen3_30b_moe_eagle3 --speculative-num-steps 6 --speculative-eagle-topk 10 --speculative-num-draft-tokens 32 --mem-fraction 0.9 --cuda-graph-max-bs 2 --dtype bfloat16
```
## How to train
Training Dataset:
ultrachat_200k.
Only the prompts from these datasets were utilized for data synthesis. This synthesized data is used to train the Eagle modules.
dataset nums: 600K samples,1B tokens
Evaluation Dataset:
ShareGPT,GSM8K,HUAMEVAL,MT-BENCH,APLCA
Our Sharegpt test data is located in the eagle_data.jsonl file under this directory. |