| --- |
| license: mit |
| base_model: |
| - Qwen/Qwen3-8B |
| --- |
| |
| ## Introduce |
| We adapted the official speculative sampling training method, Eagle3, for training on Qwen3-30B-A3B |
|
|
| After implementing Eagle3, the inference performance of Qwen3-30B-Moe using the SGLang framework on 8*H200 GPU improved from 183 tokens/s to 325 tokens/s. |
| |
| The TPS (tokens per second) improvement reached nearly 70%. |
| |
| On a single RTX 5090, the TPS (transactions per second) of Qwen3-8B-Eagle3 increased from 164 to 268. |
| |
| |
| | model | gpu | tps | |
| |---------|---------|---------| |
| | qwen3-30b_moe | h200 | 147 | |
| | qwen3-30b-moe_eagle3 | h200 | 231 | |
| | qwen3-30b_moe | 8*h200 | 183 | |
| | qwen3-30b_moe-eagle3 | 8*h200 | 325 | |
| | qwen3-30b_moe | 8*5090 | 164 | |
| | qwen3-30b_moe-eagle3 | 8*5090 | 268 | |
|
|
| Join our AI computing power cloud platform now and enjoy the best AI cloud service experience. The link is as follows: https://tenyunn.com/ |
| ## How to use |
|
|
| To use Eagle3 with SGLang, first replace the qwen3_moe.py file in SGLang’s directory (sglang/python/sglang/srt/models/) with the qwen3_moe.py file from this project. |
|
|
|
|
| The launch command for using Eagle3 with SGLang is: |
|
|
| ```python3 |
| |
| python3 -m sglang.launch_server --model Qwen/Qwen3-30B-A3B --speculative-algorithm EAGLE3 --speculative-draft-model-path Tengyunw/qwen3_30b_moe_eagle3 --speculative-num-steps 6 --speculative-eagle-topk 10 --speculative-num-draft-tokens 32 --mem-fraction 0.9 --cuda-graph-max-bs 2 --dtype bfloat16 |
| |
| ``` |
|
|
| ## How to train |
|
|
| Training Dataset: |
| ultrachat_200k. |
| Only the prompts from these datasets were utilized for data synthesis. This synthesized data is used to train the Eagle modules. |
| |
| dataset nums: 600K samples,1B tokens |
| |
| Evaluation Dataset: |
| ShareGPT,GSM8K,HUAMEVAL,MT-BENCH,APLCA |
| |
| Our Sharegpt test data is located in the eagle_data.jsonl file under this directory. |