File size: 1,700 Bytes

9a9281d
 
 
 
 
4b2419b
 
9a9281d
 
286254c
9a9281d
 
 
014c9e9
 
e477c01
 
 
ebe946f
 
 
 
014c9e9
 
4b2419b
e477c01
1577e3d
 
 
9a9281d
 
4f3f2bd
 
9a9281d
 
4f3f2bd
4b2419b
74c01d7
4b2419b
 
 
05579e6
4b2419b

---
license: mit
base_model:
- Qwen/Qwen3-8B
---

## Introduce
We adapted the official speculative sampling training method, Eagle3, for training on Qwen3-30B-A3B

After implementing Eagle3, the inference performance of Qwen3-30B-Moe using the SGLang framework on 8*H200 GPU improved from 183 tokens/s to 325 tokens/s.

The TPS (tokens per second) improvement reached nearly 70%.

On a single RTX 5090, the TPS (transactions per second) of Qwen3-8B-Eagle3 increased from 164 to 268.


| model | gpu | tps |
|---------|---------|---------|
| qwen3-30b_moe   | h200   | 147  |
| qwen3-30b-moe_eagle3   | h200   | 231   |
| qwen3-30b_moe   | 8*h200   | 183   |
| qwen3-30b_moe-eagle3  | 8*h200  | 325  |
| qwen3-30b_moe   | 8*5090   | 164   |
| qwen3-30b_moe-eagle3  | 8*5090  | 268  |
## How to use

To use Eagle3 with SGLang, first replace the qwen3_moe.py file in SGLang’s directory (sglang/python/sglang/srt/models/) with the qwen3_moe.py file from this project.


The launch command for using Eagle3 with SGLang is:

```python3

python3 -m sglang.launch_server --model Qwen/Qwen3-30B-A3B --speculative-algorithm EAGLE3 --speculative-draft-model-path Tengyunw/qwen3_30b_moe_eagle3 --speculative-num-steps 6 --speculative-eagle-topk 10 --speculative-num-draft-tokens 32 --mem-fraction 0.9 --cuda-graph-max-bs 2 --dtype bfloat16

```

## How to train

Training Dataset:
ultrachat_200k.
Only the prompts from these datasets were utilized for data synthesis. This synthesized data is used to train the Eagle modules. 

dataset nums: 600K samples,1B tokens

Evaluation Dataset:
ShareGPT,GSM8K,HUAMEVAL,MT-BENCH,APLCA

Our Sharegpt test data is located in the eagle_data.jsonl file under this directory.