---
license: mit
base_model:
- Qwen/Qwen3-8B
---

We adapted the official speculative sampling training method, Eagle3, for training on Qwen3-30B-A3B.

With Eagle3 enabled, the inference throughput of Qwen3-8B under the SGLang framework on 8*H200 GPUs improved from 183 tokens/s to 325 tokens/s, a TPS (tokens per second) improvement of roughly 78%. On a single H200, throughput rose from 147 tokens/s to 231 tokens/s, an improvement of about 57%.

| Model | GPU | TPS (tokens/s) |
|---------|---------|---------|
| qwen3-8b | h200 | 147 |
| qwen3-8b-eagle3 | h200 | 231 |
| qwen3-8b | 8*h200 | 183 |
| qwen3-8b-eagle3 | 8*h200 | 325 |

To use Eagle3 with SGLang, first replace the `qwen3_moe.py` file in SGLang’s model directory (`sglang/python/sglang/srt/models/`) with the `qwen3_moe.py` file from this project.

Then launch the SGLang server with Eagle3 enabled:

```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-30B-A3B \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path Tengyunw/qwen3_30b_moe_eagle3 \
  --speculative-num-steps 6 \
  --speculative-eagle-topk 10 \
  --speculative-num-draft-tokens 32 \
  --mem-fraction 0.9 \
  --cuda-graph-max-bs 2 \
  --dtype bfloat16
```

Our test data is in the `eagle_data.jsonl` file in this directory.
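
The speedups implied by the benchmark table can be checked with a short calculation. The sketch below is illustrative only: the dictionary layout and the `speedup` helper are names chosen for this example, not part of this project, and the numbers are copied from the table.

```python
# Throughput figures (tokens/s) copied from the benchmark table.
# The dictionary layout and helper name are illustrative assumptions.
BENCHMARKS = {
    "h200": {"baseline": 147, "eagle3": 231},
    "8*h200": {"baseline": 183, "eagle3": 325},
}


def speedup(baseline_tps, eagle3_tps):
    """Return (speedup ratio, percent improvement) for two throughput values."""
    ratio = eagle3_tps / baseline_tps
    return ratio, (ratio - 1.0) * 100.0


for gpu, tps in BENCHMARKS.items():
    ratio, pct = speedup(tps["baseline"], tps["eagle3"])
    print(f"{gpu}: {ratio:.2f}x ({pct:.0f}% more tokens/s)")
# h200: 1.57x (57% more tokens/s)
# 8*h200: 1.78x (78% more tokens/s)
```

Reporting both the ratio and the percentage keeps the two common ways of quoting a speedup from being confused with each other.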