---
library_name: transformers
license: apache-2.0
base_model: swiss-ai/Apertus-8B-Instruct-2509
tags:
- eagle3
- speculative-decoding
- draft-model
- llama
language:
- en
- de
- fr
- it
pipeline_tag: text-generation
---

# EAGLE3-Apertus-8B-Instruct-2509

An [Eagle3](https://arxiv.org/abs/2503.01840) draft model for speculative decoding with [swiss-ai/Apertus-8B-Instruct-2509](https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509).

## Model Description

This is a lightweight draft model trained to accelerate inference of Apertus-8B-Instruct through speculative decoding. Eagle3 uses a single-layer architecture that predicts future tokens by leveraging the target model's hidden states.

| Property | Value |
|----------|-------|
| Architecture | `LlamaForCausalLMEagle3` |
| Hidden Size | 4096 |
| Intermediate Size | 21504 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Layers | 1 |
| Vocab Size | 131,072 |
| Draft Vocab Size | 32,000 |
| Precision | bfloat16 |
| Parameters | ~513M |

## Training Details

- **Framework**: [SpecForge](https://github.com/sgl-project/SpecForge)
- **Target Model**: swiss-ai/Apertus-8B-Instruct-2509
- **Epochs**: 10
- **Batch Size**: 1 per GPU
- **Learning Rate**: 1e-4
- **Max Sequence Length**: 4096
- **Hardware**: 64 GPUs (16 nodes × 4 GPUs)
- **Precision**: bfloat16

### Training Data

The model was trained on ~375k samples of regenerated conversation data. The dataset consists of prompts from:

- [UltraChat](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)
- [ShareGPT](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered)
- [OpenThoughts-114k-math](https://huggingface.co/datasets/open-r1/OpenThoughts-114k-math)

The responses were regenerated using Apertus-8B-Instruct-2509 to ensure the draft model learns from the target model's own output distribution.
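
The training figures above imply a rough optimizer-step count, which can be handy when sizing a follow-up run. The sketch below works it out from the stated sample count, GPU count, per-GPU batch size, and epoch count. It assumes pure data parallelism across all 64 GPUs and no gradient accumulation, neither of which this card states; `GRAD_ACCUM_STEPS` and `steps_per_epoch` are hypothetical names used only for illustration.

```python
import math

# Values taken from the Training Details and Training Data sections.
NUM_SAMPLES = 375_000   # "~375k samples" (approximate)
NUM_GPUS = 64           # 16 nodes x 4 GPUs
BATCH_PER_GPU = 1
NUM_EPOCHS = 10

# Assumption (not stated in the card): no gradient accumulation and
# pure data parallelism, so one optimizer step consumes 64 samples.
GRAD_ACCUM_STEPS = 1


def steps_per_epoch(num_samples, num_gpus, batch_per_gpu, grad_accum=1):
    """Number of optimizer steps needed to see every sample once."""
    global_batch = num_gpus * batch_per_gpu * grad_accum
    return math.ceil(num_samples / global_batch)


per_epoch = steps_per_epoch(NUM_SAMPLES, NUM_GPUS, BATCH_PER_GPU, GRAD_ACCUM_STEPS)
total_steps = per_epoch * NUM_EPOCHS

print(f"steps per epoch: {per_epoch}")
print(f"total steps over {NUM_EPOCHS} epochs: {total_steps}")
```

Under these assumptions the run comes to about 5,860 steps per epoch and 58,600 steps in total. That would be consistent with the example checkpoint directory `epoch_9_step_55000` from the Continue Training section having been saved during the last epoch, if epochs are zero-indexed, though the card does not confirm how steps are counted.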
See: [thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data](https://huggingface.co/datasets/thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data)

## Usage

### With vLLM

```bash
VLLM_USE_V1=1 vllm serve swiss-ai/Apertus-8B-Instruct-2509 \
    --speculative-config '{"model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509", "num_speculative_tokens": 3, "method": "eagle3"}'
```

Or in Python:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    speculative_config={
        "model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509",
        "num_speculative_tokens": 3,
        "method": "eagle3",
    },
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)
```

### With SGLang

```bash
python -m sglang.launch_server \
    --model swiss-ai/Apertus-8B-Instruct-2509 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 32
```

## Continue Training

To resume training from this checkpoint:

1. Clone [SpecForge](https://github.com/sgl-project/SpecForge)
2. Download the training dataset from [thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data](https://huggingface.co/datasets/thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data)
3. Download this checkpoint and place it in a subdirectory of your output directory (e.g., `outputs/apertus-8b-eagle3/epoch_9_step_55000/`)
4. Run with `--resume` (it will automatically find the last checkpoint in `--output-dir`):

```bash
NUM_GPUS=4
TP_SIZE=1

torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    scripts/train_eagle3.py \
    --target-model-path swiss-ai/Apertus-8B-Instruct-2509 \
    --draft-model-config /path/to/configs/apertus-8b-eagle3.json \
    --train-data-path /path/to/merged_train_regen.jsonl \
    --output-dir /path/to/outputs/apertus-8b-eagle3 \
    --num-epochs 15 \
    --batch-size 1 \
    --tp-size $TP_SIZE \
    --learning-rate 1e-4 \
    --max-length 4096 \
    --chat-template apertus \
    --cache-dir /path/to/cache \
    --target-model-backend sglang \
    --resume
```

The `--resume` flag uses `get_last_checkpoint()` to automatically find the most recent checkpoint in the output directory.

## License

Apache 2.0

## Citation

If you use this model, please cite Eagle3:

```bibtex
@article{li2025eagle3,
  title={Eagle 3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}
```

## Acknowledgments

Trained on the [Alps supercomputer](https://www.cscs.ch/computers/alps) at CSCS (Swiss National Supercomputing Centre).