jianchen0311 committed · Commit e0f8b48 · verified · 1 Parent(s): 62145be

Create README.md

Files changed (1)
  1. README.md +143 -0
README.md ADDED
@@ -0,0 +1,143 @@
+ ---
+ license: mit
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - dflash
+ - speculative-decoding
+ - block-diffusion
+ - draft-model
+ - efficiency
+ - qwen
+ - gemma
+ - diffusion-language-model
+ ---
+
+ # gemma-4-26B-A4B-it-DFlash
+
+ [**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
+
+ **DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model to draft multiple tokens in parallel. This is the drafter model, which must be paired with [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it).
+
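For orientation, the draft-then-verify loop behind speculative decoding can be sketched with toy stand-ins. This is a generic greedy-acceptance sketch, not DFlash's implementation: `draft_block` and `target_next` are hypothetical callables invented for illustration, and a real engine would verify the whole drafted block in one batched target forward pass instead of a Python loop.

```python
from typing import Callable, List

Tokens = List[int]


def speculative_step(
    context: Tokens,
    draft_block: Callable[[Tokens, int], Tokens],  # hypothetical: proposes k tokens at once
    target_next: Callable[[Tokens], int],          # hypothetical: target's greedy next token
    k: int,
) -> Tokens:
    """Return the tokens committed by one draft-then-verify step."""
    proposal = draft_block(context, k)
    committed: Tokens = []
    for token in proposal:
        expected = target_next(context + committed)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest of the draft.
            committed.append(expected)
            return committed
        committed.append(token)
    # Every drafted token matched, so the target contributes one extra token.
    committed.append(target_next(context + committed))
    return committed


# Toy models (assumptions): the target counts upward; the draft is right twice, then wrong.
def toy_target(ctx: Tokens) -> int:
    return ctx[-1] + 1


def toy_draft(ctx: Tokens, k: int) -> Tokens:
    return [ctx[-1] + i + 1 if i < 2 else -1 for i in range(k)]


print(speculative_step([5], toy_draft, toy_target, k=4))
```

Because every committed token is one the target itself would have produced greedily, the output matches plain autoregressive decoding; the drafter only changes how many tokens are committed per target step.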
+ <div align="center">
+ <img src="assets/dflash_system.png" alt="DFlash Architecture" width="85%">
+ </div>
+
+ ## Quick Start
+
+ ### Installation
+
+ vLLM (Gemma 4 DFlash inference currently requires installing vLLM from this [PR](https://github.com/vllm-project/vllm/pull/41703)):
+ ```bash
+ uv pip install vllm
+ uv pip install -U --torch-backend=auto "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/41703/head"
+ ```
+
+ SGLang:
+ ```bash
+ uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/23000/head#subdirectory=python"
+ ```
+
+ ### Launch Server
+
+ vLLM:
+ ```bash
+ vllm serve google/gemma-4-26B-A4B-it \
+   --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 15, "attention_backend": "flash_attn"}' \
+   --attention-backend triton_attn \
+   --max-num-batched-tokens 32768 \
+   --trust-remote-code
+ ```
+
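The inline JSON in `--speculative-config` is easy to break with shell quoting. As a minimal sketch, the same vLLM command can be assembled in Python so the JSON is generated instead of hand-typed. The flags and values are copied from the command above; building the command in Python is only an illustration, not something the project prescribes.

```python
import json
import shlex

# Values copied from the vLLM launch command above.
speculative_config = {
    "method": "dflash",
    "model": "z-lab/gemma-4-26B-A4B-it-DFlash",
    "num_speculative_tokens": 15,
    "attention_backend": "flash_attn",
}

command = [
    "vllm", "serve", "google/gemma-4-26B-A4B-it",
    "--speculative-config", json.dumps(speculative_config),
    "--attention-backend", "triton_attn",
    "--max-num-batched-tokens", "32768",
    "--trust-remote-code",
]

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(command))
```

The list form can also be passed directly to `subprocess.run(command)`, which avoids shell quoting altogether.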
+ SGLang:
+ ```bash
+ # Optional: enable schedule overlapping (experimental, may not be stable)
+ # export SGLANG_ENABLE_SPEC_V2=1
+ # export SGLANG_ENABLE_DFLASH_SPEC_V2=1
+ # export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+
+ python -m sglang.launch_server \
+   --model-path google/gemma-4-26B-A4B-it \
+   --speculative-algorithm DFLASH \
+   --speculative-draft-model-path z-lab/gemma-4-26B-A4B-it-DFlash \
+   --speculative-num-draft-tokens 16 \
+   --tp-size 1 \
+   --attention-backend triton \
+   --speculative-draft-attention-backend fa4 \
+   --trust-remote-code
+ ```
+
+ ### Usage
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")  # SGLang default port; vLLM defaults to 8000
+
+ response = client.chat.completions.create(
+     model="google/gemma-4-26B-A4B-it",
+     messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+     max_tokens=4096,
+     temperature=0.0,
+     extra_body={"chat_template_kwargs": {"enable_thinking": True}},
+ )
+ print(response.choices[0].message.content)
+ ```
+
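To get a rough feel for decoding speed on your own setup, a single request can be timed. The helper below is a sketch, not the benchmark harness behind the published numbers: `measure_throughput` is a hypothetical name, it assumes the server fills in `usage.completion_tokens` in the response, and the wall-clock time includes prefill and network overhead.

```python
import time


def measure_throughput(client, model, prompt, max_tokens=4096):
    """Time one chat completion; return (completion_tokens, seconds, tokens_per_second)."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,
    )
    # Guard against a zero reading from the timer on very fast (mocked) calls.
    elapsed = max(time.perf_counter() - start, 1e-9)
    tokens = response.usage.completion_tokens
    return tokens, elapsed, tokens / elapsed
```

With the `client` from the usage snippet above, `measure_throughput(client, "google/gemma-4-26B-A4B-it", "Write a quicksort in Python.")` returns the token count, elapsed seconds, and tokens per second for that one request. Running it against a server launched with and without the speculative flags gives a quick single-request comparison.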
+ ## Benchmark Results
+
+ **Setup:** Single NVIDIA B300, vLLM, thinking enabled, max output length 4096.
+
+ ### Throughput and Speedup
+
+ At concurrency 1, DFlash averages a **2.9x** speedup across the five tasks and reaches **3.6x** on Math500.
+
+ _Tokens/sec (speedup vs. autoregressive baseline)_
+
+ **Block Size = 16**
+ | Task | Concurrency | AR | **DFlash** |
+ |---|---:|---:|---:|
+ | Math500 | 1 | 259 | **925 (3.6x)** |
+ | | 8 | 1296 | **4837 (3.7x)** |
+ | | 32 | 3233 | **11435 (3.5x)** |
+ | GSM8K | 1 | 256 | **825 (3.2x)** |
+ | | 8 | 1217 | **4241 (3.5x)** |
+ | | 32 | 3174 | **10306 (3.2x)** |
+ | HumanEval | 1 | 246 | **818 (3.3x)** |
+ | | 8 | 1182 | **4240 (3.6x)** |
+ | | 32 | 2881 | **9150 (3.2x)** |
+ | MBPP | 1 | 272 | **698 (2.6x)** |
+ | | 8 | 1288 | **3387 (2.6x)** |
+ | | 32 | 2950 | **7898 (2.7x)** |
+ | MT-Bench | 1 | 272 | **492 (1.8x)** |
+ | | 8 | 1146 | **2259 (2.0x)** |
+ | | 32 | 2164 | **4829 (2.2x)** |
+
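The speedup figures in the table are the ratio of DFlash to AR throughput. The short script below recomputes them from the tokens/sec values copied from the table; the dictionary layout and variable names are just one convenient arrangement.

```python
from statistics import mean

# (task, concurrency) -> (AR tokens/sec, DFlash tokens/sec), copied from the table above.
throughput = {
    ("Math500", 1): (259, 925), ("Math500", 8): (1296, 4837), ("Math500", 32): (3233, 11435),
    ("GSM8K", 1): (256, 825), ("GSM8K", 8): (1217, 4241), ("GSM8K", 32): (3174, 10306),
    ("HumanEval", 1): (246, 818), ("HumanEval", 8): (1182, 4240), ("HumanEval", 32): (2881, 9150),
    ("MBPP", 1): (272, 698), ("MBPP", 8): (1288, 3387), ("MBPP", 32): (2950, 7898),
    ("MT-Bench", 1): (272, 492), ("MT-Bench", 8): (1146, 2259), ("MT-Bench", 32): (2164, 4829),
}

# Speedup = DFlash throughput divided by the autoregressive baseline.
speedup = {key: dflash / ar for key, (ar, dflash) in throughput.items()}

# Summary over the concurrency-1 rows only.
c1 = [value for (task, concurrency), value in speedup.items() if concurrency == 1]
mean_c1 = mean(c1)
best_c1 = max(c1)

for (task, concurrency), value in speedup.items():
    print(f"{task:<10} c{concurrency:<3} {value:.1f}x")
print(f"concurrency 1: mean {mean_c1:.1f}x, best {best_c1:.1f}x")
```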
+
+ ### Acceptance Length
+
+ | Task | c1 | c8 | c32 |
+ |---|---:|---:|---:|
+ | Math500 | 8.61 | 8.55 | 8.60 |
+ | GSM8K | 7.71 | 7.76 | 7.72 |
+ | HumanEval | 7.80 | 7.87 | 7.83 |
+ | MBPP | 6.09 | 5.99 | 6.03 |
+ | MT-Bench | 4.33 | 4.33 | 4.24 |
+
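One way to read these numbers is as target-model steps saved. The sketch below assumes acceptance length means the average number of tokens committed per target verification step (the usual definition; the README does not spell it out) and estimates how many verification steps a 4096-token response would need. The function name is hypothetical, and the estimate ignores drafter cost, so it is not a speedup prediction.

```python
import math

# Concurrency-1 acceptance lengths copied from the table above.
acceptance_length_c1 = {
    "Math500": 8.61,
    "GSM8K": 7.71,
    "HumanEval": 7.80,
    "MBPP": 6.09,
    "MT-Bench": 4.33,
}


def estimated_verify_steps(output_tokens: int, tokens_per_step: float) -> int:
    """Rough number of target verification steps needed to emit output_tokens."""
    return math.ceil(output_tokens / tokens_per_step)


for task, length in acceptance_length_c1.items():
    steps = estimated_verify_steps(4096, length)
    # Plain autoregressive decoding would need one target step per token (4096 here).
    print(f"{task:<10} ~{steps} verification steps instead of 4096")
```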
+
+ ## Acknowledgements
+
+ Special thanks to [David Wang](https://davidwa.ng/) for his outstanding engineering support on this project. We are also grateful to [Modal](https://modal.com/), [InnoMatrix](https://innomatrix.ai), and [Yotta Labs](https://www.yottalabs.ai/) for providing the compute resources used to train this draft model.
+
+ ## Citation
+
+ If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: [DFlash Feedback](https://forms.gle/4YNwfqb4nJdqn6hq9).
+
+ ```bibtex
+ @article{chen2026dflash,
+   title = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
+   author = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
+   journal = {arXiv preprint arXiv:2602.06036},
+   year = {2026}
+ }
+ ```