Commit 88db466 by UnstableLlama (verified) · 1 parent: ac378ac

Update README.md

Files changed (1): README.md (+224 -174)
@@ -1,179 +1,229 @@
  ---
  license: mit
- library_name: transformers
- pipeline_tag: text-generation
  tags:
- - dflash
- - speculative-decoding
- - block-diffusion
- - draft-model
- - efficiency
- - qwen
- - diffusion-language-model
  ---
 
- # Qwen3.6-35B-A3B-DFlash
-
- [**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
-
- **DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model to draft multiple tokens in parallel. This is the drafter model, which must be paired with [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B).
-
- <div align="center">
- <img src="assets/dflash_system.png" alt="DFlash Architecture" width="85%">
- </div>
-
- ## Quick Start
-
- ### Installation
-
- vLLM (We temporarily modify the installation through this PR to support interleaved SWA and ensure correct handling of target hidden states for optimal performance):
- ```bash
- uv pip install vllm
- uv pip install -U --torch-backend=auto "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/40898/head"
- ```
-
- SGLang:
- ```bash
- uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
- ```
-
- ### Launch Server
-
- vLLM:
- ```bash
- vllm serve Qwen/Qwen3.6-35B-A3B \
- --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}' \
- --attention-backend flash_attn \
- --max-num-batched-tokens 32768
- ```
-
- SGLang:
- ```bash
- # Optional: enable schedule overlapping (experimental, may not be stable)
- # export SGLANG_ENABLE_SPEC_V2=1
- # export SGLANG_ENABLE_DFLASH_SPEC_V2=1
- # export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
-
- python -m sglang.launch_server \
- --model-path Qwen/Qwen3.6-35B-A3B \
- --speculative-algorithm DFLASH \
- --speculative-draft-model-path z-lab/Qwen3.6-35B-A3B-DFlash \
- --speculative-num-draft-tokens 16 \
- --tp-size 1 \
- --attention-backend fa3 \
- --mem-fraction-static 0.75 \
- --mamba-scheduler-strategy extra_buffer \
- --trust-remote-code
- ```
- > **Tip:** For long-context or agentic workloads, add `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the drafter.
-
- ### Usage
-
- ```python
- from openai import OpenAI
-
- client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
-
- response = client.chat.completions.create(
- model="Qwen/Qwen3.6-35B-A3B",
- messages=[{"role": "user", "content": "Write a quicksort in Python."}],
- max_tokens=4096,
- temperature=0.0
- )
- print(response.choices[0].message.content)
- ```
-
- ## Benchmark Results
-
- **Setup:** Single NVIDIA B200, SGLang, thinking enabled, max output length 4096. We report end-to-end throughput, including prefill time. See our [GitHub repository](https://github.com/z-lab/dflash) for reproduction scripts.
-
- ### Throughput and Speedup
-
- DFlash achieves up to **2.9x** speedup at concurrency 1.
-
- _Tokens/sec (speedup vs. autoregressive baseline)_
-
- **Block Size = 16**
- | Task | Concurrency | AR | **DFlash** |
- |---|---:|---:|---:|
- | Math500 | 1 | 234 | **682 (2.9x)** |
- | | 8 | 1266 | **3138 (2.5x)** |
- | | 16 | 1954 | **4813 (2.5x)** |
- | | 32 | 2755 | **6520 (2.4x)** |
- | GSM8K | 1 | 235 | **556 (2.4x)** |
- | | 8 | 1236 | **2564 (2.1x)** |
- | | 16 | 1886 | **3821 (2.0x)** |
- | | 32 | 2699 | **5239 (1.9x)** |
- | HumanEval | 1 | 238 | **603 (2.5x)** |
- | | 8 | 1255 | **2800 (2.2x)** |
- | | 16 | 1944 | **4208 (2.2x)** |
- | | 32 | 2767 | **5782 (2.1x)** |
- | MBPP | 1 | 235 | **559 (2.4x)** |
- | | 8 | 1224 | **2538 (2.1x)** |
- | | 16 | 1948 | **3816 (2.0x)** |
- | | 32 | 2780 | **5378 (1.9x)** |
- | MT-Bench | 1 | 233 | **442 (1.9x)** |
- | | 8 | 1238 | **2028 (1.6x)** |
- | | 16 | 1885 | **2997 (1.6x)** |
- | | 32 | 2633 | **4034 (1.5x)** |
- | Alpaca | 1 | 235 | **393 (1.7x)** |
- | | 8 | 1221 | **1782 (1.5x)** |
- | | 16 | 1844 | **2567 (1.4x)** |
- | | 32 | 2579 | **3689 (1.4x)** |
-
- **Block Size = 8**
- | Task | Concurrency | AR | **DFlash** |
- |---|---:|---:|---:|
- | Math500 | 1 | 234 | **617 (2.6x)** |
- | | 8 | 1266 | **2839 (2.2x)** |
- | | 16 | 1954 | **4465 (2.3x)** |
- | | 32 | 2755 | **6614 (2.4x)** |
- | GSM8K | 1 | 235 | **540 (2.3x)** |
- | | 8 | 1236 | **2466 (2.0x)** |
- | | 16 | 1886 | **3899 (2.1x)** |
- | | 32 | 2699 | **5713 (2.1x)** |
- | HumanEval | 1 | 238 | **561 (2.4x)** |
- | | 8 | 1255 | **2655 (2.1x)** |
- | | 16 | 1944 | **4135 (2.1x)** |
- | | 32 | 2767 | **6059 (2.2x)** |
- | MBPP | 1 | 235 | **497 (2.1x)** |
- | | 8 | 1224 | **2324 (1.9x)** |
- | | 16 | 1948 | **3636 (1.9x)** |
- | | 32 | 2780 | **4884 (1.8x)** |
- | MT-Bench | 1 | 233 | **438 (1.9x)** |
- | | 8 | 1238 | **2060 (1.7x)** |
- | | 16 | 1885 | **3182 (1.7x)** |
- | | 32 | 2633 | **4720 (1.8x)** |
- | Alpaca | 1 | 235 | **407 (1.7x)** |
- | | 8 | 1221 | **1880 (1.5x)** |
- | | 16 | 1844 | **2903 (1.6x)** |
- | | 32 | 2579 | **4115 (1.6x)** |
-
- ### Acceptance Length
-
- | Task | B8 | B16 |
- |---|---:|---:|
- | Math500 | 5.56 | 7.35 |
- | GSM8K | 5.21 | 6.73 |
- | HumanEval | 5.09 | 6.44 |
- | MBPP | 4.78 | 5.83 |
- | MT-Bench | 4.20 | 5.14 |
- | Alpaca | 3.94 | 4.62 |
-
-
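The speedup figures in the removed benchmark tables appear to be the DFlash throughput divided by the autoregressive (AR) throughput, rounded to one decimal place. The following is a minimal sketch of that arithmetic, assuming this reading of the tables is correct; the function name `speedup` and the selection of rows are illustrative and not part of the original README.

```python
# Sketch: recompute the "(N.Nx)" speedup shown next to each DFlash figure.
# Assumption: speedup = DFlash tokens/sec divided by AR tokens/sec,
# rounded to one decimal place.

def speedup(ar_tps: float, dflash_tps: float) -> float:
    """Return the DFlash/AR throughput ratio rounded to one decimal."""
    if ar_tps <= 0:
        raise ValueError("AR throughput must be positive")
    return round(dflash_tps / ar_tps, 1)


# A few (task, concurrency) rows copied from the Block Size = 16 table:
# each value is (AR tokens/sec, DFlash tokens/sec).
ROWS_B16 = {
    ("Math500", 1): (234, 682),
    ("GSM8K", 1): (235, 556),
    ("Alpaca", 32): (2579, 3689),
}

for (task, concurrency), (ar, dflash) in ROWS_B16.items():
    print(f"{task} @ concurrency {concurrency}: {speedup(ar, dflash)}x")
```

For the rows above this reproduces the published 2.9x, 2.4x, and 1.4x values, which supports the ratio interpretation.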
- ## Acknowledgements
-
- Special thanks to [David Wang](https://davidwa.ng/) for his outstanding engineering support on this project. We are also grateful to [Modal](https://modal.com/), [InnoMatrix](https://innomatrix.ai), and [Yotta Labs](https://www.yottalabs.ai/) for providing the compute resources used to train this draft model.
-
- ## Citation
-
- If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: [DFlash Feedback](https://forms.gle/4YNwfqb4nJdqn6hq9).
-
- ```bibtex
- @article{chen2026dflash,
- title = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
- author = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
- journal = {arXiv preprint arXiv:2602.06036},
- year = {2026}
- }
- ```
  ---
  license: mit
+ base_model:
+ - z-lab/Qwen3.6-35B-A3B-DFlash
+ base_model_relation: quantized
+ quantized_by: UnstableLlama
  tags:
+ - exl3
  ---
 
+ <style>
+ @import url('https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;700&family=Inter:wght@400;700&display=swap');
+
+ .dashboard-container {
+ font-family: 'Inter', sans-serif;
+ width: min(1500px, calc(100vw - 32px));
+ max-width: 100%;
+ margin: 0 auto;
+ box-sizing: border-box;
+ background-color: #1a1b1e;
+ background-image: radial-gradient(#2d2f34 1px, transparent 1px);
+ background-size: 20px 20px;
+ color: #e0e0e0;
+ padding: 40px 24px;
+ border: 1px solid #4a4d53;
+ border-radius: 12px;
+ box-shadow: 0 10px 30px rgba(0,0,0,0.5);
+ }
+
+ .dashboard-header {
+ margin-bottom: 35px;
+ }
+
+ .dashboard-header h1 {
+ font-family: 'JetBrains Mono', monospace;
+ color: #ffffff;
+ font-size: 1.6em;
+ margin: 0;
+ padding-left: 15px;
+ border-left: 4px solid #4dabf7;
+ }
+
+ .meta-tag {
+ font-family: 'JetBrains Mono', monospace;
+ font-size: 0.8em;
+ color: #4dabf7;
+ background: rgba(77, 171, 247, 0.1);
+ padding: 4px 12px;
+ border: 1px solid rgba(77, 171, 247, 0.3);
+ border-radius: 4px;
+ margin-top: 12px;
+ display: inline-block;
+ }
+
+ .content-panel {
+ background-color: #25262b;
+ border: 1px solid #4a4d53;
+ border-radius: 8px;
+ margin-bottom: 25px;
+ overflow: hidden;
+ }
+
+ .panel-title {
+ background-color: #4dabf7;
+ color: #000000;
+ padding: 8px 20px;
+ font-family: 'JetBrains Mono', monospace;
+ font-weight: 800;
+ font-size: 0.85em;
+ text-transform: uppercase;
+ letter-spacing: 1px;
+ }
+
+ .panel-body {
+ padding: 20px;
+ }
+
+ .repo-data-panel {
+ padding: 14px 10px;
+ }
+
+ .repo-data-body {
+ display: flex;
+ flex-direction: column;
+ align-items: center;
+ gap: 20px;
+ width: 100%;
+ --edge-gap: 8px;
+ }
+
+ .repo-graph {
+ display: block;
+ width: min(1440px, calc(100% - (var(--edge-gap) * 2)));
+ height: auto;
+ margin: 0 auto;
+ }
+
+ .table-wrapper {
+ display: inline-block;
+ margin: 0 auto;
+ border: 1px solid #666a73;
+ border-radius: 4px;
+ overflow: hidden;
+ max-width: calc(100% - (var(--edge-gap) * 2));
+ }
+
+ .data-table {
+ border-collapse: collapse;
+ font-family: 'JetBrains Mono', monospace;
+ font-size: 0.85em;
+ width: auto;
+ margin: 0;
+ }
+
+ .data-table th {
+ text-align: left;
+ color: #ffffff;
+ background-color: #2d2f34;
+ padding: 9px 12px;
+ border-bottom: 2px solid #666a73;
+ border-right: 1px solid #4a4d53;
+ }
+
+ .data-table td {
+ padding: 7px 12px;
+ border-bottom: 1px solid #4a4d53;
+ border-right: 1px solid #4a4d53;
+ }
+
+ .data-table tr td:last-child,
+ .data-table tr th:last-child {
+ border-right: none;
+ }
+
+ .data-table tr:last-child td {
+ border-bottom: none;
+ }
+
+ .data-table tr:hover td {
+ background-color: rgba(77, 171, 247, 0.05);
+ }
+
+ .link-style {
+ color: #4dabf7;
+ text-decoration: none;
+ }
+
+ .link-style:hover {
+ text-decoration: underline;
+ color: #ffffff;
+ }
+
+ .terminal-box {
+ background-color: #0c0d0e;
+ border: 1px solid #4a4d53;
+ border-radius: 6px;
+ padding: 18px;
+ font-family: 'JetBrains Mono', monospace;
+ font-size: 0.85em;
+ color: #cbd5e0;
+ }
+ </style>
+
+ <div class="dashboard-container">
+
+ <div class="dashboard-header">
+ <h1>z-lab / Qwen3.6-35B-A3B-DFlash</h1>
+ <div class="meta-tag">QUANTIZED BY: UnstableLlama</div>
+ </div>
+
+ <div class="content-panel">
+ <div class="panel-title">Information</div>
+ <div class="panel-body">
+ 3.00bpw exl3 quantization of <b><a class="link-style" href="https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash">Qwen3.6-35B-A3B-DFlash</a></b> via
+ <b><a class="link-style" href="https://github.com/turboderp-org/exllamav3">exllamav3</a></b>.
+ <br/>
+ Repo generated automatically with
+ <a class="link-style" href="https://github.com/UnstableLlama/ezexl3">ezexl3</a>.
+ </div>
+ </div>
+
+ <div class="content-panel">
+ <div class="panel-title">Repo Data</div>
+ <div class="panel-body repo-data-body repo-data-panel">
+ <div class="table-wrapper">
+ <table class="data-table">
+ <thead>
+ <tr>
+ <th>REVISION</th>
+ <th>MB</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><a class="link-style" href="https://huggingface.co/UnstableLlama/Qwen3.6-35B-A3B-DFlash-exl3-2.50bpw/">2.50bpw</a></td>
+ <td>145</td>
+ </tr>
+ <tr>
+ <td><a class="link-style" href="https://huggingface.co/UnstableLlama/Qwen3.6-35B-A3B-DFlash-exl3-3.00bpw/">3.00bpw</a></td>
+ <td>174</td>
+ </tr>
+ <tr>
+ <td><a class="link-style" href="https://huggingface.co/UnstableLlama/Qwen3.6-35B-A3B-DFlash-exl3-3.50bpw/">3.50bpw</a></td>
+ <td>203</td>
+ </tr>
+ <tr>
+ <td><a class="link-style" href="https://huggingface.co/UnstableLlama/Qwen3.6-35B-A3B-DFlash-exl3-4.00bpw/">4.00bpw</a></td>
+ <td>226</td>
+ </tr>
+ <tr>
+ <td><a class="link-style" href="https://huggingface.co/UnstableLlama/Qwen3.6-35B-A3B-DFlash-exl3-6.00bpw/">6.00bpw</a></td>
+ <td>348</td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
+ </div>
+ </div>
+
+ <div class="content-panel">
+ <div class="panel-title">CLI Download</div>
+ <div class="panel-body">
+ <div class="terminal-box">
+ hf download UnstableLlama/Qwen3.6-35B-A3B-DFlash-exl3-3.00bpw --local-dir ./Qwen3.6-35B-A3B-DFlash-exl3-3.00bpw
+ </div>
+ </div>
+ </div>
+
+ </div>
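The CLI download shown in the new README can also be done from Python. The sketch below assumes the `huggingface_hub` package is installed and that each revision lives in its own repository named after the pattern in the Repo Data links; the helper names `repo_id_for` and `download_quant` are hypothetical and not part of the README.

```python
# Sketch: Python equivalent of the `hf download ... --local-dir ...` command.
# Assumptions: `huggingface_hub` is installed, and repositories follow the
# naming pattern seen in the Repo Data links
# (UnstableLlama/Qwen3.6-35B-A3B-DFlash-exl3-<bpw>bpw).
from pathlib import Path

OWNER = "UnstableLlama"
BASE_NAME = "Qwen3.6-35B-A3B-DFlash-exl3"


def repo_id_for(bpw: float) -> str:
    """Build the repository ID for a bits-per-weight value, e.g. 3.0 -> ...-3.00bpw."""
    return f"{OWNER}/{BASE_NAME}-{bpw:.2f}bpw"


def download_quant(bpw: float, root: str = ".") -> str:
    """Download one quantized revision into <root>/<repo name>; returns the local path."""
    # Imported lazily so the naming helper works without the dependency.
    from huggingface_hub import snapshot_download

    repo_id = repo_id_for(bpw)
    local_dir = Path(root) / repo_id.split("/", 1)[1]
    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


# Example (performs a network download when uncommented):
# path = download_quant(3.00)
# print(path)
```

Calling `download_quant(3.00)` targets the same repository and local directory as the CLI command in the README.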