---
pipeline_tag: text-generation
library_name: transformers
base_model:
- Qwen/Qwen3.5-4B
license: apache-2.0
inference: false
tags:
- dflash
- speculative-decoding
- speculative-decoding-draft
- block-diffusion
- draft-model
- diffusion-language-model
- efficiency
- qwen
- qwen3
- qwen3.5
- sglang
---
# Qwen3.5-4B-DFlash
[Paper](https://arxiv.org/abs/2602.06036) | [Github](https://github.com/z-lab/dflash) | [Blog](https://z-lab.ai/projects/dflash)
This DFlash draft model is a joint retrain from [Z-Lab](https://z-lab.ai) and [Modal](https://modal.com), trained with 40k sequence length and sliding-window attention for improved long-context performance. It is mirrored across the following Hugging Face repositories:
- [`z-lab/Qwen3.5-4B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-4B-DFlash)
- [`modal-labs/Qwen3.5-4B-DFlash`](https://huggingface.co/modal-labs/Qwen3.5-4B-DFlash)
This repository contains a DFlash draft model for `Qwen/Qwen3.5-4B`. It is not a standalone language model. It is intended to be paired with the target model in a speculative decoding server.
DFlash uses a lightweight block diffusion draft model to propose multiple tokens in parallel. The target model verifies those proposals, improving serving throughput while preserving the target model's output distribution.
## Quick Start
### Installation
#### SGLang
Install a recent SGLang build with DFlash support:
```bash
uv pip install --upgrade "sglang[all]"
```
For best performance on Blackwell GPUs, use an SGLang build that includes DFlash, FA4/TRT-LLM attention, and FlashInfer support.
#### vLLM
For vLLM support, please refer to [vllm-project/vllm#40898](https://github.com/vllm-project/vllm/pull/40898). We will update the PR to make it merge-ready soon.
### Launch Server
This model should be used with an inference server that supports DFlash speculative decoding. An example SGLang deployment is:
```bash
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-4B \
--trust-remote-code \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/Qwen3.5-4B-DFlash \
--speculative-dflash-block-size 8 \
--speculative-draft-attention-backend fa4 \
--attention-backend trtllm_mha \
--linear-attn-prefill-backend flashinfer \
--linear-attn-decode-backend flashinfer \
--mamba-scheduler-strategy extra_buffer \
--tp-size 1 \
--max-running-requests 32 \
--cuda-graph-max-bs-decode 32 \
--cuda-graph-backend-prefill tc_piecewise \
--enable-flashinfer-allreduce-fusion \
--mem-fraction-static 0.8 \
--host 0.0.0.0 \
--port 30000
```
Block size `8` is the recommended default for higher-concurrency serving. Block size `16` gives longer accept lengths and strong concurrency-1 throughput in most workloads.
## Benchmark Results
We benchmarked DFlash against the autoregressive baseline and Qwen's built-in MTP draft path. DFlash reaches up to `4.60x` speedup at concurrency 1 and `2.61x` at concurrency 32. Across the benchmark suite, DFlash delivers higher throughput than MTP at every matched setting where both completed.
### Setup
- Runtime: SGLang on 1x NVIDIA B200 GPU, tensor parallel size 1, `bfloat16`
- Backends: `trtllm_mha` target attention, `fa4` DFlash draft attention, `flashinfer` linear-attention prefill and decode
- Workloads: GSM8K, MATH500, HumanEval, MBPP, and MT-Bench with the Qwen chat template
- Decoding: greedy, thinking enabled, max output length 4096 tokens
- Measurement: 5 independent runs per configuration at concurrency 1 and 32 with continuous batching
- Throughput: generated output tokens / wall-clock benchmark time, including prefill and scheduling
- Accept length: `completion_tokens / spec_verify_ct` per generation turn, averaged across generation turns
### Throughput and Speedup
Each cell is `output tok/s (speedup)`. Bold marks the fastest speculative configuration in each row.
#### Concurrency 1
| Workload | Baseline | MTP steps=3 | DFlash block=4 | MTP steps=7 | DFlash block=8 | MTP steps=15 | DFlash block=16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gsm8k | 356.0 (1.00x) | 739.3 (2.08x) | 859.6 (2.41x) | 772.0 (2.17x) | 1226.8 (3.45x) | 585.3 (1.64x) | **1387.4 (3.90x)** |
| math500 | 360.2 (1.00x) | 763.7 (2.12x) | 899.4 (2.50x) | 832.5 (2.31x) | 1355.7 (3.76x) | 645.1 (1.79x) | **1636.5 (4.54x)** |
| humaneval | 355.8 (1.00x) | 739.7 (2.08x) | 892.2 (2.51x) | 803.0 (2.26x) | 1325.2 (3.72x) | 594.7 (1.67x) | **1634.9 (4.60x)** |
| mbpp | 360.2 (1.00x) | 723.9 (2.01x) | 895.6 (2.49x) | 737.2 (2.05x) | 1314.9 (3.65x) | 557.4 (1.55x) | **1494.9 (4.15x)** |
| mt-bench | 356.5 (1.00x) | 708.8 (1.99x) | 806.7 (2.26x) | 699.0 (1.96x) | 1085.3 (3.04x) | 528.7 (1.48x) | **1211.0 (3.40x)** |
#### Concurrency 32
| Workload | Baseline | MTP steps=3 | DFlash block=4 | MTP steps=7 | DFlash block=8 | MTP steps=15 | DFlash block=16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gsm8k | 7501.6 (1.00x) | 12716.8 (1.70x) | 15015.4 (2.00x) | 12419.5 (1.66x) | **17613.7 (2.35x)** | 8696.7 (1.16x) | 14203.5 (1.89x) |
| math500 | 7573.5 (1.00x) | 13482.4 (1.78x) | 15876.2 (2.10x) | 13636.5 (1.80x) | **19759.4 (2.61x)** | 9663.1 (1.28x) | 17060.5 (2.25x) |
| humaneval | 7286.1 (1.00x) | 12313.2 (1.69x) | 15284.3 (2.10x) | 12326.0 (1.69x) | **18792.5 (2.58x)** | 9115.4 (1.25x) | 16492.0 (2.26x) |
| mbpp | 7065.9 (1.00x) | 11032.0 (1.56x) | 14641.6 (2.07x) | 10842.3 (1.53x) | **17908.0 (2.53x)** | 7744.1 (1.10x) | 15427.4 (2.18x) |
| mt-bench | 6797.1 (1.00x) | 11514.5 (1.69x) | 12715.5 (1.87x) | 11155.8 (1.64x) | **14623.7 (2.15x)** | 8045.7 (1.18x) | 12007.6 (1.77x) |
### Accept Length
Mean accept length at concurrency 1. Bold marks the higher value in each matched MTP/DFlash pair.
| Workload | MTP steps=3 | DFlash block=4 | MTP steps=7 | DFlash block=8 | MTP steps=15 | DFlash block=16 |
| --- | --- | --- | --- | --- | --- | --- |
| gsm8k | 3.422 | **3.427** | 5.133 | **5.299** | 6.175 | **6.748** |
| math500 | 3.502 | **3.528** | 5.345 | **5.650** | 6.468 | **7.478** |
| humaneval | 3.448 | **3.551** | 5.193 | **5.684** | 6.147 | **7.719** |
| mbpp | 3.272 | **3.418** | 4.611 | **5.236** | 5.326 | **6.527** |
| mt-bench | **3.266** | 3.234 | 4.626 | **4.704** | 5.610 | **5.933** |
## Citation
If you find DFlash useful, please cite the original paper:
```bibtex
@article{chen2026dflash,
title = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
author = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
journal = {arXiv preprint arXiv:2602.06036},
year = {2026}
}
```