---
pipeline_tag: text-generation
library_name: transformers
base_model:
  - Qwen/Qwen3.5-35B-A3B
license: apache-2.0
inference: false
tags:
  - dflash
  - speculative-decoding
  - speculative-decoding-draft
  - block-diffusion
  - draft-model
  - diffusion-language-model
  - efficiency
  - qwen
  - qwen3
  - qwen3.5
  - sglang
---

# Qwen3.5-35B-A3B-DFlash

[Paper](https://arxiv.org/abs/2602.06036) | [Github](https://github.com/z-lab/dflash) | [Blog](https://z-lab.ai/projects/dflash)

This DFlash draft model is a joint retrain from [Z-Lab](https://z-lab.ai) and [Modal](https://modal.com), trained with 40k sequence length and sliding-window attention for improved long-context performance. It is mirrored across the following Hugging Face repositories:

- [`z-lab/Qwen3.5-35B-A3B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-35B-A3B-DFlash)
- [`modal-labs/Qwen3.5-35B-A3B-DFlash`](https://huggingface.co/modal-labs/Qwen3.5-35B-A3B-DFlash)

This repository contains a DFlash draft model for `Qwen/Qwen3.5-35B-A3B`. It is not a standalone language model. It is intended to be paired with the target model in a speculative decoding server.

DFlash uses a lightweight block diffusion draft model to propose multiple tokens in parallel. The target model verifies those proposals, improving serving throughput while preserving the target model's output distribution.

<div align="center">
  <img src="assets/dflash_system.png" alt="DFlash Architecture" width="85%">
</div>

## Quick Start

### Installation

#### SGLang

Install a recent SGLang build with DFlash support:

```bash
uv pip install --upgrade "sglang[all]"
```

For best performance on Blackwell GPUs, use an SGLang build that includes DFlash, FA4/TRT-LLM attention, and FlashInfer support.

#### vLLM

For vLLM support, please refer to [vllm-project/vllm#40898](https://github.com/vllm-project/vllm/pull/40898). We will update the PR to make it merge-ready soon.

### Launch Server

This model should be used with an inference server that supports DFlash speculative decoding. An example SGLang deployment is:

```bash
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --trust-remote-code \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
  --speculative-dflash-block-size 8 \
  --speculative-draft-attention-backend fa4 \
  --attention-backend trtllm_mha \
  --linear-attn-prefill-backend flashinfer \
  --linear-attn-decode-backend flashinfer \
  --mamba-scheduler-strategy extra_buffer \
  --tp-size 1 \
  --max-running-requests 32 \
  --cuda-graph-max-bs-decode 32 \
  --cuda-graph-backend-prefill tc_piecewise \
  --enable-flashinfer-allreduce-fusion \
  --mem-fraction-static 0.8 \
  --host 0.0.0.0 \
  --port 30000
```

Block size `8` is the recommended default for higher-concurrency serving. Block size `16` gives longer accept lengths and strong concurrency-1 throughput in most workloads.

## Benchmark Results

We benchmarked DFlash against the autoregressive baseline and Qwen's built-in MTP draft path. DFlash reaches up to `3.71x` speedup at concurrency 1 and `2.89x` at concurrency 32. Across the benchmark suite, DFlash delivers higher throughput than MTP at every matched setting where both completed.

### Setup

- Runtime: SGLang on 1x NVIDIA B200 GPU, tensor parallel size 1, `bfloat16`
- Backends: `trtllm_mha` target attention, `fa4` DFlash draft attention, `flashinfer` linear-attention prefill and decode
- Workloads: GSM8K, MATH500, HumanEval, MBPP, and MT-Bench with the Qwen chat template
- Decoding: greedy, thinking enabled, max output length 4096 tokens
- Measurement: 5 independent runs per configuration at concurrency 1 and 32 with continuous batching
- Throughput: generated output tokens / wall-clock benchmark time, including prefill and scheduling
- Accept length: `completion_tokens / spec_verify_ct` per generation turn, averaged across generation turns

### Throughput and Speedup

Each cell is `output tok/s (speedup)`. Bold marks the fastest speculative configuration in each row.

#### Concurrency 1

| Workload | Baseline | MTP steps=3 | DFlash block=4 | MTP steps=7 | DFlash block=8 | MTP steps=15 | DFlash block=16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gsm8k | 310.0 (1.00x) | 622.5 (2.01x) | 695.9 (2.24x) | 652.8 (2.11x) | 905.1 (2.92x) | 508.7 (1.64x) | **939.2 (3.03x)** |
| math500 | 308.0 (1.00x) | 645.7 (2.10x) | 723.2 (2.35x) | 710.1 (2.31x) | 995.6 (3.23x) | 569.8 (1.85x) | **1096.1 (3.56x)** |
| humaneval | 304.4 (1.00x) | 617.3 (2.03x) | 721.0 (2.37x) | 672.4 (2.21x) | 989.3 (3.25x) | 538.6 (1.77x) | **1128.1 (3.71x)** |
| mbpp | 309.0 (1.00x) | 605.4 (1.96x) | 717.3 (2.32x) | 619.8 (2.01x) | 949.4 (3.07x) | 468.6 (1.52x) | **1006.7 (3.26x)** |
| mt-bench | 307.9 (1.00x) | 571.5 (1.86x) | 630.2 (2.05x) | 555.8 (1.81x) | **736.0 (2.39x)** | 407.3 (1.32x) | 727.1 (2.36x) |

#### Concurrency 32

| Workload | Baseline | MTP steps=3 | DFlash block=4 | MTP steps=7 | DFlash block=8 | MTP steps=15 | DFlash block=16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gsm8k | 3453.8 (1.00x) | 6298.1 (1.82x) | 7145.2 (2.07x) | 6953.7 (2.01x) | **8863.0 (2.57x)** | 5730.2 (1.66x) | 8275.6 (2.40x) |
| math500 | 3395.2 (1.00x) | 6679.7 (1.97x) | 7380.6 (2.17x) | 7771.4 (2.29x) | **9803.0 (2.89x)** | 6632.1 (1.95x) | 9776.9 (2.88x) |
| humaneval | 3287.7 (1.00x) | 5628.9 (1.71x) | 7077.2 (2.15x) | 6293.7 (1.91x) | **9096.8 (2.77x)** | 5152.8 (1.57x) | 9083.5 (2.76x) |
| mbpp | 3485.1 (1.00x) | 5549.7 (1.59x) | 7203.1 (2.07x) | 5925.6 (1.70x) | **9164.9 (2.63x)** | 4849.5 (1.39x) | 8758.6 (2.51x) |
| mt-bench | 3232.8 (1.00x) | 5651.5 (1.75x) | 6094.6 (1.89x) | 5920.9 (1.83x) | **6904.0 (2.14x)** | 4603.6 (1.42x) | 6109.7 (1.89x) |

### Accept Length

Mean accept length at concurrency 1. Bold marks the higher value in each matched MTP/DFlash pair.

| Workload | MTP steps=3 | DFlash block=4 | MTP steps=7 | DFlash block=8 | MTP steps=15 | DFlash block=16 |
| --- | --- | --- | --- | --- | --- | --- |
| gsm8k | **3.504** | 3.458 | 5.402 | **5.404** | 6.605 | **6.983** |
| math500 | **3.582** | 3.546 | 5.607 | **5.721** | 6.975 | **7.594** |
| humaneval | 3.547 | **3.602** | 5.561 | **5.900** | 6.888 | **8.218** |
| mbpp | 3.384 | **3.451** | 4.904 | **5.317** | 5.672 | **6.738** |
| mt-bench | **3.209** | 3.137 | **4.494** | 4.432 | 5.238 | **5.341** |

## Citation

If you find DFlash useful, please cite the original paper:

```bibtex
@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
```