jianchen0311 committed on May 5

Commit e0f8b48 (verified) · 1 Parent(s): 62145be

Create README.md

Files changed (1)

README.md +143 -0

README.md ADDED

	@@ -0,0 +1,143 @@

+---
+license: mit
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- dflash
+- speculative-decoding
+- block-diffusion
+- draft-model
+- efficiency
+- qwen
+- gemma
+- diffusion-language-model
+---
+# gemma-4-26B-A4B-it-DFlash
+[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
+**DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model to draft multiple tokens in parallel. This is the drafter model, which must be paired with [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it).
+<div align="center">
+  <img src="assets/dflash_system.png" alt="DFlash Architecture" width="85%">
+</div>
+## Quick Start
+### Installation
+vLLM (until Gemma 4 DFlash support is merged, install from this [PR](https://github.com/vllm-project/vllm/pull/41703)):
+```bash
+uv pip install vllm
+uv pip install -U --torch-backend=auto "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/41703/head"
+```
+SGLang:
+```bash
+uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/23000/head#subdirectory=python"
+```
+### Launch Server
+vLLM:
+```bash
+vllm serve google/gemma-4-26B-A4B-it \
+  --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 15, "attention_backend": "flash_attn"}' \
+  --attention-backend triton_attn \
+  --max-num-batched-tokens 32768 \
+  --trust-remote-code
+```
+SGLang:
+```bash
+# Optional: enable schedule overlapping (experimental, may not be stable)
+# export SGLANG_ENABLE_SPEC_V2=1
+# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
+# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+python -m sglang.launch_server \
+    --model-path google/gemma-4-26B-A4B-it \
+    --speculative-algorithm DFLASH \
+    --speculative-draft-model-path z-lab/gemma-4-26B-A4B-it-DFlash \
+    --speculative-num-draft-tokens 16 \
+    --tp-size 1 \
+    --attention-backend triton \
+    --speculative-draft-attention-backend fa4 \
+    --trust-remote-code
+```
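The `--speculative-config` value in the vLLM command above is a JSON object passed as a single shell argument, which is easy to mis-quote by hand. The following is a minimal sketch of building the same command from Python. The helper name `build_vllm_command` is hypothetical; every flag and value is copied from the command above, and nothing here is taken from the vLLM codebase.

```python
import json
import shlex

# Values copied from the vLLM launch command above; nothing here is new.
speculative_config = {
    "method": "dflash",
    "model": "z-lab/gemma-4-26B-A4B-it-DFlash",
    "num_speculative_tokens": 15,
    "attention_backend": "flash_attn",
}


def build_vllm_command(target_model: str, spec_config: dict) -> list[str]:
    """Return the argv list for `vllm serve` with a JSON speculative config.

    Hypothetical helper for illustration: it only assembles strings and
    does not validate the flags against any vLLM version.
    """
    return [
        "vllm", "serve", target_model,
        "--speculative-config", json.dumps(spec_config),
        "--attention-backend", "triton_attn",
        "--max-num-batched-tokens", "32768",
        "--trust-remote-code",
    ]


argv = build_vllm_command("google/gemma-4-26B-A4B-it", speculative_config)
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` avoids shell quoting altogether; the printed form is only for copy and paste.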
+### Usage
+```python
+from openai import OpenAI
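+# Port 30000 is the SGLang default; a vLLM server started as above listens on port 8000 unless --port is set.
+client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")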
+response = client.chat.completions.create(
+    model="google/gemma-4-26B-A4B-it",
+    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+    max_tokens=4096,
+    temperature=0.0,
+    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
+)
+print(response.choices[0].message.content)
+```
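To get a rough feel for the speedup on your own hardware, the request above can be timed. This is a sketch under assumptions, not the benchmark harness behind the reported numbers: `tokens_per_second` and `measure` are hypothetical helpers, the measurement is end to end (prefill and network time included), and it assumes the server fills in `usage.completion_tokens` on non-streaming responses.

```python
import time


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """End-to-end output throughput for one request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(prompt: str, base_url: str = "http://localhost:30000/v1") -> float:
    """Send one chat request and return output tokens per second.

    Needs a running server and the `openai` package; the import is lazy so
    that tokens_per_second stays usable without either.
    """
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="google/gemma-4-26B-A4B-it",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.0,
    )
    elapsed = time.perf_counter() - start
    # Assumption: the server reports token usage for this response.
    return tokens_per_second(response.usage.completion_tokens, elapsed)
```

Calling `measure("Write a quicksort in Python.")` once against a server launched with the speculative flags and once against a plain server gives two numbers to compare. A single request is noisy, so averaging several runs is advisable.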
+## Benchmark Results
+**Setup:** Single NVIDIA B300, vLLM, thinking enabled, max output length 4096.
+### Throughput and Speedup
+DFlash achieves up to **3.6x** speedup at concurrency 1 (Math500).
+_Tokens/sec (speedup vs. autoregressive baseline)_
+**Block Size = 16**
+| Task | Concurrency | AR | **DFlash** |
+|---|---:|---:|---:|
+| Math500 | 1 | 259 | **925 (3.6x)** |
+|  | 8 | 1296 | **4837 (3.7x)** |
+|  | 32 | 3233 | **11435 (3.5x)** |
+| GSM8K | 1 | 256 | **825 (3.2x)** |
+|  | 8 | 1217 | **4241 (3.5x)** |
+|  | 32 | 3174 | **10306 (3.2x)** |
+| HumanEval | 1 | 246 | **818 (3.3x)** |
+|  | 8 | 1182 | **4240 (3.6x)** |
+|  | 32 | 2881 | **9150 (3.2x)** |
+| MBPP | 1 | 272 | **698 (2.6x)** |
+|  | 8 | 1288 | **3387 (2.6x)** |
+|  | 32 | 2950 | **7898 (2.7x)** |
+| MT-Bench | 1 | 272 | **492 (1.8x)** |
+|  | 8 | 1146 | **2259 (2.0x)** |
+|  | 32 | 2164 | **4829 (2.2x)** |
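The speedup in each row of the table above is the DFlash throughput divided by the AR throughput, rounded to one decimal place. The sketch below recomputes every ratio from the tabulated tokens/sec values and reports the largest one at a given concurrency. The helper names are illustrative; the numbers are copied from the table.

```python
# (task, concurrency) -> (AR tok/s, DFlash tok/s, reported speedup),
# copied from the Block Size = 16 table above.
ROWS = {
    ("Math500", 1): (259, 925, 3.6),
    ("Math500", 8): (1296, 4837, 3.7),
    ("Math500", 32): (3233, 11435, 3.5),
    ("GSM8K", 1): (256, 825, 3.2),
    ("GSM8K", 8): (1217, 4241, 3.5),
    ("GSM8K", 32): (3174, 10306, 3.2),
    ("HumanEval", 1): (246, 818, 3.3),
    ("HumanEval", 8): (1182, 4240, 3.6),
    ("HumanEval", 32): (2881, 9150, 3.2),
    ("MBPP", 1): (272, 698, 2.6),
    ("MBPP", 8): (1288, 3387, 2.6),
    ("MBPP", 32): (2950, 7898, 2.7),
    ("MT-Bench", 1): (272, 492, 1.8),
    ("MT-Bench", 8): (1146, 2259, 2.0),
    ("MT-Bench", 32): (2164, 4829, 2.2),
}


def speedup(ar_tps: float, dflash_tps: float) -> float:
    """Throughput ratio of DFlash over the autoregressive baseline."""
    return dflash_tps / ar_tps


def check_reported(rows: dict) -> list:
    """Return the keys whose reported speedup differs from the recomputed one."""
    return [
        key
        for key, (ar, df, reported) in rows.items()
        if round(speedup(ar, df), 1) != reported
    ]


def best_at(rows: dict, concurrency: int) -> tuple:
    """Return (task, speedup) with the largest ratio at one concurrency level."""
    candidates = {
        task: speedup(ar, df)
        for (task, c), (ar, df, _) in rows.items()
        if c == concurrency
    }
    task = max(candidates, key=candidates.get)
    return task, round(candidates[task], 1)


mismatches = check_reported(ROWS)
print("rows that disagree with the table:", mismatches)
print("best at concurrency 1:", best_at(ROWS, 1))
```

Every reported ratio matches its recomputed value, and the largest ratio at concurrency 1 is 3.6x on Math500.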
+### Acceptance Length
+| Task | Concurrency 1 | Concurrency 8 | Concurrency 32 |
+|---|---:|---:|---:|
+| Math500 | 8.61 | 8.55 | 8.60 |
+| GSM8K | 7.71 | 7.76 | 7.72 |
+| HumanEval | 7.80 | 7.87 | 7.83 |
+| MBPP | 6.09 | 5.99 | 6.03 |
+| MT-Bench | 4.33 | 4.33 | 4.24 |
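Acceptance length and speedup can be tied together with a back-of-the-envelope model. Assume, as a simplification not stated in the README, that acceptance length is the average number of tokens emitted per draft-and-verify step, that the autoregressive baseline emits one token per step, and that throughput is dominated by decoding. Then speedup = acceptance length × (AR step time / DFlash step time), so the relative cost of one DFlash step is acceptance length / speedup. The sketch below applies this to the concurrency-1 numbers from the two tables.

```python
# task -> (acceptance length, AR tok/s, DFlash tok/s), all at concurrency 1,
# copied from the acceptance-length and throughput tables.
C1 = {
    "Math500": (8.61, 259, 925),
    "GSM8K": (7.71, 256, 825),
    "HumanEval": (7.80, 246, 818),
    "MBPP": (6.09, 272, 698),
    "MT-Bench": (4.33, 272, 492),
}


def relative_step_cost(accept_len: float, ar_tps: float, dflash_tps: float) -> float:
    """Cost of one draft-and-verify step in units of one AR decoding step.

    Derived from the simplified model in the text above; it is an
    estimate, not a measured quantity.
    """
    speedup = dflash_tps / ar_tps
    return accept_len / speedup


costs = {task: relative_step_cost(*vals) for task, vals in C1.items()}
for task, cost in costs.items():
    print(f"{task}: one DFlash step costs about {cost:.2f} AR steps")
```

Under these assumptions the estimate falls between about 2.3 and 2.4 AR steps for all five tasks, which suggests that the differences in speedup across tasks come mostly from acceptance length rather than from per-step overhead.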
+## Acknowledgements
+Special thanks to [David Wang](https://davidwa.ng/) for his outstanding engineering support on this project. We are also grateful to [Modal](https://modal.com/), [InnoMatrix](https://innomatrix.ai), and [Yotta Labs](https://www.yottalabs.ai/) for providing the compute resources used to train this draft model.
+## Citation
+If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: [DFlash Feedback](https://forms.gle/4YNwfqb4nJdqn6hq9).
+```bibtex
+@article{chen2026dflash,
+  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
+  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
+  journal = {arXiv preprint arXiv:2602.06036},
+  year    = {2026}
+}
+```