Quick Links

llama.cpp Pull Request: https://github.com/ggml-org/llama.cpp/pull/22105

DFlash Drafter: https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash

Steps (follow the PR)

  1. git clone -b dflash https://github.com/ruixiang63/llama.cpp

  2. Download the draft model from https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/

  3. Download the tokenizer from https://huggingface.co/Qwen/Qwen3.6-35B-A3B

  4. Convert the draft model to GGUF

python convert_hf_to_gguf.py ../Qwen3.6-35B-A3B-DFlash --outtype bf16 --target-model-dir ../Qwen3.6-35B-A3B --outfile ../Qwen3.6-35B-A3B-DFlash/Qwen3.6-35B-A3B-DFlash-bf16.gguf

  5. Build llama.cpp
  • CUDA
cmake -B build -DGGML_CUDA=ON

cmake --build build --config Release -j
  • Vulkan
cmake -B build -DGGML_VULKAN=ON

cmake --build build --config Release -j
  6. Run DFlash speculative decoding
# thinking off: set LLAMA_SPEC_NO_THINK=1
# Omit it to test thinking-mode behavior
export LLAMA_SPEC_NO_THINK=1

for prompt in \
    "Write a quicksort algorithm in Python. Write code only." \
    "Explain the Pythagorean theorem" \
    "Plan a 1 day trip to DC"; do
  echo "=== Prompt: $prompt ==="
  ./build/bin/llama-speculative-simple \
    -m  "${TARGET_MODEL_GGUF}" \
    -md "${DFLASH_MODEL_GGUF}" \
    --dflash -p "$prompt" -n 256 \
    --draft-max 16 \
    -cd 512 -c 1024 \
    --temp 0 --top-k 1 --seed 42 \
    -ngl 99 -ngld 99
done
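
The loop above expects TARGET_MODEL_GGUF and DFLASH_MODEL_GGUF to be set beforehand. A minimal sketch of setting them and sanity-checking the files follows. The target path is a hypothetical placeholder, the draft path is the --outfile from step 4, and the is_gguf helper is an illustration that is not part of llama.cpp. It only checks that a file starts with the 4-byte "GGUF" magic.

```shell
# Hypothetical placeholder: point this at your own target-model GGUF.
: "${TARGET_MODEL_GGUF:=../Qwen3.6-35B-A3B/target-model.gguf}"
# Draft model written by the conversion step (--outfile in step 4).
: "${DFLASH_MODEL_GGUF:=../Qwen3.6-35B-A3B-DFlash/Qwen3.6-35B-A3B-DFlash-bf16.gguf}"
export TARGET_MODEL_GGUF DFLASH_MODEL_GGUF

# Illustrative helper (not part of llama.cpp): succeeds if the file exists
# and begins with the 4-byte GGUF magic.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Report on both files without aborting, so the check is safe to run anywhere.
for f in "$TARGET_MODEL_GGUF" "$DFLASH_MODEL_GGUF"; do
  if is_gguf "$f"; then
    echo "ok: $f"
  else
    echo "missing or not a GGUF file: $f" >&2
  fi
done
```

Running this before the loop catches a wrong path or a failed conversion early, before the models are loaded.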

Tests and Investigations

2026-04-19:

Rebase dflash feature onto latest master

git clone -b master https://github.com/ggml-org/llama.cpp
cd llama.cpp
git remote add ruixiang63 https://github.com/ruixiang63/llama.cpp
git fetch ruixiang63
git checkout -b dflash-test origin/master
git merge ruixiang63/dflash --no-edit

Then resolve the conflicts manually in:

  • gguf-py/gguf/constants.py
  • src/CMakeLists.txt
  • src/llama-arch.cpp
  • src/llama-hparams.h
  • src/llama-model.cpp
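
The set of conflicting files can change as master moves, so it may help to ask git which files are still unmerged instead of relying on the list above. A minimal sketch follows. The list_conflicts helper name is hypothetical, while git diff --name-only --diff-filter=U is the standard way to list unmerged paths.

```shell
# Illustrative helper: print the paths that still have unresolved merge
# conflicts in the repository at $1 (defaults to the current directory).
list_conflicts() {
  git -C "${1:-.}" diff --name-only --diff-filter=U
}

# Usage, from inside the llama.cpp checkout after the merge stops:
#   list_conflicts
# After fixing a file, mark it resolved with `git add <file>`; once the list
# is empty, finish the merge with `git commit --no-edit`.
```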

Notes

from:

layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q,   "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);
layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K,   "bias", i), {n_embd_k_gqa},          TENSOR_NOT_REQUIRED);
layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V,   "bias", i), {n_embd_v_gqa},          TENSOR_NOT_REQUIRED);
layer.bo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd},                TENSOR_NOT_REQUIRED);

to:

layer.wq_b = create_tensor(tn(LLM_TENSOR_ATTN_Q,   "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);
layer.wk_b = create_tensor(tn(LLM_TENSOR_ATTN_K,   "bias", i), {n_embd_k_gqa},          TENSOR_NOT_REQUIRED);
layer.wv_b = create_tensor(tn(LLM_TENSOR_ATTN_V,   "bias", i), {n_embd_v_gqa},          TENSOR_NOT_REQUIRED);
layer.wo_b = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd},                TENSOR_NOT_REQUIRED);
  • src/models/dflash.cpp follows the same renaming:

layer.bq -> layer.wq_b

layer.bk -> layer.wk_b

layer.bv -> layer.wv_b

layer.bo -> layer.wo_b
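
The four renames above are mechanical, so they can be sketched as a single sed filter. This is only an illustration under the assumption that every `.bq`, `.bk`, `.bv`, or `.bo` member access in the file refers to these attention-bias tensors. The rename_bias_fields name is hypothetical, and the result should be reviewed with git diff before committing.

```shell
# Illustrative filter: rewrite member accesses .bq/.bk/.bv/.bo to
# .wq_b/.wk_b/.wv_b/.wo_b on stdin. The trailing group ensures the name ends
# there, so longer identifiers such as .bias or .bq_extra are left untouched.
rename_bias_fields() {
  sed -E 's/\.b([qkvo])([^A-Za-z0-9_]|$)/.w\1_b\2/g'
}

# Usage (writes to a new file so the original stays intact for comparison):
#   rename_bias_fields < src/models/dflash.cpp > dflash.renamed.cpp
```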

        cur = build_attn(inp_attn,
                model.layers[il].wo, NULL, NULL, // 3rd tensor parameter (wo_s)
                Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, kq_scale, il);

2026-04-20:

Support for Qwen3.5/3.6 MoE and notes

Z Lab's new benchmark

2026-04-22:

Re-uploaded the GGUF based on the new drafter: https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/commit/31977fbe13a86e8b961774f773058175676d89b8

Issues and Solutions

/src/models/dflash.cpp:39: GGML_ASSERT(model.target_tok_embd != nullptr && "DFlash decoder requires target model's tok_embd") failed

Check that the --dflash flag is passed when running the llama-speculative-simple test.

GGUF details: model size 0.5B params, architecture dflash, available in 8-bit and 16-bit.
