---
license: mit
---

llama.cpp Pull Request: https://github.com/ggml-org/llama.cpp/pull/22105

DFlash Drafter: https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash

Steps (follow the PR)

1) `git clone -b dflash https://github.com/ruixiang63/llama.cpp`
2) Download the draft model from https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/
3) Download the tokenizer from https://huggingface.co/Qwen/Qwen3.6-35B-A3B
4) Convert the draft model to GGUF

```bash
python convert_hf_to_gguf.py ../Qwen3.6-35B-A3B-DFlash \
    --outtype bf16 \
    --target-model-dir ../Qwen3.6-35B-A3B \
    --outfile ../Qwen3.6-35B-A3B-DFlash/Qwen3.6-35B-A3B-DFlash-bf16.gguf
```

5) Build llama.cpp

- CUDA

```bash
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```

- VULKAN

```bash
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```

6) Run DFlash speculative decoding

```bash
# Thinking off: set LLAMA_SPEC_NO_THINK=1.
# Omit it to test thinking-mode behavior.
export LLAMA_SPEC_NO_THINK=1

# TARGET_MODEL_GGUF and DFLASH_MODEL_GGUF must point to the target and draft GGUF files.
for prompt in \
    "Write a quicksort algorithm in Python. Write code only." \
    "Explain the Pythagorean theorem" \
    "Plan a 1 day trip to DC"; do
  echo "=== Prompt: $prompt ==="
  ./build/bin/llama-speculative-simple \
    -m "${TARGET_MODEL_GGUF}" \
    -md "${DFLASH_MODEL_GGUF}" \
    --dflash -p "$prompt" -n 256 \
    --draft-max 16 \
    -cd 512 -c 1024 \
    --temp 0 --top-k 1 --seed 42 \
    -ngl 99 -ngld 99
done
```

---

## Tests and Investigations

# 2026-04-19: Rebase dflash feature onto latest master

```bash
git clone -b master https://github.com/ggml-org/llama.cpp
cd llama.cpp
git remote add ruixiang63 https://github.com/ruixiang63/llama.cpp
git fetch ruixiang63
git checkout -b dflash-test origin/master
git merge ruixiang63/dflash --no-edit
```

Then resolve the conflicts manually in:

- gguf-py/gguf/constants.py
- src/CMakeLists.txt
- src/llama-arch.cpp
- src/llama-hparams.h
- src/llama-model.cpp

Notes

- src/CMakeLists.txt: use glob to collect src/models sources: https://github.com/ggml-org/llama.cpp/pull/22005/changes
- src/llama-arch.cpp: remove per-arch tensor name lists: https://github.com/ggml-org/llama.cpp/pull/21531/changes
- src/llama-model.cpp: refactor bias tensor variable names: https://github.com/ggml-org/llama.cpp/pull/22079/changes#diff-36e262e316ec1404e29880eb8b8ce4660ac584f0d0434710efc48a66497bdb59

from:

```cpp
layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q,   "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);
layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K,   "bias", i), {n_embd_k_gqa},           TENSOR_NOT_REQUIRED);
layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V,   "bias", i), {n_embd_v_gqa},           TENSOR_NOT_REQUIRED);
layer.bo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd},                 TENSOR_NOT_REQUIRED);
```

to:

```cpp
layer.wq_b = create_tensor(tn(LLM_TENSOR_ATTN_Q,   "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);
layer.wk_b = create_tensor(tn(LLM_TENSOR_ATTN_K,   "bias", i), {n_embd_k_gqa},           TENSOR_NOT_REQUIRED);
layer.wv_b = create_tensor(tn(LLM_TENSOR_ATTN_V,   "bias", i), {n_embd_v_gqa},           TENSOR_NOT_REQUIRED);
layer.wo_b = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd},                 TENSOR_NOT_REQUIRED);
```

- src/models/dflash.cpp follows the same renames:
  - `layer.bq` -> `layer.wq_b`
  - `layer.bk` -> `layer.wk_b`
  - `layer.bv` -> `layer.wv_b`
  - `layer.bo` -> `layer.wo_b`
- src/models/eagle3.cpp:134: support NVFP4 tensors for Gemma4: https://github.com/ggml-org/llama.cpp/pull/21971/changes#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7feR2099

```cpp
cur = build_attn(inp_attn,
        model.layers[il].wo, NULL, NULL, // 3rd tensor parameter (wo_s)
        Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, kq_scale, il);
```

# 2026-04-20: Support for Qwen3.5/3.6 MoE and notes

- https://github.com/ggml-org/llama.cpp/discussions/21569#discussioncomment-16624433
- https://github.com/ggml-org/llama.cpp/pull/22105/changes/d1d2c81caccc748eaaff32b6b7823bad090fd1dd

Z Lab's new benchmark

- https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/commit/82252400cd9baebdfa5730b0aa809e10db5dba12

# 2026-04-22: Re-uploaded gguf based on new drafter

https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/commit/31977fbe13a86e8b961774f773058175676d89b8

# Issues and Solutions

`/src/models/dflash.cpp:39: GGML_ASSERT(model.target_tok_embd != nullptr && "DFlash decoder requires target model's tok_embd") failed`

If this assertion fails, check that the `--dflash` flag is passed to `llama-speculative-simple`.
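
Since the assertion above is tied to a missing `--dflash` flag, a small pre-flight check in a wrapper script may help catch it before the binary starts. The following is only a sketch: `has_dflash_flag` is a hypothetical helper written for this note, not part of llama.cpp, and it only inspects the argument list you hand it.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of llama.cpp): returns 0 if the exact
# argument "--dflash" appears among the given arguments, 1 otherwise.
has_dflash_flag() {
  local arg
  for arg in "$@"; do
    if [ "$arg" = "--dflash" ]; then
      return 0
    fi
  done
  return 1
}
```

In a wrapper that forwards its arguments to `llama-speculative-simple`, something like `has_dflash_flag "$@" || echo "missing --dflash" >&2` could run first, assuming the draft model is a DFlash GGUF.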