Instructions to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test", filename="Qwen3.6-35B-A3B-DFlash-bf16.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16 # Run inference directly in the terminal: llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16 # Run inference directly in the terminal: llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16 # Run inference directly in the terminal: ./llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16 # Run inference directly in the terminal: ./build/bin/llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Use Docker
docker model run hf.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
- LM Studio
- Jan
- Ollama
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Ollama:
ollama run hf.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
- Unsloth Studio
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test to start chatting
- Pi
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Run Hermes
hermes
- Docker Model Runner
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Docker Model Runner:
docker model run hf.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
- Lemonade
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Run and chat with the model
lemonade run user.Qwen3.6-35B-A3B-DFlash-GGUF-Test-BF16
List all available models
lemonade list
llama.cpp Pull Request: https://github.com/ggml-org/llama.cpp/pull/22105
DFlash Drafter: https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash
Steps (follow the PR)
git clone -b dflash https://github.com/ruixiang63/llama.cppdownload draft model from https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/
download tokenizer from https://huggingface.co/Qwen/Qwen3.6-35B-A3B
convert draft model to gguf
python convert_hf_to_gguf.py ../Qwen3.6-35B-A3B-DFlash --outtype bf16 --target-model-dir ../Qwen3.6-35B-A3B --outfile ../Qwen3.6-35B-A3B-DFlash/Qwen3.6-35B-A3B-DFlash-bf16.gguf
- Build llama.cpp
- CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
- VULKAN
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
- Run DFlash speculative decoding
# thinking off: set LLAMA_SPEC_NO_THINK=1
# Omit it to test thinking-mode behavior
export LLAMA_SPEC_NO_THINK=1
for prompt in \
"Write a quicksort algorithm in Python. Write code only." \
"Explain the Pythagorean theorem" \
"Plan a 1 day trip to DC"; do
echo "=== Prompt: $prompt ==="
./build/bin/llama-speculative-simple \
-m "${TARGET_MODEL_GGUF}" \
-md "${DFLASH_MODEL_GGUF}" \
--dflash -p "$prompt" -n 256 \
--draft-max 16 \
-cd 512 -c 1024 \
--temp 0 --top-k 1 --seed 42 \
-ngl 99 -ngld 99
done
Tests and Investigations
2026-04-19:
Rebase dflash feature onto latest master
git clone -b master https://github.com/ggml-org/llama.cpp
git remote add ruixiang63 https://github.com/ruixiang63/llama.cpp
git fetch ruixiang63
git checkout -b dflash-test origin/master
git merge ruixiang63/dflash --no-edit
Then solve conflicts manually
- gguf-py/gguf/constants.py
- src/CMakeLists.txt
- src/llama-arch.cpp
- src/llama-hparams.h
- src/llama-model.cpp
Notes
src/CMakeLists.txt Use glob to collect src/models sources: https://github.com/ggml-org/llama.cpp/pull/22005/changes
src/llama-arch.cpp remove per-arch tensor name lists: https://github.com/ggml-org/llama.cpp/pull/21531/changes
src/llama-model.cpp Refactor bias tensor variable names: https://github.com/ggml-org/llama.cpp/pull/22079/changes#diff-36e262e316ec1404e29880eb8b8ce4660ac584f0d0434710efc48a66497bdb59
from:
layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);
layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", i), {n_embd_k_gqa}, TENSOR_NOT_REQUIRED);
layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", i), {n_embd_v_gqa}, TENSOR_NOT_REQUIRED);
layer.bo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd}, TENSOR_NOT_REQUIRED);
to:
layer.wq_b = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);
layer.wk_b = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", i), {n_embd_k_gqa}, TENSOR_NOT_REQUIRED);
layer.wv_b = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", i), {n_embd_v_gqa}, TENSOR_NOT_REQUIRED);
layer.wo_b = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd}, TENSOR_NOT_REQUIRED);
- src/models/dflash.cpp follows the same
layer.bq -> layer.wq_b
layer.bk -> layer.wk_b
layer.bv -> layer.wv_b
layer.bo -> layer.wo_b
- src/models/eagle3.cpp:134: support NVFP4 tensors for Gemma4: https://github.com/ggml-org/llama.cpp/pull/21971/changes#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7feR2099
cur = build_attn(inp_attn,
model.layers[il].wo, NULL, NULL, // 3rd tensor parameter (wo_s)
Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, kq_scale, il);
2026-04-20:
Support for Qwen3.5/3.6 MoE and notes
https://github.com/ggml-org/llama.cpp/discussions/21569#discussioncomment-16624433
https://github.com/ggml-org/llama.cpp/pull/22105/changes/d1d2c81caccc748eaaff32b6b7823bad090fd1dd
Z Lab's new benchmark
2026-04-22:
Re-uploaded gguf based on new drafter https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/commit/31977fbe13a86e8b961774f773058175676d89b8
Issues and Solutions
/src/models/dflash.cpp:39: GGML_ASSERT(model.target_tok_embd != nullptr && "DFlash decoder requires target model's tok_embd") failed
check if --dflash param is added to the llama-speculative-simple test
- Downloads last month
- 581
8-bit
16-bit
docker model run hf.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test: