Instructions to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test", filename="Qwen3.6-35B-A3B-DFlash-bf16.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16 # Run inference directly in the terminal: llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16 # Run inference directly in the terminal: llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16 # Run inference directly in the terminal: ./llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16 # Run inference directly in the terminal: ./build/bin/llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Use Docker
docker model run hf.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
- LM Studio
- Jan
- Ollama
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Ollama:
ollama run hf.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
- Unsloth Studio
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test to start chatting
- Pi
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Run Hermes
hermes
- Docker Model Runner
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Docker Model Runner:
docker model run hf.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
- Lemonade
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
Run and chat with the model
lemonade run user.Qwen3.6-35B-A3B-DFlash-GGUF-Test-BF16
List all available models
lemonade list
f16 version please..
old GPU doesnot support bf16
LM studio 0.4.12 :
🥲
Failed to load model.
Failed to load model
Uploaded f16, though I haven’t had the chance to test them yet (ran into some errors and have been busy with work).
It doesn’t appear to be integrated into LM Studio yet, as the PR is still in draft.
From what I gather, the ggml team is refactoring and generalizing the codebase to make it cleaner and more reusable for future development.
Will continue tracking progress in llama.cpp and the PR: https://github.com/ggml-org/llama.cpp/pull/22105
Okay, managed to run the tests.
Turns out I was missing the new --dflash argument in my tests

Using Qwen3-4B-DFlash-GGUF-Test, base tg speed was roughly ~20 t/s, so it's around 2x speedup on this machine.
had tried CUDA12 windows and Vulkan windows on v2.14.0 LM studio.
The log:
2026-04-24 09:17:52 [DEBUG]
LlamaV4::load called with model path: D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf
LlamaV4::load config: n_parallel=4 n_ctx=6144 kv_unified=true
2026-04-24 09:17:52 [DEBUG]
srv load_model: loading model 'D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf'
2026-04-24 09:17:52 [DEBUG]
llama_model_load_from_file_impl: using device Vulkan0 (NVIDIA GeForce RTX 5060 Laptop GPU) (0000:64:00.0) - 7042 MiB free
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: loaded meta data with 36 key-value pairs and 91 tensors from D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = dflash
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3.6 35B A3B DFlash
llama_model_loader: - kv 3: general.finetune str = 35b-DFlash
llama_model_loader: - kv 4: general.basename str = Qwen3.6
llama_model_loader: - kv 5: general.size_label str = A3B
llama_model_loader: - kv 6: dflash.block_count u32 = 8
llama_model_loader: - kv 7: dflash.context_length u32 = 262144
llama_model_loader: - kv 8: dflash.embedding_length u32 = 2048
llama_model_loader: - kv 9: dflash.feed_forward_length u32 = 6144
llama_model_loader: - kv 10: dflash.attention.head_count u32 = 32
llama_model_loader: - kv 11: dflash.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: dflash.rope.scaling.type str = yarn
llama_model_loader: - kv 13: dflash.rope.scaling.factor f32 = 64.000000
llama_model_loader: - kv 14: dflash.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 15: dflash.rope.scaling.yarn_beta_fast f32 = 32.000000
llama_model_loader: - kv 16: dflash.rope.scaling.yarn_beta_slow f32 = 1.000000
llama_model_loader: - kv 17: dflash.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 18: dflash.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: dflash.attention.key_length u32 = 128
llama_model_loader: - kv 20: dflash.attention.value_length u32 = 128
llama_model_loader: - kv 21: general.file_type u32 = 1
llama_model_loader: - kv 22: dflash.block_size u32 = 16
llama_model_loader: - kv 23: dflash.target_layer_ids arr[i32,5] = [2, 11, 20, 29, 38]
llama_model_loader: - kv 24: dflash.mask_token_id u32 = 248070
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen35
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,248320] = ["!", """, "#", "$", "%", "&", "'", ...
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 248044
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 248044
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 35: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - type f32: 34 tensors
llama_model_loader: - type f16: 57 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 904.15 MiB (16.00 BPW)
2026-04-24 09:17:52 [DEBUG]
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'dflash'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf'
srv load_model: failed to load model, 'D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf': error loading model: error loading model architecture: unknown model architecture: 'dflash'
2026-04-24 09:17:52 [DEBUG]
[LLMProcess] Failed to load model _0x4fd560 [Error]: Failed to load model.
at _0x3f312b.loadModel (C:\Users\ppatx\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:612618)
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
at async _0x3f312b.handleMessage (C:\Users\ppatx\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:604827) {
cause: 'Failed to load model',
suggestion: undefined,
errorData: undefined,
data: undefined,
displayData: undefined,
title: 'Failed to load model.'
}
2026-04-24 Quick Recap:
Not yet available in LM Studio
We’ll need to wait for the GGML team (the upstream llama.cpp) to merge or release support for this feature.
Given their current priorities (e.g., API refactoring), this won’t happen immediately.Drafter model must be paired with a target model
The drafter is not standalone, it requires a corresponding target model to function properly.
Must-Read: