Instructions to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test",
	filename="Qwen3.6-35B-A3B-DFlash-bf16.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
# Run inference directly in the terminal:
llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
# Run inference directly in the terminal:
llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
# Run inference directly in the terminal:
./llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16

Use Docker

docker model run hf.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16

LM Studio
Jan
Ollama
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Ollama:
```
ollama run hf.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
```

Unsloth Studio

How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test to start chatting

How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16

Run Hermes

hermes

Docker Model Runner
How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Docker Model Runner:
```
docker model run hf.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16
```

Lemonade

How to use lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test:BF16

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-DFlash-GGUF-Test-BF16

List all available models

lemonade list

f16 version please..

by jasoncow - opened Apr 20

Discussion

jasoncow

Apr 20

old GPU doesnot support bf16

throcky

Apr 21

LM studio 0.4.12 :

🥲 

Failed to load model.

Failed to load model

lym00

Owner Apr 21

Uploaded f16, though I haven’t had the chance to test them yet (ran into some errors and have been busy with work).

It doesn’t appear to be integrated into LM Studio yet, as the PR is still in draft.
From what I gather, the ggml team is refactoring and generalizing the codebase to make it cleaner and more reusable for future development.

Will continue tracking progress in llama.cpp and the PR: https://github.com/ggml-org/llama.cpp/pull/22105

lym00

Owner Apr 21

•

edited Apr 21

Okay, managed to run the tests.

Turns out I was missing the new --dflash argument in my tests

Using Qwen3-4B-DFlash-GGUF-Test, base tg speed was roughly ~20 t/s, so it's around 2x speedup on this machine.

throcky

Apr 24

had tried CUDA12 windows and Vulkan windows on v2.14.0 LM studio.

The log:
2026-04-24 09:17:52 [DEBUG]
LlamaV4::load called with model path: D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf
LlamaV4::load config: n_parallel=4 n_ctx=6144 kv_unified=true
2026-04-24 09:17:52 [DEBUG]
srv load_model: loading model 'D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf'
2026-04-24 09:17:52 [DEBUG]
llama_model_load_from_file_impl: using device Vulkan0 (NVIDIA GeForce RTX 5060 Laptop GPU) (0000:64:00.0) - 7042 MiB free
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: loaded meta data with 36 key-value pairs and 91 tensors from D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = dflash
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3.6 35B A3B DFlash
llama_model_loader: - kv 3: general.finetune str = 35b-DFlash
llama_model_loader: - kv 4: general.basename str = Qwen3.6
llama_model_loader: - kv 5: general.size_label str = A3B
llama_model_loader: - kv 6: dflash.block_count u32 = 8
llama_model_loader: - kv 7: dflash.context_length u32 = 262144
llama_model_loader: - kv 8: dflash.embedding_length u32 = 2048
llama_model_loader: - kv 9: dflash.feed_forward_length u32 = 6144
llama_model_loader: - kv 10: dflash.attention.head_count u32 = 32
llama_model_loader: - kv 11: dflash.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: dflash.rope.scaling.type str = yarn
llama_model_loader: - kv 13: dflash.rope.scaling.factor f32 = 64.000000
llama_model_loader: - kv 14: dflash.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 15: dflash.rope.scaling.yarn_beta_fast f32 = 32.000000
llama_model_loader: - kv 16: dflash.rope.scaling.yarn_beta_slow f32 = 1.000000
llama_model_loader: - kv 17: dflash.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 18: dflash.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: dflash.attention.key_length u32 = 128
llama_model_loader: - kv 20: dflash.attention.value_length u32 = 128
llama_model_loader: - kv 21: general.file_type u32 = 1
llama_model_loader: - kv 22: dflash.block_size u32 = 16
llama_model_loader: - kv 23: dflash.target_layer_ids arr[i32,5] = [2, 11, 20, 29, 38]
llama_model_loader: - kv 24: dflash.mask_token_id u32 = 248070
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen35
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,248320] = ["!", """, "#", "$", "%", "&", "'", ...
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 248044
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 248044
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 35: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - type f32: 34 tensors
llama_model_loader: - type f16: 57 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 904.15 MiB (16.00 BPW)
2026-04-24 09:17:52 [DEBUG]
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'dflash'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf'
srv load_model: failed to load model, 'D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf': error loading model: error loading model architecture: unknown model architecture: 'dflash'
2026-04-24 09:17:52 [DEBUG]
[LLMProcess] Failed to load model _0x4fd560 [Error]: Failed to load model.
at _0x3f312b.loadModel (C:\Users\ppatx\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:612618)
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
at async _0x3f312b.handleMessage (C:\Users\ppatx\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:604827) {
cause: 'Failed to load model',
suggestion: undefined,
errorData: undefined,
data: undefined,
displayData: undefined,
title: 'Failed to load model.'
}

lym00

Owner Apr 24

•

edited Apr 25

2026-04-24 Quick Recap:

Not yet available in LM Studio
We’ll need to wait for the GGML team (the upstream llama.cpp) to merge or release support for this feature.
Given their current priorities (e.g., API refactoring), this won’t happen immediately.
Drafter model must be paired with a target model
The drafter is not standalone, it requires a corresponding target model to function properly.

Must-Read:

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment