f16 version please..

#1
by jasoncow - opened

Older GPUs do not support bf16.

LM Studio 0.4.12:

🥲 

Failed to load model.

Failed to load model
Owner

Uploaded f16, though I haven’t had the chance to test them yet (ran into some errors and have been busy with work).

It doesn’t appear to be integrated into LM Studio yet, as the PR is still in draft.
From what I gather, the ggml team is refactoring and generalizing the codebase to make it cleaner and more reusable for future development.

Will continue tracking progress in llama.cpp and the PR: https://github.com/ggml-org/llama.cpp/pull/22105

Okay, managed to run the tests.

Turns out I was missing the new --dflash argument in my tests.

Using Qwen3-4B-DFlash-GGUF-Test, the baseline token-generation (tg) speed was roughly 20 t/s, so that's around a 2x speedup on this machine.

I tried both CUDA 12 (Windows) and Vulkan (Windows) on LM Studio v2.14.0.

The log:
2026-04-24 09:17:52 [DEBUG]
LlamaV4::load called with model path: D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf
LlamaV4::load config: n_parallel=4 n_ctx=6144 kv_unified=true
2026-04-24 09:17:52 [DEBUG]
srv load_model: loading model 'D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf'
2026-04-24 09:17:52 [DEBUG]
llama_model_load_from_file_impl: using device Vulkan0 (NVIDIA GeForce RTX 5060 Laptop GPU) (0000:64:00.0) - 7042 MiB free
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: loaded meta data with 36 key-value pairs and 91 tensors from D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = dflash
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3.6 35B A3B DFlash
llama_model_loader: - kv 3: general.finetune str = 35b-DFlash
llama_model_loader: - kv 4: general.basename str = Qwen3.6
llama_model_loader: - kv 5: general.size_label str = A3B
llama_model_loader: - kv 6: dflash.block_count u32 = 8
llama_model_loader: - kv 7: dflash.context_length u32 = 262144
llama_model_loader: - kv 8: dflash.embedding_length u32 = 2048
llama_model_loader: - kv 9: dflash.feed_forward_length u32 = 6144
llama_model_loader: - kv 10: dflash.attention.head_count u32 = 32
llama_model_loader: - kv 11: dflash.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: dflash.rope.scaling.type str = yarn
llama_model_loader: - kv 13: dflash.rope.scaling.factor f32 = 64.000000
llama_model_loader: - kv 14: dflash.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 15: dflash.rope.scaling.yarn_beta_fast f32 = 32.000000
llama_model_loader: - kv 16: dflash.rope.scaling.yarn_beta_slow f32 = 1.000000
llama_model_loader: - kv 17: dflash.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 18: dflash.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: dflash.attention.key_length u32 = 128
llama_model_loader: - kv 20: dflash.attention.value_length u32 = 128
llama_model_loader: - kv 21: general.file_type u32 = 1
llama_model_loader: - kv 22: dflash.block_size u32 = 16
llama_model_loader: - kv 23: dflash.target_layer_ids arr[i32,5] = [2, 11, 20, 29, 38]
llama_model_loader: - kv 24: dflash.mask_token_id u32 = 248070
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen35
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,248320] = ["!", """, "#", "$", "%", "&", "'", ...
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2026-04-24 09:17:52 [DEBUG]
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 248044
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 248044
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 35: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - type f32: 34 tensors
llama_model_loader: - type f16: 57 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 904.15 MiB (16.00 BPW)
2026-04-24 09:17:52 [DEBUG]
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'dflash'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf'
srv load_model: failed to load model, 'D:\models\unsloth\lym00\Qwen3.6-35B-A3B-DFlash-GGUF-Test\Qwen3.6-35B-A3B-DFlash-f16.gguf': error loading model: error loading model architecture: unknown model architecture: 'dflash'
2026-04-24 09:17:52 [DEBUG]
[LLMProcess] Failed to load model _0x4fd560 [Error]: Failed to load model.
at _0x3f312b.loadModel (C:\Users\ppatx\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:612618)
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
at async _0x3f312b.handleMessage (C:\Users\ppatx\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:604827) {
cause: 'Failed to load model',
suggestion: undefined,
errorData: undefined,
data: undefined,
displayData: undefined,
title: 'Failed to load model.'
}
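
For anyone who wants to check a file before loading it: the log above shows `general.architecture str = dflash` as kv 0, and the loader then rejects it with "unknown model architecture". Below is a minimal sketch that reads that field straight from the GGUF header. It assumes a little-endian GGUF v2/v3 file and that `general.architecture` is the first metadata key, as it is in the log; the function names are my own and not part of llama.cpp or LM Studio.

```python
import struct

GGUF_TYPE_STRING = 8  # value-type id for strings in the GGUF format


def read_gguf_string(f):
    """Read a GGUF string: a uint64 byte length followed by UTF-8 bytes."""
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def read_architecture(path):
    """Return general.architecture if it is the first metadata key, else None.

    Assumption: GGUF v2/v3, little-endian, architecture stored as kv 0.
    """
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        # Header: uint32 version, uint64 tensor count, uint64 metadata kv count.
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        if n_kv == 0:
            return None
        key = read_gguf_string(f)
        (value_type,) = struct.unpack("<I", f.read(4))
        if key != "general.architecture" or value_type != GGUF_TYPE_STRING:
            return None
        return read_gguf_string(f)
```

If this returns an architecture name that your runtime's llama.cpp build does not know, the load will fail the same way as in the log, regardless of whether the file is bf16 or f16.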

2026-04-24 Quick Recap:

  1. Not yet available in LM Studio
    We’ll need to wait for the ggml team (upstream llama.cpp) to merge or release support for this feature.
    Given their current priorities (e.g., API refactoring), this won’t happen immediately.

  2. Drafter model must be paired with a target model
    The drafter is not standalone; it requires a corresponding target model to function properly.
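
To illustrate point 2, here is a toy sketch of generic greedy draft-and-verify (speculative) decoding. It is not DFlash's actual algorithm or the llama.cpp implementation; `speculative_step`, `target_next`, and `draft_next` are hypothetical names, and the "models" are plain Python callables that map a token list to the next token. It only shows why the drafter is useless alone: every drafted token has to be checked against the target.

```python
def speculative_step(target_next, draft_next, prefix, block_size):
    """Run one greedy draft-and-verify step and return the accepted tokens."""
    # 1. The drafter cheaply proposes block_size tokens.
    proposed = []
    ctx = list(prefix)
    for _ in range(block_size):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target verifies them. Real implementations do this in one
    #    batched forward pass; this toy calls the target once per token.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Whole block accepted: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted
```

The output always matches what the target would have produced on its own; the drafter only changes how many tokens come out per target pass, which is where a speedup like the one reported above would come from.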

Must-Read:
