Stable MTP first release!

#6
by danielhanchen - opened
Unsloth AI org
edited May 13

MTP GGUFs are still experimental, but for now they function ok

MTP speculative decoding gives ~1.5-2x faster generation. Build llama.cpp from the MTP PR branch:

Thanks for waiting - all quants should work well now - but remember these are still EXPERIMENTAL until the MTP branch is merged

apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone -b mtp-clean https://github.com/am17an/llama.cpp.git
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
export LLAMA_CACHE="unsloth/Qwen3.6-27B-MTP-GGUF"
./llama.cpp/llama-server \
    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 -c 8192 -fa on -np 1 \
    --spec-type mtp --spec-draft-n-max 2

Set -DGGML_CUDA=OFF for CPU/Metal. -np > 1 and --mmproj are not yet supported with MTP.

danielhanchen pinned discussion

I wish there was an uncensored variant as well

IQ4_NL on a 5060 Ti with 108k context in q4, ngl 51, and 6 threads on an AMD 9600X: I'm getting close to 20 tps, which is about 30% higher than the non-MTP version, but prompt speed is down to half... What can be done to get the 1.5-2x speedup?

LM Studio produced this (hmm, interesting):
llama_model_load: error loading model: missing tensor 'blk.64.ssm_conv1d.weight'
llama_model_load_from_file_impl: failed to load model
2026-05-12 18:30:42 [DEBUG]
common_init_from_params: failed to load model 'C:\Users\faks.lmstudio\models\unsloth\Qwen3.6-27B-MTP-GGUF\Qwen3.6-27B-Q4_K_S.gguf'
srv load_model: failed to load model, 'C:\Users\faks.lmstudio\models\unsloth\Qwen3.6-27B-MTP-GGUF\Qwen3.6-27B-Q4_K_S.gguf': error loading model: missing tensor 'blk.64.ssm_conv1d.weight'
2026-05-12 18:30:42 [DEBUG]
[LLMProcess] Failed to load model _0x580cd5 [Error]: Failed to load model.
at _0x3b146e.loadModel (C:\Users\faks\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:611860)
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
at async _0x3b146e.handleMessage (C:\Users\faks\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:603899) {
cause: 'Failed to load model',
suggestion: undefined,
errorData: undefined,
data: undefined,
displayData: undefined,
title: 'Failed to load model.'
}

❌ SSM tensor missing from the UD GGUF:
I am trying to load the Qwen3.6 UD GGUFs in the latest llama.cpp (b9119 / ef93e98d0), and loading fails during load_tensors.

The models:

Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
Qwen3.6-27B-UD-Q3_K_XL.gguf

fail with a missing SSM tensor error:

llama_model_load: error loading model: missing tensor 'blk.40.ssm_conv1d.weight'

and:

llama_model_load: error loading model: missing tensor 'blk.64.ssm_conv1d.weight'

llama.cpp:

correctly recognizes qwen35 / qwen35moe
detects the SSM parameters
reads the metadata normally
starts load_tensors
but fails because the ssm_conv1d.weight tensors do not exist in the GGUF.

Relevant information:

version: 9119 (ef93e98d0)

GPU:

RTX 5060 Ti
CUDA compute capability 12.0

The logs show that SSM support is detected:

ssm_d_conv = 4
ssm_d_inner = 6144

So this looks like a problem with the UD GGUF export/quantization, not with llama.cpp.

Does this work for the MI50 as well? I compiled it with:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
    -DGGML_HIP=ON \
    -DGPU_TARGETS=gfx906 \
    -DGGML_SCHED_MAX_COPIES=1 \
    -DLLAMA_CURL=ON \
    -DLLAMA_OPENSSL=ON \
    -DCMAKE_BUILD_TYPE=Release &&
cmake --build build --config Release -- -j 8

But I see zero speedup.

Running it with:
sudo ./llama-server \
    -m /root/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-MTP-GGUF/snapshots/be86552c0f5725958f7b2d16f97477398fca3f07/Qwen3.6-27B-Q5_K_M.gguf \
    --no-mmap \
    --host 0.0.0.0 \
    --port 5000 \
    --ctx-size 131072 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --n-gpu-layers 999 \
    -np 1 \
    -fa on \
    --jinja \
    --spec-type mtp \
    --spec-draft-n-max 3 \
    --ubatch-size 512 \
    --batch-size 2048

Just some observations.
Running the non-MTP Qwen3.6-27B-Q8_0.gguf on two Tesla P40s gives ~8-9 tokens/s generation speed and ~200 tokens/s prompt processing.
Running the MTP Qwen3.6-27B-Q8_0.gguf version on the same setup gives ~14-15 tokens/s generation, ~120 tokens/s prompt processing, and a 90%+ draft acceptance rate (oftentimes 100%).

The draft portion of the model always seems to want to load completely on the last GPU. Setting --device-draft or -mg didn't affect that, so --tensor-split 1,0.7 was used to balance the load.

Running Qwen3.6-27B-Q8_0.gguf on two Tesla P40s with a draft model (Qwen3.5-0.8B-Q8_0.gguf) gives about 19-20 tokens/s generation speed, ~180 tokens/s prompt processing, and anywhere from 60% to 90% draft acceptance rate. It also allows vision and -np > 1, with the drawback of added complexity from having to load multiple models.
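
To put the P40 numbers above on one scale, here is a minimal sketch that turns the reported ranges into speedup ratios. It uses the midpoint of each range, which is my own simplification, and `speedup` is a hypothetical helper name, not a llama.cpp tool.

```shell
# speedup BASE_LO BASE_HI NEW_LO NEW_HI
# Prints the ratio of the two range midpoints, rounded to two decimals.
speedup() {
    awk -v bl="$1" -v bh="$2" -v nl="$3" -v nh="$4" \
        'BEGIN { printf "%.2f\n", ((nl + nh) / 2) / ((bl + bh) / 2) }'
}

speedup 8 9 14 15         # MTP vs baseline, generation       -> 1.71
speedup 8 9 19 20         # separate draft model vs baseline  -> 2.29
speedup 200 200 120 120   # MTP vs baseline, prompt processing -> 0.60
```

On these figures MTP generation lands inside the advertised ~1.5-2x range, while prompt processing drops to roughly 0.6x of the baseline.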

Has "--spec-type mtp" been changed to draft-mtp?

error while handling argument "--spec-type": unknown speculative type: mtp
usage:
--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache
comma-separated list of types of speculative decoding to use (default:
none)
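
Judging by the usage text above, different builds may accept different names for the MTP mode (`mtp` in the first post, `draft-mtp` here). A rough sketch of picking the name from the help text follows. It assumes `llama-server --help` lists the accepted values the same way this error message does, and `pick_mtp_spec_type` is a hypothetical helper, not part of llama.cpp. A plain substring match can be fooled if "mtp" appears elsewhere in the help text.

```shell
# pick_mtp_spec_type HELP_TEXT
# Prints the MTP spec-type name found in HELP_TEXT, or returns 1 if none is listed.
pick_mtp_spec_type() {
    case "$1" in
        *draft-mtp*) echo "draft-mtp" ;;   # check the longer name first: it contains "mtp"
        *mtp*)       echo "mtp" ;;
        *)           return 1 ;;           # this build does not appear to list MTP at all
    esac
}

# Usage sketch (path as in the first post; not executed here):
#   SPEC_TYPE="$(pick_mtp_spec_type "$(./llama.cpp/llama-server --help 2>&1)")" \
#       && ./llama.cpp/llama-server ... --spec-type "$SPEC_TYPE"
```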

I get repeated errors. If I include "--spec-type mtp --spec-draft-n-max 2", it states that it has no idea what the command "--spec-draft-n-max 2" is.
If I just include "--spec-type mtp", it gives me a different runtime error: ....GGML_ASSERT(strncmp(n->name, LLAMA_TENSOR_NAME_FGDN_CH "-", prefix_len) == 0) failed
If I exclude both "--spec-type mtp --spec-draft-n-max 2", then it launches but with 0 speed benefit (to be expected).

Edit: I downloaded a newer llama.cpp and it works great. I went from around 41 t/s to ~90 t/s (RTX 5090). Very cool. Thank you! Oh, and that is just inference tokens/second. Prompt processing is obviously much higher, but I haven't done any extensive testing yet.

Edit #2: PP is around 2500 t/s. That seems faster than the original before MTP. Nice! Time to try 397B now…

danielhanchen unpinned discussion

Apple Silicon datapoint if it helps. On this hardware (M4 Max, 36GB) MTP ended up being a net loss.

Setup: Mac on Metal, brew llama.cpp 9200, unsloth/Qwen3.6-27B-MTP-GGUF (UD-Q4_K_XL), real meeting-summary prompt at ~23.5k input tokens, deterministic sampling (--temp 0 --top-k 1), one run per arm.

| Arm | Gen tok/s | Output toks | Wall | Draft accept |
|---|---|---|---|---|
| Baseline, no grammar | 7.98 | 3000 (capped) | 652s | - |
| Baseline + JSON grammar | 7.97 | 1478 (stop) | 441s | - |
| MTP, no grammar | 4.23 | 3000 (capped) | 973s | 32.8% (1988/6063) |
| MTP + JSON grammar | 4.21 | 1477 (stop) | 652s | 35.1% (1003/2856) |

Production-style comparison is MTP+JSON vs baseline+JSON: 652s vs 441s for almost identical 1477-token output. 1.48× slower end to end.

Generation throughput roughly halved with MTP regardless of grammar. Grammar didn't tank acceptance the way I half-expected; JSON acceptance was actually slightly higher than plain (35.1% vs 32.8%), probably because JSON structure is more predictable for the draft head. The catch is that at ~33% acceptance the draft+verify per-step work was bigger than the savings from skipped main-model decodes. Prompt eval was roughly neutral, small regression on MTP+JSON only (92 → 78 tok/s).

Caveats: N=1 prompt, N=1 Mac, and 23k input is on the longer side. Could easily look different on shorter prompts or on CUDA.
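
For intuition on why ~33% acceptance can be a net loss, here is a back-of-the-envelope sketch using the usual speculative-decoding estimate: if each drafted token is accepted independently with probability a and n tokens are drafted per step, the expected number of tokens emitted per verification step is (1 - a^(n+1)) / (1 - a). The independence assumption is mine, the reported accept ratio (accepted/drafted) is only a stand-in for a, and n = 2 is borrowed from the `--spec-draft-n-max 2` in the first post since the draft length for this run is not stated. `expected_tokens` is a hypothetical helper name.

```shell
# expected_tokens ALPHA N
# Expected tokens per verification step under i.i.d. per-token acceptance ALPHA
# with N drafted tokens: (1 - ALPHA^(N+1)) / (1 - ALPHA).
expected_tokens() {
    awk -v a="$1" -v n="$2" 'BEGIN {
        if (a >= 1) { printf "%.2f\n", n + 1; exit }   # every draft accepted
        printf "%.2f\n", (1 - a ^ (n + 1)) / (1 - a)
    }'
}

expected_tokens 0.33 2   # roughly the acceptance seen on the M4 Max -> 1.44
expected_tokens 0.90 2   # roughly the acceptance reported on P40s   -> 2.71
```

Under these assumptions a step at 33% acceptance yields about 1.44 tokens, so MTP only pays off if a draft-plus-verify step costs less than about 1.44 plain decode steps; at 90% the budget is about 2.71.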

I had the same error in LM Studio.

I resolved it by using the beta version of LM Studio (LM Studio 0.4.14 (Build 3)) and updating the runtime versions.

Regards,

It works fine with the stable version. My bad, sorry guys.

shimmyshimmer changed discussion status to closed
