Instructions to use unsloth/Qwen3.6-27B-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use unsloth/Qwen3.6-27B-MTP-GGUF with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="unsloth/Qwen3.6-27B-MTP-GGUF")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("unsloth/Qwen3.6-27B-MTP-GGUF", dtype="auto")
- llama-cpp-python
How to use unsloth/Qwen3.6-27B-MTP-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/Qwen3.6-27B-MTP-GGUF",
    filename="BF16/Qwen3.6-27B-BF16-00001-of-00002.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use unsloth/Qwen3.6-27B-MTP-GGUF with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL

# Run inference directly in the terminal:
llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL

# Run inference directly in the terminal:
llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL

# Run inference directly in the terminal:
./llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL

# Run inference directly in the terminal:
./build/bin/llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
Use Docker
docker model run hf.co/unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
- LM Studio
- Jan
- vLLM
How to use unsloth/Qwen3.6-27B-MTP-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "unsloth/Qwen3.6-27B-MTP-GGUF"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "unsloth/Qwen3.6-27B-MTP-GGUF",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Describe this image in one sentence." },
                    { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
                ]
            }
        ]
    }'
Use Docker
docker model run hf.co/unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
- SGLang
How to use unsloth/Qwen3.6-27B-MTP-GGUF with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "unsloth/Qwen3.6-27B-MTP-GGUF" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "unsloth/Qwen3.6-27B-MTP-GGUF",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Describe this image in one sentence." },
                    { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
                ]
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "unsloth/Qwen3.6-27B-MTP-GGUF" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "unsloth/Qwen3.6-27B-MTP-GGUF",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Describe this image in one sentence." },
                    { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
                ]
            }
        ]
    }'
- Ollama
How to use unsloth/Qwen3.6-27B-MTP-GGUF with Ollama:
ollama run hf.co/unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
- Unsloth Studio
How to use unsloth/Qwen3.6-27B-MTP-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/Qwen3.6-27B-MTP-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/Qwen3.6-27B-MTP-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/Qwen3.6-27B-MTP-GGUF to start chatting
- Pi
How to use unsloth/Qwen3.6-27B-MTP-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use unsloth/Qwen3.6-27B-MTP-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
Run Hermes
hermes
- Atomic Chat
- Docker Model Runner
How to use unsloth/Qwen3.6-27B-MTP-GGUF with Docker Model Runner:
docker model run hf.co/unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
- Lemonade
How to use unsloth/Qwen3.6-27B-MTP-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
Run and chat with the model
lemonade run user.Qwen3.6-27B-MTP-GGUF-UD-Q4_K_XL
List all available models
lemonade list
Stable MTP first release!
MTP GGUFs are still experimental, but for now they work.
MTP speculative decoding for ~1.5-2x faster generation — build llama.cpp from the MTP PR branch
Thanks for waiting - all quants should work well now - but remember these are still EXPERIMENTAL until the MTP branch is merged
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone -b mtp-clean https://github.com/am17an/llama.cpp.git
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
export LLAMA_CACHE="unsloth/Qwen3.6-27B-MTP-GGUF"
./llama.cpp/llama-server \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
-ngl 99 -c 8192 -fa on -np 1 \
--spec-type mtp --spec-draft-n-max 2
Set -DGGML_CUDA=OFF for CPU/Metal. -np > 1 and --mmproj are not yet supported with MTP.
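Once the server from the command above is running, it can be queried over its OpenAI-compatible API. The following is a minimal sketch, not an official client. It assumes the server listens on llama-server's default port 8080 (the command above passes no --port), and the helper names `build_chat_request` and `chat` are made up for illustration.

```python
import json
import urllib.request

# Assumption: llama-server is on its default address; adjust if --host/--port were passed.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, max_tokens=256, temperature=0.0):
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": "unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(prompt, base_url=BASE_URL):
    """POST the request to the running server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Explain speculative decoding in two sentences."))
```

Speculative decoding happens entirely on the server side, so the request body is the same whether or not --spec-type is set.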
I wish there was an uncensored variant as well
IQ4_NL on a 5060 Ti with 108k context in q4, with ngl 51 and 6 threads on an AMD 9600X: I am getting close to 20 t/s, which is about 30% higher than the non-MTP version, but prompt speed is down to half. What can be done to get the 1.5-2x speedup?
LM Studio produced this (hmm, interesting):
llama_model_load: error loading model: missing tensor 'blk.64.ssm_conv1d.weight'
llama_model_load_from_file_impl: failed to load model
2026-05-12 18:30:42 [DEBUG]
common_init_from_params: failed to load model 'C:\Users\faks.lmstudio\models\unsloth\Qwen3.6-27B-MTP-GGUF\Qwen3.6-27B-Q4_K_S.gguf'
srv load_model: failed to load model, 'C:\Users\faks.lmstudio\models\unsloth\Qwen3.6-27B-MTP-GGUF\Qwen3.6-27B-Q4_K_S.gguf': error loading model: missing tensor 'blk.64.ssm_conv1d.weight'
2026-05-12 18:30:42 [DEBUG]
[LLMProcess] Failed to load model _0x580cd5 [Error]: Failed to load model.
at _0x3b146e.loadModel (C:\Users\faks\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:611860)
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
at async _0x3b146e.handleMessage (C:\Users\faks\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:603899) {
cause: 'Failed to load model',
suggestion: undefined,
errorData: undefined,
data: undefined,
displayData: undefined,
title: 'Failed to load model.'
}
❌ SSM tensor missing in the UD GGUF:
I am trying to load the Qwen3.6 UD GGUFs in the latest llama.cpp (b9119 / ef93e98d0), and loading fails during load_tensors.
The models:
Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
Qwen3.6-27B-UD-Q3_K_XL.gguf
fail with a missing SSM tensor error:
llama_model_load: error loading model: missing tensor 'blk.40.ssm_conv1d.weight'
and:
llama_model_load: error loading model: missing tensor 'blk.64.ssm_conv1d.weight'
llama.cpp:
recognizes qwen35 / qwen35moe correctly
detects the SSM parameters
reads the metadata normally
starts load_tensors
but fails because the ssm_conv1d.weight tensors do not exist in the GGUF.
Relevant information:
version: 9119 (ef93e98d0)
GPU:
RTX 5060 Ti
CUDA compute capability 12.0
The logs show that SSM support is detected:
ssm_d_conv = 4
ssm_d_inner = 6144
So this looks like a problem in the UD GGUF export/quantization, not in llama.cpp.
Does this work for the MI50 as well? I compiled it with:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
-DGGML_HIP=ON \
-DGPU_TARGETS=gfx906 \
-DGGML_SCHED_MAX_COPIES=1 \
-DLLAMA_CURL=ON \
-DLLAMA_OPENSSL=ON \
-DCMAKE_BUILD_TYPE=Release && \
cmake --build build --config Release -- -j 8
But I see zero speedup.
I am running it with:
sudo ./llama-server \
-m /root/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-MTP-GGUF/snapshots/be86552c0f5725958f7b2d16f97477398fca3f07/Qwen3.6-27B-Q5_K_M.gguf \
--no-mmap \
--host 0.0.0.0 \
--port 5000 \
--ctx-size 131072 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--n-gpu-layers 999 \
-np 1 \
-fa on \
--jinja \
--spec-type mtp \
--spec-draft-n-max 3 \
--ubatch-size 512 \
--batch-size 2048
Just some observations.
Running the non-MTP Qwen3.6-27B-Q8_0.gguf on two Tesla P40s gives ~8-9 tokens/s generation speed and ~200 tokens/s prompt processing.
Running the MTP version of Qwen3.6-27B-Q8_0.gguf on the same setup gives ~14-15 tokens/s generation, ~120 tokens/s prompt processing, and a 90%+ draft acceptance rate (often 100%).
The draft portion of the model always seems to want to load completely on the last GPU. Setting --device-draft or -mg didn't affect that, so --tensor-split 1,0.7 was used to balance the load.
Running Qwen3.6-27B-Q8_0.gguf on two Tesla P40s with a separate draft model (Qwen3.5-0.8B-Q8_0.gguf) gives about 19-20 tokens/s generation speed, ~180 tokens/s prompt processing, and anywhere from 60% to 90% draft acceptance rate. It also allows vision and -np > 1, with the drawback of the added complexity of loading multiple models.
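Because MTP here trades slower prompt processing for faster generation, whether it helps depends on the mix of prompt and output tokens. The sketch below estimates request time as prompt time plus generation time and finds the break-even ratio. It uses the midpoints of the ranges reported above, which is an assumption, and ignores model load time and other overhead. The function names are illustrative.

```python
def request_seconds(prompt_tokens, output_tokens, pp_tps, gen_tps):
    """Estimated wall time: prompt processing plus generation, nothing else."""
    return prompt_tokens / pp_tps + output_tokens / gen_tps


def break_even_prompt_ratio(base_pp, base_gen, mtp_pp, mtp_gen):
    """Prompt/output token ratio at which both configurations take equal time."""
    gen_saving = 1 / base_gen - 1 / mtp_gen  # seconds saved per output token
    pp_cost = 1 / mtp_pp - 1 / base_pp       # extra seconds per prompt token
    return gen_saving / pp_cost


# Midpoints of the reported ranges (assumption: the post gives ranges, not means).
BASELINE = {"pp_tps": 200.0, "gen_tps": 8.5}
MTP = {"pp_tps": 120.0, "gen_tps": 14.5}

ratio = break_even_prompt_ratio(
    BASELINE["pp_tps"], BASELINE["gen_tps"], MTP["pp_tps"], MTP["gen_tps"]
)
print(f"MTP is faster while prompt tokens < {ratio:.1f} x output tokens")

for prompt, output in [(2_000, 1_000), (30_000, 500)]:
    base = request_seconds(prompt, output, **BASELINE)
    mtp = request_seconds(prompt, output, **MTP)
    print(f"prompt={prompt} output={output}: baseline {base:.0f}s, MTP {mtp:.0f}s")
```

With these numbers, chat-style requests with short prompts favor MTP, while long-prompt, short-answer requests favor the non-MTP file.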
Has "--spec-type mtp" changed to "draft-mtp"?
error while handling argument "--spec-type": unknown speculative type: mtp
usage:
--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache
comma-separated list of types of speculative decoding to use (default:
none)
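Since the accepted value appears to differ between builds (mtp on the branch used above, draft-mtp in the usage text quoted here), a launcher script could read the help text and pick whichever value is listed. This is a sketch only. It assumes the help text lists the values as a comma-separated run right after --spec-type, as in the quote above; other builds may format it differently. `pick_mtp_spec_type` is a hypothetical helper.

```python
import re


def pick_mtp_spec_type(help_text):
    """Return the MTP value listed for --spec-type, or None if none is listed."""
    # Assumption: values follow the flag as one comma-separated run, as quoted above.
    match = re.search(r"--spec-type\s+([\w,\-]+)", help_text)
    if not match:
        return None
    values = match.group(1).split(",")
    for candidate in ("draft-mtp", "mtp"):
        if candidate in values:
            return candidate
    return None


# Usage line copied from the error output above.
HELP_TEXT = (
    "--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,"
    "ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache"
)
print(pick_mtp_spec_type(HELP_TEXT))  # prints "draft-mtp"
```

In a real script the help text could come from running `llama-server --help` through `subprocess.run(..., capture_output=True, text=True)`.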
I get repeated errors. If I include "--spec-type mtp --spec-draft-n-max 2", it states that it has no idea what the command "--spec-draft-n-max 2" is.
If I just include "--spec-type mtp", it gives me a different runtime error: ....GGML_ASSERT(strncmp(n->name, LLAMA_TENSOR_NAME_FGDN_CH "-", prefix_len) == 0) failed
If I exclude both "--spec-type mtp --spec-draft-n-max 2", then it launches but with 0 speed benefit (to be expected).
Edit: I downloaded a newer llama.cpp and it works great. I went from around 41 t/s to ~90 t/s (RTX 5090). Very cool. Thank you! Oh, and that is just generation tokens per second. Prompt processing is obviously much higher, but I haven't done any extensive testing yet.
Edit #2: PP is around 2500 t/s. That seems faster than the original before MTP. Nice! Time to try 397B now…
Apple Silicon datapoint if it helps. On this hardware (M4 Max, 36GB) MTP ended up being a net loss.
Setup: Mac on Metal, brew llama.cpp 9200, unsloth/Qwen3.6-27B-MTP-GGUF (UD-Q4_K_XL), real meeting-summary prompt at ~23.5k input tokens, deterministic sampling (--temp 0 --top-k 1), one run per arm.
| Arm | Gen tok/s | Output toks | Wall | Draft accept |
|---|---|---|---|---|
| Baseline, no grammar | 7.98 | 3000 (capped) | 652s | — |
| Baseline + JSON grammar | 7.97 | 1478 (stop) | 441s | — |
| MTP, no grammar | 4.23 | 3000 (capped) | 973s | 32.8% (1988/6063) |
| MTP + JSON grammar | 4.21 | 1477 (stop) | 652s | 35.1% (1003/2856) |
Production-style comparison is MTP+JSON vs baseline+JSON: 652s vs 441s for almost identical 1477-token output. 1.48× slower end to end.
Generation throughput roughly halved with MTP regardless of grammar. Grammar didn't tank acceptance the way I half-expected; JSON acceptance was actually slightly higher than plain (35.1% vs 32.8%), probably because JSON structure is more predictable for the draft head. The catch is that at ~33% acceptance the draft+verify per-step work was bigger than the savings from skipped main-model decodes. Prompt eval was roughly neutral, small regression on MTP+JSON only (92 → 78 tok/s).
Caveats: N=1 prompt, N=1 Mac, and 23k input is on the longer side. Could easily look different on shorter prompts or on CUDA.
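The derived figures quoted above (1.48× slower, throughput roughly halved, ~33% acceptance) can be re-checked from the table. The snippet below is just that arithmetic, with values copied from the table. The dictionary layout is an illustrative choice.

```python
# Figures copied from the table above.
runs = {
    "baseline+json": {"wall_s": 441, "gen_tps": 7.97},
    "mtp": {"wall_s": 973, "gen_tps": 4.23, "accepted": 1988, "drafted": 6063},
    "mtp+json": {"wall_s": 652, "gen_tps": 4.21, "accepted": 1003, "drafted": 2856},
}

slowdown = runs["mtp+json"]["wall_s"] / runs["baseline+json"]["wall_s"]
throughput_ratio = runs["mtp+json"]["gen_tps"] / runs["baseline+json"]["gen_tps"]
print(f"end-to-end slowdown: {slowdown:.2f}x")                 # 1.48x
print(f"generation throughput ratio: {throughput_ratio:.2f}")  # 0.53

acceptance = {}
for name in ("mtp", "mtp+json"):
    run = runs[name]
    acceptance[name] = run["accepted"] / run["drafted"]
    print(f"{name}: draft acceptance {acceptance[name]:.1%}")  # 32.8%, then 35.1%
```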
