Stable MTP first release!

by danielhanchen - opened May 12

Discussion

danielhanchen

Unsloth AI org May 12

•

edited May 13

MTP GGUFs are still experimental, but for now they function OK.

MTP speculative decoding gives ~1.5-2x faster generation. To use it, build llama.cpp from the MTP PR branch.

Thanks for waiting. All quants should work well now, but remember these are still EXPERIMENTAL until the MTP branch is merged.

apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone -b mtp-clean https://github.com/am17an/llama.cpp.git
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp

export LLAMA_CACHE="unsloth/Qwen3.6-27B-MTP-GGUF"
./llama.cpp/llama-server \
    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 -c 8192 -fa on -np 1 \
    --spec-type mtp --spec-draft-n-max 2

Set -DGGML_CUDA=OFF for CPU/Metal. -np > 1 and --mmproj are not yet supported with MTP.
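
Once the server is up, it can be queried over its OpenAI-compatible HTTP API. Here is a minimal Python sketch, assuming llama-server is listening on localhost at its default port 8080 (the command above sets neither --host nor --port). The helper names and the prompt are illustrative only, and the request is text-only because --mmproj is not supported with MTP yet.

```python
import json
import urllib.request

# Assumption: llama-server's default address; change it if you pass --host/--port.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, max_tokens=256):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
#   print(chat("Say hello in one short sentence."))
```

Building the request separately from sending it makes it easy to inspect the JSON body before pointing it at a live server.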

danielhanchen pinned discussion May 12

JohnUser

May 12

I wish there was an uncensored variant as well

mayankiit04

May 12

Running IQ4_NL on a 5060 Ti with 108k context in q4, -ngl 51, and 6 threads on an AMD 9600X, I am getting close to 20 tps, which is about 30% higher than the non-MTP version, but prompt speed is down to half. What can be done to get the 1.5-2x speedup?

hakerx1

May 12

•

edited May 12

LM Studio produced this (hmm, interesting):
llama_model_load: error loading model: missing tensor 'blk.64.ssm_conv1d.weight'
llama_model_load_from_file_impl: failed to load model
2026-05-12 18:30:42 [DEBUG]
common_init_from_params: failed to load model 'C:\Users\faks\.lmstudio\models\unsloth\Qwen3.6-27B-MTP-GGUF\Qwen3.6-27B-Q4_K_S.gguf'
srv load_model: failed to load model, 'C:\Users\faks\.lmstudio\models\unsloth\Qwen3.6-27B-MTP-GGUF\Qwen3.6-27B-Q4_K_S.gguf': error loading model: missing tensor 'blk.64.ssm_conv1d.weight'
2026-05-12 18:30:42 [DEBUG]
[LLMProcess] Failed to load model _0x580cd5 [Error]: Failed to load model.
at _0x3b146e.loadModel (C:\Users\faks\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:611860)
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
at async _0x3b146e.handleMessage (C:\Users\faks\AppData\Local\Programs\LM Studio\resources\app.webpack\lib\llmworker.js:1:603899) {
cause: 'Failed to load model',
suggestion: undefined,
errorData: undefined,
data: undefined,
displayData: undefined,
title: 'Failed to load model.'
}

JeanLima2024

May 12

❌ SSM tensor missing from the UD GGUF:
I am trying to load the Qwen3.6 UD GGUFs in the latest llama.cpp (b9119 / ef93e98d0), and loading fails during load_tensors.

The models:

Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
Qwen3.6-27B-UD-Q3_K_XL.gguf

fail with a missing SSM tensor error:

llama_model_load: error loading model: missing tensor 'blk.40.ssm_conv1d.weight'

llama_model_load: error loading model: missing tensor 'blk.64.ssm_conv1d.weight'

llama.cpp:

correctly recognizes qwen35 / qwen35moe
detects the SSM parameters
reads the metadata normally
starts load_tensors
but fails because the ssm_conv1d.weight tensors do not exist in the GGUF.

Relevant information:

version: 9119 (ef93e98d0)

GPU:

RTX 5060 Ti
CUDA compute capability 12.0

The logs show that SSM support was detected:

ssm_d_conv = 4
ssm_d_inner = 6144

So this looks like a problem with the UD GGUF export/quantization, not with llama.cpp.
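
When comparing reports like this one, it can help to pull the block index and tensor name out of the loader error programmatically. This is a small sketch that parses the two error lines quoted above. The function name is hypothetical, and the regular expression is only assumed to match this particular message format.

```python
import re

# Assumed message format, taken from the two log lines quoted in this post.
MISSING_TENSOR = re.compile(r"missing tensor '(?:blk\.(\d+)\.)?([^']+)'")


def parse_missing_tensor(line):
    """Return (block_index, tensor_name), or None if the line does not match."""
    match = MISSING_TENSOR.search(line)
    if match is None:
        return None
    # Tensors outside a block (no "blk.N." prefix) get a block index of None.
    block = int(match.group(1)) if match.group(1) is not None else None
    return block, match.group(2)


logs = [
    "llama_model_load: error loading model: missing tensor 'blk.40.ssm_conv1d.weight'",
    "llama_model_load: error loading model: missing tensor 'blk.64.ssm_conv1d.weight'",
]
for line in logs:
    print(parse_missing_tensor(line))
```

Both failures name the same tensor and differ only in the block index. Index 64 is also the one that appears in the LM Studio log for the 27B file earlier in this thread.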

gurkburk

May 12

Does this work on the MI50 as well? I compiled it with:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
    -DGGML_HIP=ON \
    -DGPU_TARGETS=gfx906 \
    -DGGML_SCHED_MAX_COPIES=1 \
    -DLLAMA_CURL=ON \
    -DLLAMA_OPENSSL=ON \
    -DCMAKE_BUILD_TYPE=Release &&
cmake --build build --config Release -- -j 8

But I see zero speedup.

I am running it with:
sudo ./llama-server \
    -m /root/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-MTP-GGUF/snapshots/be86552c0f5725958f7b2d16f97477398fca3f07/Qwen3.6-27B-Q5_K_M.gguf \
    --no-mmap \
    --host 0.0.0.0 \
    --port 5000 \
    --ctx-size 131072 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --n-gpu-layers 999 \
    -np 1 \
    -fa on \
    --jinja \
    --spec-type mtp \
    --spec-draft-n-max 3 \
    --ubatch-size 512 \
    --batch-size 2048

Mikeczyk

May 13

•

edited May 13

Just some observations.
Running the non-MTP Qwen3.6-27B-Q8_0.gguf on two Tesla P40s gives ~8-9 tokens/s generation and ~200 tokens/s prompt processing.
Running the MTP Qwen3.6-27B-Q8_0.gguf on the same setup gives ~14-15 tokens/s generation and ~120 tokens/s prompt processing, with a 90%+ draft acceptance rate (often 100%).

The draft portion of the model always seems to load entirely on the last GPU. Setting --device-draft or -mg didn't change that, so I used --tensor-split 1,0.7 to balance the load.

Running Qwen3.6-27B-Q8_0.gguf on two Tesla P40s with a separate draft model (Qwen3.5-0.8B-Q8_0.gguf) gives about 19-20 tokens/s generation, ~180 tokens/s prompt processing, and a draft acceptance rate anywhere from 60% to 90%. It also allows vision and -np > 1, with the drawback of the added complexity of loading multiple models.
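
To put the three configurations above side by side, the quoted figures can be turned into ratios against the non-MTP baseline. A rough sketch, assuming each quoted range can be reduced to its midpoint (a simplification made only to get single numbers to compare):

```python
# Figures quoted in the post above; ranges are reduced to their midpoints
# (an assumption, not something the post states).
runs = {
    "baseline":    {"gen": (8 + 9) / 2,   "pp": 200},
    "mtp":         {"gen": (14 + 15) / 2, "pp": 120},
    "draft model": {"gen": (19 + 20) / 2, "pp": 180},
}


def ratio_vs_baseline(name, metric):
    """Throughput of a run relative to the non-MTP baseline."""
    return runs[name][metric] / runs["baseline"][metric]


for name in ("mtp", "draft model"):
    print(f"{name}: generation x{ratio_vs_baseline(name, 'gen'):.2f}, "
          f"prompt processing x{ratio_vs_baseline(name, 'pp'):.2f}")
```

With these midpoints, MTP comes out at roughly 1.7x generation and 0.6x prompt processing, while the separate draft model comes out at roughly 2.3x and 0.9x.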

evankuo

May 13

Has "--spec-type mtp" been changed to draft-mtp?

error while handling argument "--spec-type": unknown speculative type: mtp
usage:
--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache
comma-separated list of types of speculative decoding to use (default:
none)
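
Since the accepted value appears to differ between builds, a launcher script could read the usage text and pick whichever MTP spelling is listed. A small sketch, assuming the usage output keeps the comma-separated format shown above; the function name is hypothetical:

```python
import re


def pick_mtp_spec_type(help_text):
    """Return the MTP value listed for --spec-type, or None if there is none."""
    # Assumption: the choices follow "--spec-type" as one comma-separated token.
    match = re.search(r"--spec-type\s+([\w,-]+)", help_text)
    if match is None:
        return None
    choices = match.group(1).split(",")
    # Prefer the newer spelling seen in the usage text, then the older one.
    for name in ("draft-mtp", "mtp"):
        if name in choices:
            return name
    return None


usage = """error while handling argument "--spec-type": unknown speculative type: mtp
usage:
--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache
comma-separated list of types of speculative decoding to use (default:
none)"""
print(pick_mtp_spec_type(usage))
```

The quoted error line does not confuse the search, because there "--spec-type" is followed by a quote character rather than whitespace.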

x-polyglot-x

May 13

•

edited 26 days ago

I get repeated errors. If I include "--spec-type mtp --spec-draft-n-max 2", it says it does not recognize the argument "--spec-draft-n-max 2".
If I just include "--spec-type mtp", it gives me a different runtime error: ....GGML_ASSERT(strncmp(n->name, LLAMA_TENSOR_NAME_FGDN_CH "-", prefix_len) == 0) failed
If I exclude both flags ("--spec-type mtp --spec-draft-n-max 2"), it launches but with no speed benefit (as expected).

Edit: I downloaded a newer llama.cpp and it works great. I went from around 41 t/s to ~90 t/s (RTX 5090). Very cool. Thank you! That figure is generation tokens per second only. Prompt processing is obviously much higher, but I haven't done any extensive testing yet.

Edit #2: PP is around 2500 t/s. That seems faster than the original before MTP. Nice! Time to try 397B now…

danielhanchen unpinned discussion May 16

jp29

29 days ago

Apple Silicon datapoint if it helps. On this hardware (M4 Max, 36GB) MTP ended up being a net loss.

Setup: Mac on Metal, brew llama.cpp 9200, unsloth/Qwen3.6-27B-MTP-GGUF (UD-Q4_K_XL), real meeting-summary prompt at ~23.5k input tokens, deterministic sampling (--temp 0 --top-k 1), one run per arm.

| Arm | Gen tok/s | Output tokens | Wall time | Draft accept |
|---|---|---|---|---|
| Baseline, no grammar | 7.98 | 3000 (capped) | 652s | n/a |
| Baseline + JSON grammar | 7.97 | 1478 (stop) | 441s | n/a |
| MTP, no grammar | 4.23 | 3000 (capped) | 973s | 32.8% (1988/6063) |
| MTP + JSON grammar | 4.21 | 1477 (stop) | 652s | 35.1% (1003/2856) |

Production-style comparison is MTP+JSON vs baseline+JSON: 652s vs 441s for almost identical 1477-token output. 1.48× slower end to end.

Generation throughput roughly halved with MTP regardless of grammar. Grammar didn't tank acceptance the way I half-expected; JSON acceptance was actually slightly higher than plain (35.1% vs 32.8%), probably because JSON structure is more predictable for the draft head. The catch is that at ~33% acceptance the draft+verify per-step work was bigger than the savings from skipped main-model decodes. Prompt eval was roughly neutral, small regression on MTP+JSON only (92 → 78 tok/s).

Caveats: N=1 prompt, N=1 Mac, and 23k input is on the longer side. Could easily look different on shorter prompts or on CUDA.
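
The ratios in this post can be rechecked from the table, and the draft counters allow a rough per-step estimate. A sketch in Python, with one loud assumption that the post does not confirm: each verification step emits one token from the main model plus the accepted draft tokens, so the number of steps is taken to be output tokens minus accepted tokens.

```python
# Numbers copied from the table above.
arms = {
    "baseline":      {"tok_s": 7.98, "tokens": 3000, "wall_s": 652},
    "baseline+json": {"tok_s": 7.97, "tokens": 1478, "wall_s": 441},
    "mtp":           {"tok_s": 4.23, "tokens": 3000, "wall_s": 973,
                      "accepted": 1988, "drafted": 6063},
    "mtp+json":      {"tok_s": 4.21, "tokens": 1477, "wall_s": 652,
                      "accepted": 1003, "drafted": 2856},
}


def acceptance(arm):
    """Fraction of drafted tokens that the main model accepted."""
    return arms[arm]["accepted"] / arms[arm]["drafted"]


def estimated_steps(arm):
    """ASSUMPTION: every verify step emits 1 main-model token plus the
    accepted draft tokens, so steps = output tokens - accepted tokens."""
    return arms[arm]["tokens"] - arms[arm]["accepted"]


for mtp, base in (("mtp", "baseline"), ("mtp+json", "baseline+json")):
    steps = estimated_steps(mtp)
    print(mtp,
          f"accept={acceptance(mtp):.1%}",
          f"throughput ratio={arms[mtp]['tok_s'] / arms[base]['tok_s']:.2f}",
          f"wall ratio={arms[mtp]['wall_s'] / arms[base]['wall_s']:.2f}",
          f"tokens/step={arms[mtp]['tokens'] / steps:.2f}",
          f"drafted/step={arms[mtp]['drafted'] / steps:.2f}")
```

Under that assumption, each MTP step yields about 3 tokens while drafting about 6. A step would then have to cost less than about 3 baseline decodes to break even, and a throughput ratio near 0.53 suggests it cost considerably more than that on this machine.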

slovera

25 days ago

•

edited 25 days ago

I had the same error in LM Studio.

I resolved it by using the beta version of LM Studio (LM Studio 0.4.14 (Build 3)) and updating the runtime versions.

Regards,

hakerx1

23 days ago

> I had the same error in LM Studio.
>
> I resolved it by using the beta version of LM Studio (LM Studio 0.4.14 (Build 3)) and updating the runtime versions.
>
> Regards,

It works fine with the stable version. My bad, sorry guys.

shimmyshimmer changed discussion status to closed 11 days ago
