Use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="GazTrab/Qwen3.6-27B-MTP-UD-IQ3_XXS-GGUF",
	filename="Qwen3.6-27B-MTP-UD-IQ3_XXS.gguf",
)
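response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# The result follows the OpenAI chat-completion schema.
print(response["choices"][0]["message"]["content"])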

Qwen3.6-27B-MTP-UD-IQ3_XXS GGUF

Qwen3.6-27B dense model with Multi-Token Prediction (MTP) head, quantized to IQ3_XXS using Unsloth Dynamic quantization.

This GGUF was created by grafting an MTP prediction head (block 64) onto the Unsloth IQ3_XXS base model, enabling speculative decoding without a separate draft model.

Key Specs

| Property | Value |
|---|---|
| Architecture | Qwen3.6 (hybrid SSM + attention, dense) |
| Parameters | 27.3B |
| Active parameters | 27.3B (dense, all active per token) |
| Quantization | IQ3_XXS (Unsloth Dynamic 2.0) |
| MTP layers | 1 (nextn_predict_layers=1) |
| Block count | 65 (64 base + 1 MTP) |
| Tensors | 866 |
| File size | 12.45 GB |
| Context length | 262,144 native (tested up to 56k with MTP) |

Performance (RTX 5080 16GB)

Benchmarked with llama.cpp (s015-mtp build, am17an PR #22673):

| Metric | Value |
|---|---|
| Token generation (avg) | 76 tok/s with MTP, 53 tok/s without |
| Token generation (peak) | 102 tok/s on code tasks |
| MTP acceptance rate | 90.6% aggregate (95-100% on code, 62-85% on creative) |
| GPU layers | 66/66 (fits entirely on 16 GB) |
| VRAM usage | ~12.5 GB model + 150 MiB recurrent state |
| GSM8K accuracy | 89/100 (89.0%, Wilson CI [85.7%, 96.4%]) |
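
For reference, a Wilson score interval like the one quoted for GSM8K can be computed as below. This is a minimal sketch of the standard formula, not the evaluation script used for this card; the function name and the default z = 1.96 (about 95% confidence) are assumptions. With 89 correct out of 100 it gives roughly 81% to 94%, so the bracket quoted in the table may come from a different method or sample.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    # The Wilson center is pulled slightly toward 0.5 relative to p.
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(89, 100)
print(f"GSM8K 89/100: {low:.1%} to {high:.1%}")
```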

CodeNeedle Positional Recall (http_server.py, 11 functions, ~50k char context)

| Config | Pass | Lines matched | Hallucinated |
|---|---|---|---|
| This model (q8_0 KV, 32k ctx) | 11/11 | 220/220 (100%) | 0 |
| This model (q4_0 KV, 56k ctx) | 11/11 | 218/220 (99.1%) | 1 |
| 35B-A3B MoE UD-Q4_K_XL | 11/11 | 206/220 (93.6%) | 12 |
| 27B MTP Q2_K_XL | 10/11 | 199/220 (90.5%) | 20 |

How to Use

Requires a llama.cpp build with MTP support (am17an's mtp-clean branch, PR #22673).

llama-server \
  -m Qwen3.6-27B-MTP-UD-IQ3_XXS.gguf \
  -c 32768 \
  --fit on \
  --spec-type mtp \
  -fa on \
  -t 20 \
  --no-mmap \
  --jinja \
  -ctk q8_0 -ctv q8_0

Important: --spec-type mtp must be passed explicitly to enable MTP speculation. Without it, the model loads the MTP head but does not draft tokens (~53 tok/s instead of ~76 tok/s).
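
Once the server is running, it can be queried over llama-server's OpenAI-compatible HTTP API. The sketch below uses only the standard library and assumes the default address http://127.0.0.1:8080, since the command above sets no host or port. The helper names and the max_tokens value are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,  # illustrative cap, adjust as needed
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """Send the prompt to a running llama-server and return the reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, with the server from the command above already running:
#   print(ask("What is the capital of France?"))
```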

Extended Context with q4_0 KV

For longer contexts (up to 56k stable), use q4_0 KV cache:

-c 57344 -ctk q4_0 -ctv q4_0 --spec-type mtp

q4_0 KV is near-lossless (218/220 CodeNeedle at 56k) and extends max stable context from 32k to 56k. Beyond 56k, the MTP compute buffer OOMs on 16 GB VRAM.
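
The memory trade-off between the two cache types can be estimated from ggml's block layouts: q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes. The sketch below is a rough estimate, not a measurement. The layer, head, and dimension values are hypothetical placeholders rather than the real Qwen3.6 shapes, but the q4_0/q8_0 ratio does not depend on them.

```python
# Bytes needed to store one block of 32 elements in each ggml cache type.
BLOCK_ELEMS = 32
BYTES_PER_BLOCK = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_ctx: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, cache_type: str) -> int:
    """Approximate KV cache size; only attention layers hold K/V tensors."""
    elems_per_token = 2 * n_attn_layers * n_kv_heads * head_dim  # K and V
    return n_ctx * elems_per_token * BYTES_PER_BLOCK[cache_type] // BLOCK_ELEMS


# HYPOTHETICAL dimensions, chosen only to make the arithmetic concrete.
dims = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)

q8_at_32k = kv_cache_bytes(32768, cache_type="q8_0", **dims)
q4_at_56k = kv_cache_bytes(57344, cache_type="q4_0", **dims)

print(f"q8_0 @ 32k: {q8_at_32k / 2**20:.0f} MiB")
print(f"q4_0 @ 56k: {q4_at_56k / 2**20:.0f} MiB")
print(f"ratio: {q4_at_56k / q8_at_32k:.2f}")
```

Under this estimate, a q4_0 cache at 57,344 tokens needs about 93% of the memory a q8_0 cache needs at 32,768 tokens, which is consistent with the longer context fitting in the same VRAM budget.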

How This Was Made

  1. Base model: unsloth/Qwen3.6-27B-UD-IQ3_XXS (12 GB, 851 tensors, 64 blocks)
  2. MTP head: Extracted from havenoammo's MTP GGUF collection, comprising 15 tensors for block 64 (attention + FFN + nextn prediction head), Q8_0 quantized, 436 MB
  3. Graft: Custom Python script using the gguf library (GGUFReader + GGUFWriter). Copies all base tensors + MTP tensors, sets block_count=65, adds nextn_predict_layers=1. SHA256-verified integrity.

The graft script is available at: scripts/graft-mtp.py (adapt for other base models)
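
The bookkeeping behind the graft steps can be sketched without touching any tensor data. The code below is not the graft script itself: it works only on tensor names, relies on the GGUF convention that per-block tensors are named blk.N.*, and uses hypothetical helper names. The real script additionally copies tensor data and metadata with GGUFReader and GGUFWriter.

```python
import hashlib
import re

BLOCK_RE = re.compile(r"^blk\.(\d+)\.")


def block_index(name: str):
    """Return the block number encoded in a tensor name, or None."""
    match = BLOCK_RE.match(name)
    return int(match.group(1)) if match else None


def plan_graft(base_names, donor_names, mtp_block: int) -> dict:
    """Decide which tensors the merged file should contain."""
    base_blocks = {block_index(n) for n in base_names} - {None}
    if mtp_block in base_blocks:
        raise ValueError(f"base model already has block {mtp_block}")
    mtp_names = [n for n in donor_names if block_index(n) == mtp_block]
    if not mtp_names:
        raise ValueError(f"donor has no tensors for block {mtp_block}")
    merged = list(base_names) + mtp_names
    if len(set(merged)) != len(merged):
        raise ValueError("duplicate tensor names after merge")
    return {
        "tensors": merged,
        "block_count": mtp_block + 1,   # e.g. 64 base blocks + 1 MTP block
        "nextn_predict_layers": 1,
    }


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB GGUF files need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With the counts given above, such a plan would hold 851 base tensors plus 15 MTP tensors, matching the 866 tensors reported for the final file.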

Why IQ3_XXS + MTP?

The "dream config" thesis: a 27B dense model at IQ3_XXS fits entirely on a 16 GB GPU (no PCIe bottleneck), while MTP provides speculative decoding without a separate draft model. This combination delivers:

  • Higher quality than MoE: 220/220 CodeNeedle vs 206/220 for 35B MoE (no expert routing = more coherent at low quant)
  • Faster than MoE: 76 tok/s vs 50 tok/s (no expert loading over PCIe)
  • Smaller than MoE: 12.45 GB vs 21 GB (fits fully on GPU with room to spare)
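
How much of the theoretical MTP gain is realized can be estimated from the reported throughput and acceptance rate. The sketch below uses the standard speculative-decoding expectation of 1 + a + ... + a^n tokens per verification pass. It assumes one draft token per pass (n = 1), which the single MTP layer suggests but the benchmarks do not confirm, and it treats acceptances as independent. The function name is illustrative.

```python
def mtp_summary(tps_with: float, tps_without: float,
                acceptance: float, draft_tokens: int = 1):
    """Compare the observed speedup with an idealized, overhead-free one."""
    observed = tps_with / tps_without
    # Expected tokens per verification pass with independent acceptances.
    ideal = sum(acceptance**k for k in range(draft_tokens + 1))
    return observed, ideal, observed / ideal


# Figures from the performance table: 76 vs 53 tok/s, 90.6% acceptance.
observed, ideal, efficiency = mtp_summary(76, 53, 0.906)
print(f"observed speedup: {observed:.2f}x")
print(f"ideal speedup:    {ideal:.2f}x")
print(f"realized share:   {efficiency:.0%}")
```

Under these assumptions the observed speedup is about 1.43x against an ideal of about 1.91x, so roughly three quarters of the ideal gain is realized; the remainder would be drafting and verification overhead.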

Limitations

  • MTP requires a custom llama.cpp build (not yet in mainline as of May 2026, but PR #22673 is close to merging)
  • TurboQuant KV cache (turbo4) is not compatible with the current MTP builds (build incompatibility, not a model issue)
  • Max stable context with MTP is ~56k on 16 GB VRAM (compute buffer OOM beyond that)
  • Without MTP, this is just a standard IQ3_XXS model running at ~53 tok/s

Credits

  • Unsloth: IQ3_XXS base quantization (Dynamic 2.0)
  • havenoammo: MTP head tensors + graft concept
  • am17an: llama.cpp MTP implementation (PR #22673)
  • Qwen Team: Qwen3.6-27B base model