Qwen3.6-27B-MTP-UD-IQ3_XXS GGUF
Qwen3.6-27B dense model with Multi-Token Prediction (MTP) head, quantized to IQ3_XXS using Unsloth Dynamic quantization.
This GGUF was created by grafting an MTP prediction head (block 64) onto the Unsloth IQ3_XXS base model, enabling speculative decoding without a separate draft model.
Key Specs
| Property | Value |
|---|---|
| Architecture | Qwen3.6 (hybrid SSM + attention, dense) |
| Parameters | 27.3B |
| Active parameters | 27.3B (dense, all active per token) |
| Quantization | IQ3_XXS (Unsloth Dynamic 2.0) |
| MTP layers | 1 (nextn_predict_layers=1) |
| Block count | 65 (64 base + 1 MTP) |
| Tensors | 866 |
| File size | 12.45 GB |
| Context length | 262,144 native (tested up to 56k with MTP) |
Performance (RTX 5080 16GB)
Benchmarked with llama.cpp (s015-mtp build, am17an PR #22673):
| Metric | Value |
|---|---|
| Token generation (avg) | 76 tok/s with MTP, 53 tok/s without |
| Token generation (peak) | 102 tok/s on code tasks |
| MTP acceptance rate | 90.6% aggregate (95-100% on code, 62-85% on creative) |
| GPU layers | 66/66 (fits entirely on 16 GB) |
| VRAM usage | ~12.5 GB model + 150 MiB recurrent state |
| GSM8K accuracy | 89/100 (89.0%, Wilson CI [85.7%, 96.4%]) |
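For readers who want to recompute the interval on the GSM8K row, here is a minimal sketch of the standard Wilson score interval. It assumes a 95% confidence level (z = 1.96) and no continuity correction; the card does not state which variant was used. Under those assumptions, 89/100 gives roughly 81% to 94%, which does not match the interval quoted in the table, so that figure may come from a different method or confidence level.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# 89 correct answers out of 100 GSM8K problems, as reported in the table.
low, high = wilson_interval(89, 100)
print(f"Wilson 95% interval: {low:.1%} to {high:.1%}")
```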
CodeNeedle Positional Recall (http_server.py, 11 functions, ~50k char context)
| Config | Pass | Lines matched | Hallucinated |
|---|---|---|---|
| This model (q8_0 KV, 32k ctx) | 11/11 | 220/220 (100%) | 0 |
| This model (q4_0 KV, 56k ctx) | 11/11 | 218/220 (99.1%) | 1 |
| 35B-A3B MoE UD-Q4_K_XL | 11/11 | 206/220 (93.6%) | 12 |
| 27B MTP Q2_K_XL | 10/11 | 199/220 (90.5%) | 20 |
How to Use
Requires a llama.cpp build with MTP support (am17an's mtp-clean branch, PR #22673).
llama-server \
-m Qwen3.6-27B-MTP-UD-IQ3_XXS.gguf \
-c 32768 \
--fit on \
--spec-type mtp \
-fa on \
-t 20 \
--no-mmap \
--jinja \
-ctk q8_0 -ctv q8_0
Important: --spec-type mtp must be explicitly passed to enable MTP speculation. Without it, the model loads the MTP head but does not draft tokens (~53 tok/s instead of ~76 tok/s).
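Once the server is running, it can be queried over its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes llama-server is listening on its default address, `127.0.0.1:8080`; adjust `BASE_URL` if `--host` or `--port` was passed. The helper names `build_chat_request` and `chat` are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumption: llama-server was started without --host/--port overrides.
BASE_URL = "http://127.0.0.1:8080/v1"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What is the capital of France?"))` should print the model's answer. MTP speculation happens server-side, so the client code is the same with or without `--spec-type mtp`.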
Extended Context with q4_0 KV
For longer contexts (up to 56k stable), use q4_0 KV cache:
-c 57344 -ctk q4_0 -ctv q4_0 --spec-type mtp
q4_0 KV is near-lossless (218/220 CodeNeedle at 56k) and extends max stable context from 32k to 56k. Beyond 56k, the MTP compute buffer OOMs on 16 GB VRAM.
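A rough way to see why q4_0 leaves room for the longer context is to compare KV cache sizes. The sketch below uses the llama.cpp block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The layer, head, and dimension counts are hypothetical placeholders, not this model's real configuration; in a hybrid SSM + attention model only the attention layers keep a KV cache. The ratio at the end does not depend on those placeholders. It covers the KV cache only, not the MTP compute buffer that the text above names as the limiting factor.

```python
# Bytes per cached element for common llama.cpp KV cache types (block size 32).
# q8_0: 32 int8 values + one fp16 scale  = 34 bytes per 32 elements.
# q4_0: 16 bytes of packed 4-bit values + one fp16 scale = 18 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, cache_type: str) -> float:
    """Approximate KV cache size: one K and one V vector per token per attention layer."""
    elements = 2 * n_ctx * n_attn_layers * n_kv_heads * head_dim
    return elements * BYTES_PER_ELEMENT[cache_type]


# HYPOTHETICAL architecture values, for illustration only.
N_ATTN_LAYERS = 16
N_KV_HEADS = 4
HEAD_DIM = 256

q8_32k = kv_cache_bytes(32768, N_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, "q8_0")
q4_56k = kv_cache_bytes(57344, N_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, "q4_0")

print(f"q8_0 @ 32k: {q8_32k / 2**20:.0f} MiB (placeholder architecture)")
print(f"q4_0 @ 56k: {q4_56k / 2**20:.0f} MiB (placeholder architecture)")
print(f"ratio q4_0@56k / q8_0@32k: {q4_56k / q8_32k:.3f}")
```

Whatever the real layer and head counts are, the ratio is (57344 / 32768) × (18 / 34), so a q4_0 cache at 56k takes slightly less memory than a q8_0 cache at 32k.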
How This Was Made
- Base model: unsloth/Qwen3.6-27B-UD-IQ3_XXS (12 GB, 851 tensors, 64 blocks)
- MTP head: Extracted from havenoammo's MTP GGUF collection; 15 tensors for block 64 (attention + FFN + nextn prediction head), Q8_0 quantized, 436 MB
- Graft: Custom Python script using the `gguf` library (GGUFReader + GGUFWriter). Copies all base tensors + MTP tensors, sets `block_count=65`, adds `nextn_predict_layers=1`. SHA256-verified integrity.
The graft script is available at: scripts/graft-mtp.py (adapt for other base models)
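The graft script itself is not reproduced here. As a hedged illustration, a sanity check on its output could look like the sketch below: hash the file, then count tensors per block to confirm that block 64 carries the MTP tensors. This is not the author's script. The helper names are hypothetical, and `read_tensor_names` assumes the `gguf` Python package is installed and relies only on `GGUFReader(...).tensors` and each tensor's `name`.

```python
import hashlib
from collections import Counter


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a (possibly multi-GB) file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tensors_per_block(names: list[str]) -> Counter:
    """Count tensors per block index, using llama.cpp's 'blk.<N>.<rest>' naming."""
    counts: Counter = Counter()
    for name in names:
        parts = name.split(".")
        if len(parts) > 2 and parts[0] == "blk" and parts[1].isdigit():
            counts[int(parts[1])] += 1
    return counts


def read_tensor_names(gguf_path: str) -> list[str]:
    """List tensor names in a GGUF file (requires `pip install gguf`)."""
    from gguf import GGUFReader  # imported lazily so the helpers above work without it

    return [tensor.name for tensor in GGUFReader(gguf_path).tensors]
```

If the figures above hold, `len(read_tensor_names(path))` on the grafted file should be 866 (851 base + 15 MTP), and `tensors_per_block(...)[64]` should be 15.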
Why IQ3_XXS + MTP?
The "dream config" thesis: a 27B dense model at IQ3_XXS fits entirely on a 16 GB GPU (no PCIe bottleneck), while MTP provides speculative decoding without a separate draft model. This combination delivers:
- Higher quality than MoE: 220/220 CodeNeedle vs 206/220 for 35B MoE (no expert routing = more coherent at low quant)
- Faster than MoE: 76 tok/s vs 50 tok/s (no expert loading over PCIe)
- Smaller than MoE: 12.45 GB vs 21 GB (fits fully on GPU with room to spare)
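The throughput figures can be put in context with a back-of-the-envelope model. The sketch below compares the measured MTP speedup with an idealized upper bound. It assumes one drafted token per verification pass, independent acceptance at the aggregate rate, and zero drafting cost. Those are simplifying assumptions, not a description of how the llama.cpp MTP build schedules drafts.

```python
def measured_speedup(tps_with: float, tps_without: float) -> float:
    """Ratio of generation throughput with speculation to throughput without it."""
    return tps_with / tps_without


def ideal_tokens_per_pass(acceptance_rate: float, draft_tokens: int = 1) -> float:
    """Expected tokens emitted per verification pass.

    One token is always produced by the main model; draft k only counts if
    drafts 1..k were all accepted (assumed independent, same acceptance rate).
    """
    return 1 + sum(acceptance_rate ** k for k in range(1, draft_tokens + 1))


# Figures reported in this card.
measured = measured_speedup(76, 53)        # MTP on vs. off, same model
vs_moe = measured_speedup(76, 50)          # this model vs. the 35B MoE figure
ideal = ideal_tokens_per_pass(0.906)       # assumes a single draft token per pass

print(f"measured MTP speedup: {measured:.2f}x")
print(f"speedup over MoE:     {vs_moe:.2f}x")
print(f"idealized bound:      {ideal:.2f}x")
print(f"fraction of bound:    {measured / ideal:.0%}")
```

The gap between the measured ratio and the idealized bound is what drafting and verification overhead would account for under this simple model.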
Limitations
- MTP requires a custom llama.cpp build (not yet in mainline as of May 2026, but PR #22673 is close to merging)
- TurboQuant KV cache (`turbo4`) is not compatible with the current MTP builds (build incompatibility, not a model issue)
- Max stable context with MTP is ~56k on 16 GB VRAM (compute buffer OOM beyond that)
- Without MTP, this is just a standard IQ3_XXS model running at ~53 tok/s
Credits
- Unsloth: IQ3_XXS base quantization (Dynamic 2.0)
- havenoammo: MTP head tensors + graft concept
- am17an: llama.cpp MTP implementation (PR #22673)
- Qwen Team: Qwen3.6-27B base model