---
license: mit
---
# mtp-Qwen3.6-27B
## 🤔 What is this [HuggingFace repository](https://huggingface.co/Thireus/mtp-Qwen3.6-27B-THIREUS-BF16-SPECIAL_SPLIT/) about?
This repository provides **GGUF-quantized tensors** for the [mtp](https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md) layer(s) of the Qwen3.6-27B model (official repo: https://huggingface.co/Qwen/Qwen3.6-27B). These GGUF shards are designed to be used with **Thireus’ GGUF Tool Suite** (https://github.com/Thireus/GGUF-Tool-Suite), a collection of tools that automatically finds the perplexity-optimal mix of quantizations for any given model size target. With this GGUF Tool Suite, you can produce your own Dynamic 3.0 Quants recipes and achieve optimum accuracy & SOTA quantization performance. Give it a try here: https://gguf.thireus.com/quant_assign.html
- 📖 Documentation: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/docs
- 🔍 Example of GGUF recipes: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/recipe_examples
- 🍳 Cook your own recipe files: https://gguf.thireus.com/quant_assign.html
- ☁️ Download GGUF models from recipe files: https://gguf.thireus.com/quant_downloader.html
- 📂 Browse available models: https://huggingface.co/Thireus/collections and https://gguf.thireus.com
*tl;dr: Expand the details section below*
<details>
```
cd ~
# Make sure to install all ik_llama.cpp compilation dependencies...
apt install python3-dev python3-pip python3-venv python3-wheel python3-setuptools git acl netcat-openbsd cmake # pipx
# Obtain ik_llama's Thireus version - Windows/macOS/Linux builds available at https://github.com/Thireus/ik_llama.cpp/releases
git clone https://github.com/Thireus/ik_llama.cpp
cd ik_llama.cpp
git pull
# Build ik_llama.cpp
cmake -B build -DGGML_AVX=ON -DGGML_AVX2=ON -DLLAMA_CURL=OFF -DGGML_MAX_CONTEXTS=2048
cmake --build build --config Release -j16
cd ..
# Obtain Thireus' GGUF-Tool-Suite
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/Thireus/GGUF-Tool-Suite
# Download model quant mix from recipe file - you can also try the web version: https://gguf.thireus.com/quant_downloader.html
cd GGUF-Tool-Suite
rm -f download.conf # Make sure to copy the relevant download.conf for the model before running quant_assign.py
cp -f models/Qwen3.6-27B/download.conf . # Use the download.conf of the chosen model
mkdir -p kitchen && cd kitchen
# Obtain a recipe example for the chosen model from ../recipe_examples/
../quant_downloader.sh ../recipe_examples/ik_llama.cpp_recipes/Qwen3.6-27B.ROOT-3.5993bpw-11.3565ppl.1GB-GGUF_0GB-GPU_0GB-CPU.9888e4b_831ff04.recipe
# Other recipe examples can be found at https://github.com/Thireus/GGUF-Tool-Suite/tree/main/recipe_examples
# Launch ik_llama's llama-server:
ulimit -n 9999 # Lifts "too many open files" limitation on Linux
~/ik_llama.cpp/build/bin/llama-server \
-m Qwen3.6-27B-THIREUS-BF16-SPECIAL_TENSOR-00001-of-*.gguf \
-md mtp-Qwen3.6-27B-THIREUS-BF16-SPECIAL_TENSOR-00001-of-*.gguf --spec-type draft-mtp \
-fa auto -amb 1024 -ctk q8_0 -c 32768 -ngl 99 \
-b 4096 -ub 4096 --warmup-batch --no-mmap --threads 1 \
--main-gpu 0
```
</details>
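Before launching, it can help to confirm that every shard referenced by the header shard is actually on disk. The sketch below is not part of the Tool Suite (`quant_downloader.sh` already handles fetching and verification); `check_shards` is a hypothetical helper that only assumes the `-NNNNN-of-NNNNN.gguf` naming visible in the filenames above, and that all shards share the header shard's prefix.

```shell
# Hypothetical helper: checks that every shard implied by a header shard exists.
# Assumes shards are named <prefix>-00001-of-<total>.gguf ... <prefix>-<total>-of-<total>.gguf.
check_shards() {
  first=$1
  case $first in
    *-00001-of-*.gguf) ;;
    *) echo "not a header shard: $first" >&2; return 2 ;;
  esac
  base=${first%-00001-of-*.gguf}   # everything before the shard index
  total=${first##*-of-}            # zero-padded shard count plus extension
  total=${total%.gguf}
  count=$(printf '%s' "$total" | sed 's/^0*//')   # strip padding for arithmetic
  missing=0
  i=1
  while [ "$i" -le "$count" ]; do
    name=$(printf '%s-%05d-of-%s.gguf' "$base" "$i" "$total")
    if [ ! -f "$name" ]; then
      echo "missing: $name"
      missing=$((missing + 1))
    fi
    i=$((i + 1))
  done
  echo "missing shards: $missing of $count"
  [ "$missing" -eq 0 ]
}

# Usage (run from the kitchen directory; replace the path with your header shard):
# check_shards ./path/to/header-shard-00001-of-NNNNN.gguf
```

The function returns a non-zero status when anything is missing, so it can gate the launch command in a script.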
---
## ❓ Why does this Tool Suite exist?
1. **Compatibility & Speed** – [unsloth](https://huggingface.co/unsloth)’s dynamic quants may not always work optimally with `ik_llama.cpp`.
2. **Custom Rig Fit** – No off-the-shelf GGUF model perfectly matched my VRAM/RAM setup, so I built a way to tailor models and leverage extra VRAM/RAM to reduce perplexity.
3. **Automated PPL-Optimal Quantization** – To my knowledge, there was no flexible, automated open-source method to minimize perplexity for any bits-per-weight (bpw) target, so I created one with excellent results!
---
## 📊 How does it compare to other GGUFs?
Here’s how Qwen3.6-27B quantized with **Thireus’ GGUF Tool Suite** stacks up against other quantizers (lower perplexity = better at equal or lower bpw):

> _Note: The `recipe_examples` files illustrate good recipes. The Tool Suite computes the optimal ppl/bpw curve for you — just specify your target RAM, VRAM, and quant types, and `quant_assign.py` finds the best mix._
More perplexity/bpw graphs for other supported models: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/ppl_graphs
*Thireus' PPL benchmarks for Qwen3.6 are computed with the parameters `-ctk f16 -c 512 -b 512 -ub 512`. Changing any of these parameters will alter the PPL. In particular, reducing `-b 512 -ub 512` increases the PPL, while increasing them decreases the PPL.*
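To get a feel for what a bpw target means in practice, here is a back-of-the-envelope size estimate: total bytes ≈ parameters × bpw / 8. This is only a rough sketch; `estimate_gib` is a hypothetical helper, and the nominal 27e9 parameter count is an assumption (the real tensor count and metadata overhead will shift the result somewhat).

```shell
# Hypothetical helper: rough model size in GiB from parameter count and bits per weight.
estimate_gib() {
  # $1 = parameter count, $2 = bits per weight (bpw)
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * b / 8 / (1024 ^ 3) }'
}

# A nominal 27B-parameter model at the 3.5993 bpw of the example recipe:
estimate_gib 27000000000 3.5993   # → 11.31
```

Comparing this figure against your available VRAM/RAM gives a quick sanity check before asking `quant_assign.py` for a recipe.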
---
## 🚀 How do I get started?
Check out the [GGUF Tool Suite README](https://github.com/Thireus/GGUF-Tool-Suite) — focus on these sections:
1. ⚠️ **Requirements** – Which `ik_llama.cpp` (or `llama.cpp`) version to use and how to compile.
   - Windows binaries (no patching needed) at: https://github.com/Thireus/ik_llama.cpp/releases
2. 📥 **Download Model Shards** – Use `quant_downloader.sh` or [quant_downloader.html](https://gguf.thireus.com/quant_downloader.html) to fetch GGUF shards from any recipe.
   - Recipe examples: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/recipe_examples
3. 🧠 **Run a Downloaded Model** – Sample usage with `llama-cli`.
4. 🛠️ **Generate a Custom Recipe** – Produce recipes tailored to your VRAM/RAM target usage for optimum perplexity.
---
## ✅ Supported Models
Supported models are listed under `models/` in the [Tool Suite Github repo](https://github.com/Thireus/GGUF-Tool-Suite/tree/main/models). Presence of `ppl_results.csv` indicates official support and compatibility with `quant_assign.py`.
---
## 🤷‍♂️ Will I release baked dynamic quant GGUFs?
No, because I believe in **tailored quantization** for each user’s hardware. If you prefer ready-made shards, you are welcome to merge them via `llama-gguf-split --merge`, or request someone to publish them, or rely on generic GGUF dynamic quants such as [unsloth](https://huggingface.co/unsloth)'s.
Instead, I prefer to share examples of recipes so users can see exactly how they were produced (command included inside these recipe files) and tweak them for their own rigs. The `quant_downloader.sh` script or [quant_downloader.html](https://gguf.thireus.com/quant_downloader.html) (web port of this script) handles automatic fetching and verification of each shard. Note that recipes provided by [Ubergarm](https://huggingface.co/ubergarm) on his model cards are also compatible with `quant_downloader.sh` and [quant_downloader.html](https://gguf.thireus.com/quant_downloader.html), provided a "SPECIAL_SPLIT" version of these models exists (see https://gguf.thireus.com/).
Users who don’t trust the GGUF shards on HuggingFace can also quantize their own by passing recipe lines to `llama-quantize --custom-q` ([see example](https://github.com/Thireus/GGUF-Tool-Suite/blob/main/models/DeepSeek-R1-0528/DeepSeek-R1-0528-THIREUS-ANY-SPECIAL.sh#L482-L486)). Run `llama-quantize --help` to list compatible quants for `quant_assign.py`. This approach is especially useful if you prefer `llama.cpp` over `ik_llama.cpp`.
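As a rough illustration of "passing recipe lines", the sketch below flattens a recipe file into a single comma-separated string. Several things here are assumptions rather than documented behaviour: that recipe files contain one `regex=qtype` entry per line, that comment lines start with `#`, and that `--custom-q` accepts a comma-separated list. `recipe_to_custom_q` is a hypothetical helper, so check the linked example and `llama-quantize --help` before relying on it.

```shell
# Hypothetical helper: turn a recipe file into one comma-separated rule string.
# Assumes one "regex=qtype" rule per line and "#" comment lines.
recipe_to_custom_q() {
  grep -Ev '^[[:space:]]*(#|$)' "$1" |                     # drop comments and blank lines
    sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//' |        # trim surrounding whitespace
    paste -sd, -                                           # join the remaining lines with commas
}

# Usage sketch (file names are placeholders):
# ./llama-quantize --custom-q "$(recipe_to_custom_q my.recipe)" input-bf16.gguf output.gguf q8_0
```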
---
## 📦 What’s in this repository?
- **00001 GGUF header shard** – Contains metadata (tokens, chat template, tensor count, etc.). This metadata can be explored directly from the HuggingFace web interface after clicking on that shard.
- **Tensor shards** – Each shard holds one tensor; see `tensors.map` for names, quant types, sizes, SHA-256 hash, shard IDs, etc.
- **GPG-signed files** – `tensors.map` and header shard are signed with the key in [trusted-keys.asc](https://github.com/Thireus/GGUF-Tool-Suite/blob/main/trusted-keys.asc) for tamper detection.
- **Security note** – Some papers about various ways to attack GGUFs and LLMs are available online, such as https://arxiv.org/abs/2505.23786, and there are also more classic security exploits like CVE-2024-23496 and CVE-2024-25664 through CVE-2024-25668. Only use GGUFs from reputable, trusted authors—or alternatively self-quantize—to avoid potential exploits.
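If you want to spot-check a shard by hand, a SHA-256 comparison against the hash listed in `tensors.map` is straightforward. The sketch below is a minimal example, not the Tool Suite's own verification (`quant_downloader.sh` already does that); `verify_sha256` is a hypothetical helper, and you need to copy the expected hash out of `tensors.map` yourself, since its exact layout is not described here. It does not cover the GPG signature check.

```shell
# Hypothetical helper: succeed only if the file's SHA-256 matches the expected hex digest.
verify_sha256() {
  file=$1
  expected=$2
  if command -v sha256sum >/dev/null 2>&1; then
    actual=$(sha256sum "$file" | awk '{print $1}')       # GNU coreutils
  else
    actual=$(shasum -a 256 "$file" | awk '{print $1}')   # macOS fallback
  fi
  [ "$actual" = "$expected" ]
}

# Usage sketch (both arguments are placeholders):
# verify_sha256 some-shard.gguf <hash-from-tensors.map> && echo "OK" || echo "MISMATCH"
```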
---
## 💡 Pro Tips
You can easily download the BF16 model version to quantize your own shards:
```
mkdir kitchen
echo '.*=bf16' > kitchen/bf16.recipe
cd kitchen
../quant_downloader.sh bf16.recipe --qtype BF16
```
You can also quantize individual BF16 tensors without downloading every BF16 .gguf shard, using a special version of ik_llama.cpp's `llama-quantize` utility that provides the `--individual-tensors` option.
- Source code: https://github.com/Thireus/ik_llama.cpp/tree/th/quantize_individual_tensors
- Builds (macOS, Windows and Linux): https://github.com/Thireus/ik_llama.cpp/releases/tag/th-quantize_individual_tensors-b4210-7a44805
Usage example:
```
./llama-quantize --keep-split --imatrix imatrix_ubergarm.dat --individual-tensors 2,3,1094 Kimi-K2-Thinking-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01097.gguf my_new_shards.gguf iq3_s 12
```
For more information about how to use it: https://github.com/Thireus/GGUF-Tool-Suite/issues/45
You can produce your own quantized shards from Thireus' special BF16 model using `quantize_model.sh` found on https://github.com/Thireus/GGUF-Tool-Suite, for example:
```
./quantize_model.sh --model "Qwen3.6-122B-A10B" --qtype iq2_xxs
```
You can disable reasoning (thinking) when using jinja templates for supported models:
```
llama-server ... --jinja --chat-template-kwargs '{"enable_thinking": false}'
```
Enjoy optimized quantization! 🎉