- Qwen3.6-27B with MTP
- Up to 2.7ร faster with MTP ยท 262K context on 48 GB ยท Fixed chat template
Qwen3.6-27B with MTP
Up to 2.7ร faster with MTP ยท 262K context on 48 GB ยท Fixed chat template
Dense 27B model with vision, thinking, and tool use โ self-speculative decoding,
configurable KV cache, fixed Jinja template (tool calls and thinking actually work in C++ runtimes),
and a server with both OpenAI and Anthropic APIs.
Start the server
You need llama.cpp b9180 or newer (released 2026-05-16, includes MTP support). Install via Homebrew:
brew install llama.cpp
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--mmproj mmproj-Qwen3.6-27B-f16.gguf \
--spec-type draft-mtp --spec-draft-n-max 3 \
-c 262144 -fa off --n-predict -1 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.0 \
-ngl 99 --port 8081
| Flag | What it does | Impact |
|---|---|---|
--mmproj mmproj-Qwen3.6-27B-f16.gguf |
Vision encoder (text + image input) | Multimodal support (+0.9 GB) |
--spec-type draft-mtp --spec-draft-n-max 3 |
Multi-Token Prediction (built into the model) | Up to 2.7ร faster generation |
-c 262144 |
262K context window | Full native context on 64 GB Mac |
-fa off |
Disable Flash Attention | 37โ53% faster prefill on Apple Silicon at long context |
Sampling is set for coding tasks (temp 0.6, top_p 0.95). Adjust -m and -c for your hardware โ see the quant table below. For general chat, change to --temp 0.7 --top-p 0.80. Drop --mmproj if you don't need vision.
Optional flags
8-bit KV cache โ halves KV memory at minor quality cost. Use when f16 KV doesn't give enough context:
--cache-type-k q8_0 --cache-type-v q8_0
Custom chat template โ override the embedded template. Use this if your runtime doesn't support the bundled Jinja template, or if you need the official Qwen template instead of the fixed one:
--jinja --chat-template-file chat_template.jinja
MTP Speculative Decoding โ Should You Enable It?
MTP predicts extra tokens per step using the model's own MTP heads, then verifies them in one pass. No extra model or VRAM needed โ it's built into the weights. But it doesn't help equally for everything.
What controls the speedup is not your quant or temperature โ it's what you're generating.
Recommendation Matrix
| Use case | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|
| Coding / debugging | ๐ข | ๐ข | ๐ข | ๐ข | ๐ข |
| Factual Q&A / translation | ๐ก | ๐ข | ๐ข | ๐ข | ๐ข |
| Analysis / comparisons | ๐ด | ๐ก | ๐ก | ๐ข | ๐ข |
| Creative writing / roleplay | ๐ด | ๐ด | ๐ด | ๐ข | ๐ข |
๐ข speeds up ยท ๐ก marginal ยท ๐ด slower with MTP
Rules of thumb:
- Q8_0 and F16: always enable MTP โ even creative writing gets +48โ67%
- Coding at any quant: keep it on
- Q4_K_MโQ6_K creative tasks: turn it off (
--spec-type none)
Why Task Type Dominates
Draft token acceptance by task type (percentage of predicted tokens that are correct):
| Task | Acceptance | Examples |
|---|---|---|
| Code | 79โ89% | Functions, debugging, refactoring |
| Factual | 62โ70% | Definitions, translation, math proofs |
| Analysis | 48โ56% | Tradeoff breakdowns, comparisons |
| Creative | 39โ48% | Stories, poetry, brainstorming, roleplay |
A 40-point spread from code to creative. Temperature (0.0โ0.7) and quant level barely move the needle. What you're generating matters 40ร more than any other setting.
Speedup by Quant ร Task
Measured on M2 Max 96 GB, temp 0.7, N=3 draft tokens, long generation (2500 tokens):
| Quant | Base speed | Code | Factual | Analysis | Creative |
|---|---|---|---|---|---|
| F16 | 6.6 tok/s | +171% | +125% | +91% | +67% |
| Q8_0 | 11.4 tok/s | +123% | +90% | +64% | +48% |
| Q6_K | 13.4 tok/s | +50% | +31% | +13% | โ1% |
| Q5_K_M | 13.1 tok/s | +47% | +26% | +12% | โ4% |
| Q4_K_M | 15.1 tok/s | +31% | +16% | โ1% | โ9% |
Why does F16 benefit most? F16 at 51 GB crawls at 6.6 tok/s because every token means dragging the full model through memory. Accepted MTP drafts skip that expensive pass. Q4_K_M at 16 GB is already fast enough that the draft overhead is barely worth it on anything less predictable than code.
Draft Token Count
N=3 is optimal for all quants except F16 (where N=4 edges ahead: 17.9 vs 16.2 tok/s). Higher values waste compute on rejected tokens. Lower is too conservative.
Thinking Mode
With thinking enabled for coding tasks, Q8_0 draft acceptance drops from 87% to 73%. Still +94% speedup โ keep MTP on.
About these numbers
The comprehensive table above was measured with the original MTP implementation (llama.cpp PR #22673, the custom build that first added MTP support). Current mainline llama.cpp (b9180+, including Homebrew) gives ~10โ17% lower MTP speedup due to implementation differences. Verified on mainline b9260:
| Quant | Base speed | Code (mainline) | Creative (mainline) |
|---|---|---|---|
| Q8_0 | 11.4 tok/s | +86% (21.2 tok/s) | +32% (15.1 tok/s) |
| Q4_K_M | 15.2 tok/s | +18% (17.9 tok/s) | marginal (12.3 tok/s) |
The recommendation matrix above is based on relative patterns that are identical on both builds โ the advice doesn't change.
Which quant should I download?
Find your hardware below โ each row gives the best quant, KV cache type, and max context that fits.
Apple Silicon
Qwen3.6-27B is a hybrid model โ only 16 of 65 layers use KV cache (verified). The other 48 are linear attention (fixed 150 MiB recurrent state). KV memory is ~4ร less than a standard dense model. Runtimes that don't handle this (e.g. vllm) allocate KV for all 65 layers and show much higher memory usage.
Numbers below include all measured overhead (GPU compute buffers, CPU model/compute buffers โ ~4% of total). Must leave โฅ 8 GB for macOS (24 GB Macs: 6 GB; 16 GB Macs: 4 GB). Plus 2 GB safety margin.
| RAM | Quant | KV cache | Max context | Total used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ2_M |
q8_0 |
65K | 12.0 GB | โ |
| 24 GB | IQ3_M |
45K | 16.0 GB | โ | |
| 24 GB | IQ3_M |
q8_0 |
85K | 16.0 GB | โ |
| 32 GB | Q4_K_M |
77K | 22.0 GB | โ | |
| 32 GB | Q4_K_M |
q8_0 |
128K | 21.4 GB | โ |
| 32 GB | Q5_K_M |
34K | 22.0 GB | โ | |
| 36 GB | Q5_K_M |
q8_0 |
165K | 26.0 GB | โ |
| 36 GB | Q6_K |
45K | 26.0 GB | โ | |
| 48 GB | Q6_K |
q8_0 |
262K | 32.3 GB | โ |
| 48 GB | Q5_K_M |
262K | 36.5 GB | โ | |
| 48 GB | Q8_0 |
q8_0 |
243K | 38.0 GB | โ |
| 64 GB | Q8_0 |
262K | 45.9 GB | โ | |
| 64 GB | F16 |
37K | 54.0 GB | โ | |
| 96 GB | F16 |
262K | 68.4 GB | โ | |
| 128 GB | F16 |
262K | 68.4 GB | โ |
NVIDIA GPU
Same model memory as Apple Silicon, plus ~1 GB CUDA overhead. Numbers include 2 GB safety margin.
| VRAM | Quant | KV cache | Max context | Total VRAM used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ2_M |
q8_0 |
67K | 13.0 GB | โ |
| 24 GB | Q4_K_M |
61K | 21.0 GB | โ | |
| 24 GB | Q4_K_M |
q8_0 |
115K | 21.0 GB | โ |
| 24 GB | IQ3_M |
125K | 21.0 GB | โ | |
| 48 GB | Q6_K |
262K | 39.8 GB | โ | |
| 48 GB | Q8_0 |
q8_0 |
262K | 38.4 GB | โ |
| 80 GB | Q8_0 |
262K | 45.9 GB | โ | |
| 80 GB | F16 |
262K | 68.4 GB | โ |
Quick picks: 16 GB Mac โ
IQ2_Mยท 24 GB Mac โIQ3_Mยท 32 GB Mac โQ4_K_Mยท 36 GB Mac โQ5_K_Mยท 48 GB Mac โQ6_Kยท 64 GB Mac โQ8_0ยท 96 GB+ Mac โF16Leave KV cache at f16 (blank column) for best quality. Use
q8_0KV only when f16 doesn't give enough context.q4_0KV should not exceed 64K context.Vision adds ~0.9 GB for mmproj. macOS needs โฅ 8 GB for itself (24 GB Macs: 6 GB; 16 GB Macs: 4 GB). You can increase available memory:
sudo sysctl iogpu.wired_limit_mb=90112(88 GB on a 96 GB Mac). NVIDIA reserves ~1 GB for CUDA.
API usage
The server provides both OpenAI and Anthropic APIs.
OpenAI-compatible (/v1/chat/completions)
curl http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen","messages":[{"role":"user","content":"Hello"}]}'
Works with any OpenAI client โ point it at http://localhost:8081/v1.
Anthropic-compatible (/v1/messages)
curl http://localhost:8081/v1/messages \
-H "Content-Type: application/json" \
-d '{"model":"qwen","max_tokens":1024,"messages":[{"role":"user","content":"Hello"}]}'
Works with any Anthropic client. Supports streaming, tool use, and vision.
Claude Code
ANTHROPIC_BASE_URL=http://127.0.0.1:8081 claude
Tool use
curl http://localhost:8081/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"max_tokens": 1024,
"tools": [{
"name": "get_weather",
"description": "Get current weather for a location",
"input_schema": {
"type": "object",
"properties": {"location": {"type": "string"}},
"required": ["location"]
}
}],
"messages": [{"role": "user", "content": "What is the weather in Paris?"}]
}'
Vision
The main server command above already includes --mmproj. Just send an image:
curl http://localhost:8081/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"max_tokens": 1024,
"messages": [{"role": "user", "content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "'$(base64 < photo.jpg)'"}},
{"type": "text", "text": "Describe this image"}
]}]
}'
Note: Vision + MTP works on llama.cpp b9240+. Older builds (PR #22673) crashed when combining vision with MTP โ fixed in mainline.
Direct CLI usage
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--mmproj mmproj-Qwen3.6-27B-f16.gguf \
--spec-type draft-mtp --spec-draft-n-max 3 \
-c 4096 -n 2048 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.0 \
-ngl 99 -p "Your prompt here"
Downloads
| File | Size | Min. (4K ctx) | Recommended (80K ctx) | Max (262K ctx) |
|---|---|---|---|---|
Qwen3.6-27B-F16-mtp.gguf |
51 GB | 64 GB Mac ยท 80 GB GPU | 96 GB Mac ยท 80 GB GPU | 96 GB Mac ยท 80 GB GPU |
Qwen3.6-27B-Q8_0-mtp.gguf |
27 GB | 48 GB Mac ยท 48 GB GPU | 48 GB Mac ยท 48 GB GPU | 64 GB Mac ยท 48 GB GPU |
Qwen3.6-27B-Q6_K-mtp.gguf |
21 GB | 36 GB Mac ยท 48 GB GPU | 36 GB Mac ยท 48 GB GPU | 48 GB Mac ยท 48 GB GPU |
Qwen3.6-27B-Q5_K_M-mtp.gguf |
18 GB | 32 GB Mac ยท 24 GB GPU | 36 GB Mac ยท 48 GB GPU | 48 GB Mac ยท 48 GB GPU |
Qwen3.6-27B-Q4_K_M-mtp.gguf |
16 GB | 32 GB Mac ยท 24 GB GPU | 32 GB Mac ยท 24 GB GPU | 48 GB Mac ยท 48 GB GPU |
Qwen3.6-27B-IQ4_XS-mtp.gguf |
14 GB | 24 GB Mac ยท 24 GB GPU | 32 GB Mac ยท 48 GB GPU | 36 GB Mac ยท 48 GB GPU |
Qwen3.6-27B-IQ3_M-mtp.gguf |
12 GB | 24 GB Mac ยท 24 GB GPU | 24 GB Mac ยท 24 GB GPU | 36 GB Mac ยท 48 GB GPU |
Qwen3.6-27B-IQ2_M-mtp.gguf |
9.5 GB | 24 GB Mac ยท 16 GB GPU | 24 GB Mac ยท 24 GB GPU | 32 GB Mac ยท 24 GB GPU |
mmproj-Qwen3.6-27B-f16.gguf |
885 MB | Vision encoder (optional, any tier) | โ | โ |
All tiers include MTP heads and were quantized directly from the F16 conversion for maximum precision. I-quant tiers (IQ4_XS, IQ3_M, IQ2_M) use unsloth's importance matrix. Q5_K_M is the sweet spot โ use Q4_K_M if you're tight on RAM, Q8_0 for high quality, or F16 for long agentic coding sessions where quantization artifacts compound noticeably. GPU means NVIDIA (RTX 3060 = 12 GB, RTX 3090/4090 = 24 GB, A6000 = 48 GB, A100 = 80 GB).
Hardware numbers assume f16 KV for "Min." (4K) and q8_0 KV for "Recommended" (80K) and "Max" (262K).
System prompt & sampling
System prompt
The first line must be:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
The model underperforms without it. Append anything after that line.
Thinking toggle
Drop <|think_on|> or <|think_off|> in any message to toggle thinking. The template strips the tag so the model never sees it.
Sampling
From the official Qwen authors. Reserve 128K+ context for thinking mode.
| Mode | temp | top_p | top_k | repeat_penalty |
|---|---|---|---|---|
| Thinking (coding) | 0.6 | 0.95 | 20 | 1.0 |
| Thinking (general) | 1.0 | 0.95 | 20 | 1.0 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 1.0 |
Compatibility
| Runtime | Status | Why |
|---|---|---|
| llama.cpp (b9180+ / Homebrew) | Works fully | MTP support merged in b9180 (2026-05-16). brew install llama.cpp |
| llama.cpp (pre-b9180) | Does not load | missing tensor โ MTP heads not recognized |
| LM Studio | Does not load | Bundled llama.cpp may not yet include b9180+ |
| Ollama | Does not load | No speculative decoding support yet |
| koboldcpp | Unknown | Depends on bundled llama.cpp version |
LM Studio users: use the MLX 8-bit or MLX 4-bit instead โ full vision + tools + thinking, no MTP.
Other speculative decoding modes
Draft model (separate small model)
Pair with a smaller Qwen 3.5/3.6 model that shares the same tokenizer. Can give ~2.3ร speedup.
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
-md Qwen3.5-0.8B-Q8_0.gguf \
--spec-draft-n-max 10 -ngl 99 -ngld 99 \
-c 4096 -n 2048 --temp 0.7 \
-p "Your prompt"
ngram-mod (no extra model, benefits repeat prompts)
Uses cached n-grams from previous prompts. Works best for repeated/similar prompts.
--spec-type ngram-mod \
--spec-ngram-mod-n-match 24 \
--spec-ngram-mod-n-min 48 \
--spec-ngram-mod-n-max 64 \
--repeat-penalty 1.0
Memory requirements (detailed)
Approximate VRAM on Apple Silicon (unified memory), using Q5_K_M as reference. Includes 150 MiB recurrent state (constant, does not scale with context) plus ~1.5 GB compute/CPU overhead. Only 16 of 65 layers use KV cache โ the other 48 use linear attention. Numbers are measured from actual allocations, not estimates.
| Context | Model | KV (f16) | KV (q8_0) | Overhead | Total (f16) | Total (q8_0) | Min. Mac |
|---|---|---|---|---|---|---|---|
| 4K | 18.2 GB | 0.2 GB | 0.1 GB | 1.5 GB | 20.0 GB | 19.8 GB | 32 GB |
| 8K | 18.2 GB | 0.5 GB | 0.3 GB | 1.5 GB | 20.2 GB | 20.0 GB | 32 GB |
| 32K | 18.2 GB | 2.0 GB | 1.1 GB | 1.5 GB | 21.7 GB | 20.8 GB | 32 GB |
| 64K | 18.2 GB | 4.0 GB | 2.1 GB | 1.6 GB | 23.8 GB | 21.9 GB | 36 GB |
| 80K (recommended) | 18.2 GB | 5.0 GB | 2.7 GB | 1.6 GB | 24.8 GB | 22.5 GB | 36 GB |
| 128K | 18.2 GB | 8.0 GB | 4.2 GB | 1.6 GB | 27.8 GB | 24.1 GB | 48 GB |
| 262K (max native) | 18.2 GB | 16.0 GB | 8.5 GB | 2.3 GB | 36.5 GB | 29.0 GB | 48 GB |
"Total" = model + KV cache + recurrent state + compute/CPU overhead. macOS needs โฅ 8 GB (24 GB Macs: 6 GB; 16 GB Macs: 4 GB). With vision: add 0.9 GB for the mmproj.
KV cache options
| Type | Bits/val | KV size (80K ctx) | Quality | Speed | When to use |
|---|---|---|---|---|---|
f16 |
16 | 5.0 GB | Full | Baseline | Best quality โ use when RAM allows |
q8_0 |
8 | 2.7 GB | Negligible loss | Faster than f16 | When f16 KV doesn't give enough context |
q4_0 |
4 | 1.3 GB | Minor loss | Slightly slower | Max context on limited RAM (โค64K only) |
Recommendation: Leave KV at f16 for best quality. Use q8_0 when f16 doesn't give enough context. Reserve q4_0 for tight RAM โ and only up to 64K context.
Memory per quant tier (4K context, f16 KV)
| Quant | Model | KV + overhead | Total | Min. Mac |
|---|---|---|---|---|
| F16 | 48.5 GB | 3.3 GB | 51.8 GB | 64 GB |
| Q8_0 | 27.0 GB | 2.3 GB | 29.4 GB | 48 GB |
| Q6_K | 21.3 GB | 2.0 GB | 23.3 GB | 36 GB |
| Q5_K_M | 18.2 GB | 1.8 GB | 20.0 GB | 32 GB |
| Q4_K_M | 15.6 GB | 1.6 GB | 17.3 GB | 32 GB |
| IQ4_XS | 13.8 GB | 1.5 GB | 15.3 GB | 24 GB |
| IQ3_M | 11.9 GB | 1.4 GB | 13.3 GB | 24 GB |
| IQ2_M | 9.6 GB | 1.3 GB | 10.9 GB | 24 GB |
Chat template fixes
The bundled Jinja template fixes several bugs in the official Qwen 3.6 template:
- Tool calls crash on C++ engines. The official template uses Python's
|itemsfilter and|safe, which don't exist in C++ Jinja runtimes (llama.cpp, LM Studio). This template uses direct dictionary key lookups. - The
developerrole crashes. Modern APIs sendmessage.role == "developer". The official template throws an exception. This template maps it tosystem. - Empty
preserve_thinkingspam. The official template wraps every past turn in empty<think/>blocks, wasting context tokens. This template only emits thinking blocks with actual content. </thinking>hallucination handling. The model sometimes generates</thinking>instead of the expected closing tag. Both are handled gracefully.
See Qwen-Fixed-Chat-Templates for the standalone template repo.
Note: The fixed template works in llama.cpp but may cause errors in some frameworks (oh-my-pi, Codex, etc.) โ typically
Jinja Exception: System message must be at the beginning.If you hit this, use the default (unfixed) template instead.
Architecture details
| Spec | Value |
|---|---|
| Total params | 27.8B (dense, all active) |
| Layers | 65 (3ร linear attention + 1ร full attention, 16 repetitions) + 1 MTP layer |
| Attention | 24 Q heads, 4 KV heads (GQA), head_dim 256 |
| Linear attention | 16 QK heads, 48 V heads, head_dim 128 |
| FFN | intermediate_size 17408 |
| Context | 262K native, 1M+ with YaRN |
| RoPE | theta 10M, partial_rotary_factor 0.25, mrope_interleaved |
| Vocab | 248K tokens |
| Multi-token prediction | 1 MTP draft layer (15 tensors) |
| model_type | qwen3_5 |
Conversion details
Converted from official Qwen3.6-27B safetensors using mainline convert_hf_to_gguf.py from llama.cpp (b9180+, Homebrew v9240). MTP tensors are included by default โ no custom build needed. The fixed chat template (v19) from Qwen-Fixed-Chat-Templates was embedded in tokenizer_config.json before conversion.
Quantization source: F16 (not Q8_0) โ all tiers are quantized directly from the F16 conversion for maximum precision, avoiding double-quantization artifacts. Standard K-quant tiers (Q8_0, Q6_K, Q5_K_M, Q4_K_M) use no importance matrix. I-quant tiers (IQ4_XS, IQ3_M, IQ2_M) use unsloth's importance matrix (calibrated with chat template at 6Kโ12K context, 76 chunks, 496 entries). IQ2_M keeps MTP tensors at Q4_K since the importance matrix doesn't cover MTP layer tensors.
# Prerequisites
brew install llama.cpp
git clone --depth 1 --filter=blob:none --sparse https://github.com/ggml-org/llama.cpp.git llama.cpp-source
cd llama.cpp-source && git sparse-checkout set convert_hf_to_gguf.py conversion gguf-py
python3 -m venv .venv && .venv/bin/pip install torch numpy tqdm transformers sentencepiece pyyaml requests
.venv/bin/pip install -e gguf-py
# Embed fixed chat template (v19) into source tokenizer_config.json
python3 -c "
import json
with open('Qwen/Qwen3.6-27B/tokenizer_config.json') as f: d = json.load(f)
with open('Qwen-Fixed-Chat-Templates/chat_template_oneline.txt') as f: t = f.read().strip()
d['chat_template'] = t
with open('Qwen/Qwen3.6-27B/tokenizer_config.json', 'w') as f: json.dump(d, f, indent=2, ensure_ascii=False)
"
# Convert to F16 (text + MTP, ~51 GB, ~30-40 min)
.venv/bin/python convert_hf_to_gguf.py Qwen/Qwen3.6-27B/ \
--outtype f16 --outfile Qwen3.6-27B-F16-mtp.gguf --verbose
# Extract vision encoder
.venv/bin/python convert_hf_to_gguf.py Qwen/Qwen3.6-27B/ \
--outtype f16 --mmproj --outfile mmproj-Qwen3.6-27B-f16.gguf --verbose
# Quantize K-quant tiers from F16 (no imatrix)
F16=Qwen3.6-27B-F16-mtp.gguf
llama-quantize $F16 Qwen3.6-27B-Q8_0-mtp.gguf Q8_0
llama-quantize $F16 Qwen3.6-27B-Q6_K-mtp.gguf Q6_K
llama-quantize $F16 Qwen3.6-27B-Q5_K_M-mtp.gguf Q5_K_M
llama-quantize $F16 Qwen3.6-27B-Q4_K_M-mtp.gguf Q4_K_M
# Quantize I-quant tiers from F16 with unsloth imatrix
IMATRIX=imatrix_unsloth.gguf_file
llama-quantize --imatrix $IMATRIX $F16 Qwen3.6-27B-IQ4_XS-mtp.gguf IQ4_XS
llama-quantize --imatrix $IMATRIX $F16 Qwen3.6-27B-IQ3_M-mtp.gguf IQ3_M
# IQ2_M: MTP tensors at Q4_K (imatrix doesn't cover them)
llama-quantize --imatrix $IMATRIX --tensor-type blk.64.=q4_K $F16 Qwen3.6-27B-IQ2_M-mtp.gguf IQ2_M
Links
- Original model
- MLX 8-bit (LM Studio, Apple Silicon native, no MTP)
- MLX 4-bit
- Fixed chat templates
- Qwen3.6 blog post
- MTP benchmark details
Authorship
| Role | Author |
|---|---|
| Original model | Alibaba Cloud (Qwen team) |
| GGUF conversion + MTP + vision + fixed chat template + quantization | froggeric |
| Importance matrix | unsloth |
License
Apache-2.0, inherited from Qwen3.6.
- Downloads last month
- 38,987
2-bit
3-bit
4-bit
5-bit
6-bit
8-bit
16-bit
Model tree for froggeric/Qwen3.6-27B-MTP-GGUF
Base model
Qwen/Qwen3.6-27B