Qwen3.6-27B-MTP-ROCmFP4-GGUF

PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3.6-27B-MTP
4-BIT ROCmFP4 · imatrix + f16 EMBEDDINGS · MTP SELF-SPECULATIVE DECODE · VISION-CAPABLE · SINGLE AMD APU

    
      FORMAT
ROCmFP4 4-BIT

      PRECISION
4.82 BPW

      SIZE
16.5 GB

      CONTEXT
262 K

    

      DRAFT
MTP n-max 5

      VISION
QWEN3-VL

      BACKEND
VULKAN0

      CALIBRATION
imatrix (CODE)

    

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, Ollama, Jan, or koboldcpp. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix:

git clone https://github.com/charlie12345/rocmfp4-llama

cd rocmfp4-llama && git checkout mtp-rocmfp4-strix

env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

NOTE // Ignore HuggingFace's auto-detected "F16" / 16-bit badge — its parser only knows standard GGUF quant types, can't read ROCmFP4, and "sees" only the genuinely-f16 token embeddings. These are ~4.8 bpw 4-bit files; pick by filename in Files and versions.

01 · FILES

File	Size	Output head	Pick if
`…-COHERENT-imatrix-embF16-headQ6.gguf` ★	~17.5 GB	Q6_K	recommended — all-dual-scale body, lowest measured KL vs BF16 (§05)
`…-STRIX-imatrix-embF16-headQ6.gguf`	16.9 GB	Q6_K	fast body — ~same fidelity, slightly smaller/faster
`…-STRIX-imatrix-embF16.gguf`	16.5 GB	ROCmFP4 4-bit	smallest / fastest decode

All three share f16 embeddings + the code-calibrated imatrix + MTP head. The COHERENT build adds the all-dual-scale body — lowest measured KL vs BF16 at ~the same decode speed, so it's the recommended default; the STRIX builds use the faster single-scale body and differ only in their output head. Repo also bundles the mmproj-F32.gguf Qwen3-VL vision projector, chat_template.jinja (froggeric's unified Qwen3.6 template — tool calls + inline <|think_off|>/<|think_on|> + vision), and the qwen3.6-27b-code.imatrix (339 chunks) for exact reproduction.

token_embd	F16 (full precision — a lookup, ~zero decode cost)
attention K/V (+ fused QKV)	`q4_0_rocmfp4` (dual-scale)
FFN, lm-head, rest	`q4_0_rocmfp4_fast` (single-scale)
MTP head	preserved (`blk.64.nextn.*`)
imatrix	code-calibrated, applied to all 496 quantizable tensors

02 · QUICK START

Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3.6-27B-MTP-ROCmFP4-COHERENT-imatrix-embF16-headQ6.gguf \
  --alias qwen27b-mtp \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 262144 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024

The last two lines enable vision — the mmproj-F32.gguf Qwen3-VL projector is bundled in this repo (projection_dim 5120); omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set (see §04).

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	allow use of the full 128 GB unified memory
`-dev Vulkan0`	run on Vulkan (KHR_coopmat) — beats ROCm/HIP here, ~+1.7× prefill
`-ngl 999 · -fa on`	offload all layers · flash attention
`-c 262144`	context length (256K)
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch (256 is the optimum here — bigger ubatch is slower) · CPU threads
`-ctk f16 · -ctv f16`	f16 KV cache — how we run it; drop to `q8_0`/`q4_0` to use less memory
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing (every 256 tok, keep 32, reuse ≥256-tok prefix) + 64 GB resident reuse cache
`--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0`	Qwen3.6 "precise coding" sampling (temp 1.0 for general tasks)
`--spec-type draft-mtp · --spec-draft-n-max 5`	built-in MTP head, self-speculative; draft depth 5
`--spec-draft-device Vulkan0 · -ngl all · type-k/v f16`	draft head on Vulkan, fully offloaded, f16 draft KV
`--chat-template-file chat_template.jinja`	bundled froggeric template (tool calls + think-toggle + vision)
`--reasoning on --reasoning-format deepseek + kwargs {preserve_thinking:true}`	clean `content` + `reasoning_content`; keep `<think>` across turns so cross-turn cache survives
`--jinja --parallel 1 --metrics --no-mmap`	apply template · single slot · metrics · weights in RAM

03 · CODING AGENT / OPENCODE

Multi-turn prompt-cache reuse is what makes this usable. Qwen3.6's recurrent (SSM) state can't be partially rewound, so multi-turn reuse needs a context checkpoint at/before the divergence point. Two defaults otherwise force a full re-prefill every turn — both fixed by the flags above:

Checkpoint cadence. Default -cpent is 8192, so prompts under 8K never get a usable checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256 (checkpoint every 256 tokens, keep 32, reuse a matching prefix of ≥256 tokens). Verified: a shared 3,000-token prefix re-prefill dropped 12.4 s → ~0.1 s.
Thinking text breaking the prefix match. --reasoning-format controls where <think> goes. deepseek (used here) gives clean content + reasoning_content, auto-paired with --chat-template-kwargs '{"preserve_thinking": true}' so the template keeps <think> for all turns and reuse holds (with OpenCode the large stable leading context reuses via checkpoints regardless). none leaves <think> inline in content so any content-echoing client gets reuse; deepseek-legacy/auto do not reuse.
Vision + --cache-reuse. With --mmproj loaded the server disables the --cache-reuse feature (it logs "cache_reuse is not supported by multimodal"); we haven't measured whether ordinary cross-turn caching survives with vision (see §04).

--jinja is required so the chat template (and preserve_thinking) apply.

OpenCode — point it at the server as an OpenAI-compatible provider. In single-model mode llama-server ignores the request's model field, so the client's model name is just a label (it does not have to match --alias). The provider below is named lmstudio only because it uses the generic OpenAI-compatible adapter — it points at this llama-server, not LM Studio:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "local llama-server (ROCmFP4)",
      "options": { "baseURL": "http://<host>:8080/v1", "apiKey": "sk-local" },
      "models": {
        "qwen3.6-27b-mtp": {
          "name": "Qwen 3.6 27B",
          "limit": { "context": 262144, "output": 32768 }
        }
      }
    }
  },
  "model": "lmstudio/qwen3.6-27b-mtp",
  "compaction": { "auto": true, "reserved": 16384 }
}

Project-local opencode.json — disable the task tool so agents don't spawn subagents, keeping the whole session on one cache-friendly context:

{
  "$schema": "https://opencode.ai/config.json",
  "agent": {
    "build": { "tools": { "task": false } },
    "plan":  { "tools": { "task": false } }
  }
}

The fork: PlunderStruck/opencode. compaction.auto summarizes history when the context fills — which in stock OpenCode rewrites the leading prompt and invalidates the cache, forcing a full re-prefill. This fork compacts without breaking the cached prefix (plus a few other adjustments), so cache reuse survives compaction. Paired with the checkpoint flags above, long sessions stay fast and actually usable.

04 · VISION

Qwen3-VL lineage — vision works via the bundled mmproj-F32.gguf projector at launch with --mmproj (no different LLM GGUF needed). It's the Qwen3-VL projector (projection_dim 5120, matches this model's hidden size), shipped in this repo.

# add to your llama-server launch:
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024     # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail

Without --image-min-tokens 1024 the server feeds too few image tokens and the model describes images incorrectly (right gist, wrong detail) — the server even logs a warning at load. Verified: a code label misread at default tokens read correctly once the flag was set.

NOTE // thinking model → for one-shot image Q&A use the bundled template's inline <|think_off|> or allow enough tokens to finish <think>, else the visible answer can come back empty. With --mmproj loaded the server disables the --cache-reuse feature (it logs "cache_reuse is not supported by multimodal"); whether ordinary cross-turn caching still helps with vision isn't something we've benchmarked.

05 · PERFORMANCE & QUALITY

Recommended build = COHERENT (we measured it). We swept the quant recipe and rank by KL divergence vs the BF16 reference on held-out text (lower = more faithful). The all-dual-scale body (COHERENT) beats the fast-body STRIX build at ~the same decode speed:

Build	emb / head / body	Mean KLD vs BF16 ↓	Top-token
`COHERENT-imatrix-embF16-headQ6` ★	f16 / Q6 / all-dual	0.1191	92.26%
`STRIX-imatrix-embF16-headQ6`	f16 / Q6 / fast	0.1261	91.75%

Hands-on observations from daily use on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0) — directional internal checks, not formal benchmarks; reproduce before citing.

DECODE · short context	~33 t/s (Vulkan / Strix Halo)
DECODE · ~140K context	~18 t/s
MTP DRAFT ACCEPTANCE · warm, f16 KV	~0.87–0.90
ARCHITECTURE	hybrid SSM + attention (48 SSM + 17 attention blocks) — only attention layers grow a KV cache, so it degrades gracefully at long context
QUANTIZATION	code-calibrated imatrix + f16 embeddings (measured small win on code)

UPSTREAM BENCHMARK // Base is Qwen's Qwen3.6-27B (via unsloth). Qwen publishes official benchmarks as a figure on the base card — see there. NOT re-measured on this ROCmFP4 quant.

f16 KV is a config choice — full-precision KV is how we run it; 128 GB unified RAM affords it. On less memory drop to -ctk q8_0 -ctv q8_0.

f16 token embeddings were the single change felt the most: raising the token-embedding layer to full precision made the model follow instructions noticeably better — the embedding is the foundation every layer builds on, and the vocab is large, so a faithful embedding pays off at near-zero speed cost (it's a lookup, not a matmul). The code-calibrated imatrix is a free polish on top (same size and speed) — small, in the right direction on code:

Test set (n_ctx=512)	no-imatrix	this (imatrix)
held-out code	1.8631	1.8596
held-out prose	5.7109	5.7165

Tiny improvement on code (the calibration domain), neutral on prose — expected at this bit rate; at 4+ bpw the base quant is already close to the original, so imatrix is a polish, not a transformation.

The Q6-head variant — a step up (experimental). It raises the output head (output.weight) from 4-bit ROCmFP4 to standard Q6_K and leaves everything else untouched. The embedding is the input side; the output head is the output side — sharpening both beats sharpening either. Observed: a further step up in instruction-following beyond the f16 embeddings (reaching for the specific tool asked for, sticking to task rules/format more reliably). Two held-out measurements:

Test set	daily (4-bit head)	Q6 head
held-out code (perplexity)	1.8596	1.8550
held-out prose (perplexity)	5.7165	5.6761
KL vs BF16 (mean, lower=more faithful)	≈0.0369	≈0.0345 (~6% nearer)

The Q6 head improved both code and prose perplexity (the imatrix alone only helped code) and was closer to BF16 on every measure. It still agrees with BF16's top word 96% of the time either way — so the head mostly sharpens confidence on the same choice rather than flipping it. The cost: decode is **5–7% slower** at short context (the head is a fixed per-token cost, so the gap shrinks at long context); size grows ~0.4 GB. Small but consistent gains across two tests and two text types — internal checks, not formal benchmarks; reproduce before citing.

06 · BUILD (REPRODUCIBLE)

Calibration corpus (code_calibration.txt): a concatenation of three files from the froggeric/imatrix dataset — groups_merged.txt + code.txt + technical.txt (~646 KB total) — code-heavy but diverse enough to avoid domain overfitting. The resulting imatrix (qwen3.6-27b-code.imatrix, 339 chunks) is included in this repo.

# 1) importance matrix
llama-imatrix -m Qwen3.6-27B-BF16-00001-of-00002.gguf \
  -f code_calibration.txt -o qwen3.6-27b-code.imatrix \
  -dev Vulkan0 -ngl 999 -fa on -c 512

# 2) quantize: quality-biased STRIX preset + f16 embeddings + imatrix  (daily driver)
llama-quantize \
  --imatrix qwen3.6-27b-code.imatrix \
  --token-embedding-type f16 \
  Qwen3.6-27B-BF16-00001-of-00002.gguf \
  Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf \
  Q4_0_ROCMFP4_STRIX

# 3) headQ6 variant — same as above + one extra flag (--output-tensor-type q6_K)
llama-quantize \
  --imatrix qwen3.6-27b-code.imatrix \
  --token-embedding-type f16 \
  --output-tensor-type q6_K \
  Qwen3.6-27B-BF16-00001-of-00002.gguf \
  Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf \
  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo — hardware/driver/model/prompt-sensitive, may not reproduce on other GPUs. Not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims.

07 · LINEAGE & CREDITS

BASE MODEL	Qwen3.6-27B (Qwen team) — dense, with built-in MTP head (`nextn_predict_layers=1`, so draft-MTP survives quantization). Derivative quant inherits the base model's license.
BF16 GGUF SOURCE	unsloth/Qwen3.6-27B-MTP-GGUF @ `5cb35eb3dcbf52dbce5f87dbc64df6aaffadcace`
FORMAT + RUNTIME	charlie12345/rocmfp4-llama (based on llama.cpp, MIT)
CHAT TEMPLATE	froggeric/Qwen-Fixed-Chat-Templates
CALIBRATION DATA	froggeric/imatrix
AGENT FORK	PlunderStruck/opencode

Derivative quantization — verify the base model's (Qwen3.6) license before redistribution / use.

Downloads last month: 30

GGUF

Model size

0.5B params

Architecture

clip

Hardware compatibility

We're not able to determine the quantization variants.

View all variants

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Atomic-Germ/Qwen3.6-27B-MTP-ROCmFP4-GGUF

Base model

Qwen/Qwen3.6-27B

Quantized

unsloth/Qwen3.6-27B-MTP-GGUF

Quantized

(4)

this model

FORMAT ROCmFP4 4-BIT	PRECISION 4.82 BPW	SIZE 16.5 GB	CONTEXT 262 K
DRAFT MTP n-max 5	VISION QWEN3-VL	BACKEND VULKAN0	CALIBRATION imatrix (CODE)