▗▇▇▇▇▇▇▇▖
▗█▘▝██████▖
▗▛ ▝██████▆▆▆▆▆▆▆▆▆▆▅
▟▛ ▗█████████████████▙▖
▄▄▄▄▄▟▛ ▟████████████████████▖
▗██▌ ▚▖ ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘
▗████▖ ▜▖ ▗█▘
▜█████▙ ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙
▜█████▙ ▝████████████▛ ▜▙
▜█████▙ ▝██████████▛ ▃ ▜▙
▀█████▙▖ ▝████████▘ ▟█▙ ▀▙
▝██████▖ ▝▜█████▘ ▟███▙▂▂▂▂▐█
▟███████▖ ▜███▘ ▗███████████▛
▟█████████▄ ▜▛ ▗███████████▀
▝█████▀ ▗▛ ▗██████▀▀▀▀▀▘
▜██▘ ▗▛ ▟█████▛▘
▜█▇▇▇▇▇▇▇▇▇█▖ ▟█████▛
▝█▖ ▟█████▛
▝███████▀
FORMAT ROCmFP4 4-BIT |
PRECISION 4.82 BPW |
SIZE 16.5 GB |
CONTEXT 262 K |
DRAFT MTP n-max 5 |
VISION QWEN3-VL |
BACKEND VULKAN0 |
CALIBRATION imatrix (CODE) |
The custom
q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, Ollama, Jan, or koboldcpp. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix:
git clone https://github.com/charlie12345/rocmfp4-llamacd rocmfp4-llama && git checkout mtp-rocmfp4-strixenv JOBS=16 scripts/build-strix-rocmfp4-mtp.sh
All three share f16 embeddings + the code-calibrated imatrix + MTP head. The COHERENT build adds the all-dual-scale body — lowest measured KL vs BF16 at ~the same decode speed, so it's the recommended default; the STRIX builds use the faster single-scale body and differ only in their output head. Repo also bundles the mmproj-F32.gguf Qwen3-VL vision projector, chat_template.jinja (froggeric's unified Qwen3.6 template — tool calls + inline <|think_off|>/<|think_on|> + vision), and the qwen3.6-27b-code.imatrix (339 chunks) for exact reproduction.
Run from the folder holding the .gguf + chat_template.jinja:
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
-m Qwen3.6-27B-MTP-ROCmFP4-COHERENT-imatrix-embF16-headQ6.gguf \
--alias qwen27b-mtp \
--host 0.0.0.0 \
--port 8080 \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-c 262144 \
-b 2048 \
-ub 256 \
-t 16 \
-tb 16 \
-ctk f16 \
-ctv f16 \
-cpent 256 \
-ctxcp 32 \
--cache-reuse 256 \
--cache-ram 65536 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--spec-type draft-mtp \
--spec-draft-device Vulkan0 \
--spec-draft-ngl all \
--spec-draft-type-k f16 \
--spec-draft-type-v f16 \
--spec-draft-n-max 5 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.0 \
--spec-draft-p-split 0.10 \
--chat-template-file chat_template.jinja \
--reasoning on \
--reasoning-format deepseek \
--chat-template-kwargs '{"preserve_thinking": true}' \
--jinja \
--parallel 1 \
--metrics \
--no-mmap \
--mmproj mmproj-F32.gguf \
--image-min-tokens 1024
The last two lines enable vision — the mmproj-F32.gguf Qwen3-VL projector is bundled in this repo (projection_dim 5120); omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set (see §04).
Multi-turn prompt-cache reuse is what makes this usable. Qwen3.6's recurrent (SSM) state can't be partially rewound, so multi-turn reuse needs a context checkpoint at/before the divergence point. Two defaults otherwise force a full re-prefill every turn — both fixed by the flags above:
- Checkpoint cadence. Default
-cpentis 8192, so prompts under 8K never get a usable checkpoint. Fix:-cpent 256 -ctxcp 32 --cache-reuse 256(checkpoint every 256 tokens, keep 32, reuse a matching prefix of ≥256 tokens). Verified: a shared 3,000-token prefix re-prefill dropped 12.4 s → ~0.1 s. - Thinking text breaking the prefix match.
--reasoning-formatcontrols where<think>goes.deepseek(used here) gives cleancontent+reasoning_content, auto-paired with--chat-template-kwargs '{"preserve_thinking": true}'so the template keeps<think>for all turns and reuse holds (with OpenCode the large stable leading context reuses via checkpoints regardless).noneleaves<think>inline incontentso any content-echoing client gets reuse;deepseek-legacy/autodo not reuse. - Vision +
--cache-reuse. With--mmprojloaded the server disables the--cache-reusefeature (it logs "cache_reuse is not supported by multimodal"); we haven't measured whether ordinary cross-turn caching survives with vision (see §04).
--jinja is required so the chat template (and preserve_thinking) apply.
OpenCode — point it at the server as an OpenAI-compatible provider. In single-model mode llama-server ignores the request's model field, so the client's model name is just a label (it does not have to match --alias). The provider below is named lmstudio only because it uses the generic OpenAI-compatible adapter — it points at this llama-server, not LM Studio:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"lmstudio": {
"npm": "@ai-sdk/openai-compatible",
"name": "local llama-server (ROCmFP4)",
"options": { "baseURL": "http://<host>:8080/v1", "apiKey": "sk-local" },
"models": {
"qwen3.6-27b-mtp": {
"name": "Qwen 3.6 27B",
"limit": { "context": 262144, "output": 32768 }
}
}
}
},
"model": "lmstudio/qwen3.6-27b-mtp",
"compaction": { "auto": true, "reserved": 16384 }
}
Project-local opencode.json — disable the task tool so agents don't spawn subagents, keeping the whole session on one cache-friendly context:
{
"$schema": "https://opencode.ai/config.json",
"agent": {
"build": { "tools": { "task": false } },
"plan": { "tools": { "task": false } }
}
}
The fork: PlunderStruck/opencode. compaction.auto summarizes history when the context fills — which in stock OpenCode rewrites the leading prompt and invalidates the cache, forcing a full re-prefill. This fork compacts without breaking the cached prefix (plus a few other adjustments), so cache reuse survives compaction. Paired with the checkpoint flags above, long sessions stay fast and actually usable.
Qwen3-VL lineage — vision works via the bundled mmproj-F32.gguf projector at launch with --mmproj (no different LLM GGUF needed). It's the Qwen3-VL projector (projection_dim 5120, matches this model's hidden size), shipped in this repo.
# add to your llama-server launch:
--mmproj mmproj-F32.gguf \
--image-min-tokens 1024 # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail
Without --image-min-tokens 1024 the server feeds too few image tokens and the model describes images incorrectly (right gist, wrong detail) — the server even logs a warning at load. Verified: a code label misread at default tokens read correctly once the flag was set.
<|think_off|> or allow enough tokens to finish <think>, else the visible answer can come back empty. With --mmproj loaded the server disables the --cache-reuse feature (it logs "cache_reuse is not supported by multimodal"); whether ordinary cross-turn caching still helps with vision isn't something we've benchmarked.
Recommended build = COHERENT (we measured it). We swept the quant recipe and rank by KL divergence vs the BF16 reference on held-out text (lower = more faithful). The all-dual-scale body (COHERENT) beats the fast-body STRIX build at ~the same decode speed:
Hands-on observations from daily use on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0) — directional internal checks, not formal benchmarks; reproduce before citing.
f16 KV is a config choice — full-precision KV is how we run it; 128 GB unified RAM affords it. On less memory drop to -ctk q8_0 -ctv q8_0.
f16 token embeddings were the single change felt the most: raising the token-embedding layer to full precision made the model follow instructions noticeably better — the embedding is the foundation every layer builds on, and the vocab is large, so a faithful embedding pays off at near-zero speed cost (it's a lookup, not a matmul). The code-calibrated imatrix is a free polish on top (same size and speed) — small, in the right direction on code:
Tiny improvement on code (the calibration domain), neutral on prose — expected at this bit rate; at 4+ bpw the base quant is already close to the original, so imatrix is a polish, not a transformation.
The Q6-head variant — a step up (experimental). It raises the output head (output.weight) from 4-bit ROCmFP4 to standard Q6_K and leaves everything else untouched. The embedding is the input side; the output head is the output side — sharpening both beats sharpening either. Observed: a further step up in instruction-following beyond the f16 embeddings (reaching for the specific tool asked for, sticking to task rules/format more reliably). Two held-out measurements:
The Q6 head improved both code and prose perplexity (the imatrix alone only helped code) and was closer to BF16 on every measure. It still agrees with BF16's top word 96% of the time either way — so the head mostly sharpens confidence on the same choice rather than flipping it. The cost: decode is **5–7% slower** at short context (the head is a fixed per-token cost, so the gap shrinks at long context); size grows ~0.4 GB. Small but consistent gains across two tests and two text types — internal checks, not formal benchmarks; reproduce before citing.
Calibration corpus (code_calibration.txt): a concatenation of three files from the froggeric/imatrix dataset — groups_merged.txt + code.txt + technical.txt (~646 KB total) — code-heavy but diverse enough to avoid domain overfitting. The resulting imatrix (qwen3.6-27b-code.imatrix, 339 chunks) is included in this repo.
# 1) importance matrix
llama-imatrix -m Qwen3.6-27B-BF16-00001-of-00002.gguf \
-f code_calibration.txt -o qwen3.6-27b-code.imatrix \
-dev Vulkan0 -ngl 999 -fa on -c 512
# 2) quantize: quality-biased STRIX preset + f16 embeddings + imatrix (daily driver)
llama-quantize \
--imatrix qwen3.6-27b-code.imatrix \
--token-embedding-type f16 \
Qwen3.6-27B-BF16-00001-of-00002.gguf \
Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf \
Q4_0_ROCMFP4_STRIX
# 3) headQ6 variant — same as above + one extra flag (--output-tensor-type q6_K)
llama-quantize \
--imatrix qwen3.6-27b-code.imatrix \
--token-embedding-type f16 \
--output-tensor-type q6_K \
Qwen3.6-27B-BF16-00001-of-00002.gguf \
Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf \
Q4_0_ROCMFP4_STRIX
Experimental research build for AMD Strix Halo — hardware/driver/model/prompt-sensitive, may not reproduce on other GPUs. Not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims.
Derivative quantization — verify the base model's (Qwen3.6) license before redistribution / use.
- Downloads last month
- 30
We're not able to determine the quantization variants.