PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO ยท gfx1151
            โ–—โ–‡โ–‡โ–‡โ–‡โ–‡โ–‡โ–‡โ––                 
           โ–—โ–ˆโ–˜โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ––                
          โ–—โ–›   โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–…     
         โ–Ÿโ–›    โ–—โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–™โ––   
   โ–„โ–„โ–„โ–„โ–„โ–Ÿโ–›    โ–Ÿโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ––  
 โ–—โ–ˆโ–ˆโ–Œ    โ–šโ––   โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–”โ–ˆโ–˜  
โ–—โ–ˆโ–ˆโ–ˆโ–ˆโ––    โ–œโ––                    โ–—โ–ˆโ–˜   
โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–™    โ–œโ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–†โ–€โ–€โ–€โ–€โ–€โ–œโ–™    
 โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–™    โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–›       โ–œโ–™   
  โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–™    โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–›    โ–ƒ    โ–œโ–™  
   โ–€โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–™โ––   โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–˜    โ–Ÿโ–ˆโ–™    โ–€โ–™ 
    โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ––   โ–โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–˜    โ–Ÿโ–ˆโ–ˆโ–ˆโ–™โ–‚โ–‚โ–‚โ–‚โ–โ–ˆ
    โ–Ÿโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ––    โ–œโ–ˆโ–ˆโ–ˆโ–˜   โ–—โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–›
   โ–Ÿโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–„    โ–œโ–›    โ–—โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–€ 
  โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–€        โ–—โ–›    โ–—โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–€โ–€โ–€โ–€โ–€โ–˜  
    โ–œโ–ˆโ–ˆโ–˜        โ–—โ–›    โ–Ÿโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–›โ–˜        
     โ–œโ–ˆโ–‡โ–‡โ–‡โ–‡โ–‡โ–‡โ–‡โ–‡โ–‡โ–ˆโ––   โ–Ÿโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–›          
                โ–โ–ˆโ–– โ–Ÿโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–›           
                 โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–€            
QWEN3-CODER-NEXT
4-BIT ROCmFP4 ยท 80B-A3B MoE ยท CODE-WEIGHTED IMATRIX ยท AGENTIC CODER ยท SINGLE AMD APU
FORMAT
ROCmFP4 4-BIT
PRECISION
~4.5 BPW
ARCH
QWEN3NEXT
CONTEXT
262 K
PARAMS
80B ยท A3B MoE
DRAFT
NO MTP
BACKEND
VULKAN0
LICENSE
APACHE-2.0
โš  REQUIRES THE ROCmFP4 FORK
The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama ยท branch mtp-rocmfp4-strix.
NOTE // Ignore HuggingFace's auto-detected "F16"/16-bit badge โ€” its parser can't read ROCmFP4 and mislabels the file. These are ~4.5 bpw 4-bit ROCmFP4 files; pick by filename in Files and versions.

Experimental AMD Strix Halo (gfx1151) quant of Qwen3-Coder-Next โ€” Qwen's agentic coding model (80B total / 3B active high-sparsity MoE, hybrid Gated-DeltaNet attention, arch qwen3next, 262K context) โ€” in the custom ROCmFP4 4-bit format, imatrix-quantized with a code-weighted importance matrix.

01 ยท FILES
File Output head Pick if
โ€ฆ-STRIX-embQ8-imatrix-headQ6.gguf โ˜…Q6_Kthe one build โ€” best speed/quality balance: Q8 embeddings + Q6 output head on the fast single-scale body

One file โ€” the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt โ€” Q8 token embeddings (matching the Q8 source exactly) and a Q6_K output head โ€” on the fast single-scale q4_0_rocmfp4_fast body + a code-weighted imatrix. Not the most faithful possible (see the fidelity link in ยง04) โ€” it's the point where speed and quality meet best. The DeltaNet-specific tensors (ssm_conv1d, ssm_a, norms, router) stay F32; MoE experts + attention/SSM projections are 4-bit ROCmFP4.

NOTE // Q8 embeddings (not f16): the source is Q8_0, so Q8 matches its precision exactly โ€” f16 would be fake-f16 bloat for zero gain (embeddings are a lookup, not a matmul).
02 ยท QUICK START

Run from the folder holding the .gguf (the Qwen ChatML template is baked in โ€” just pass --jinja):

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \
  --alias coder-next \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ctk q8_0 \
  -ctv q8_0 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap
Flag Function
HSA_OVERRIDE_GFX_VERSION=11.5.1treat the APU as gfx1151 (Strix Halo)
GGML_HIP_ENABLE_UNIFIED_MEMORY=1allow use of the full 128 GB unified memory
-dev Vulkan0run on Vulkan โ€” fastest backend for ROCmFP4 on Strix Halo
-ngl 999 ยท -fa onoffload all layers ยท flash attention
-c 262144context length (256K)
-b 2048 ยท -ub 256 ยท -t/-tb 16prefill batch / micro-batch ยท CPU threads
-ctk q8_0 ยท -ctv q8_0q8_0 (8-bit) KV cache โ€” how we run it; drop to q4_0 to use less memory, or raise to f16
-cpent ยท -ctxcp ยท --cache-reuse ยท --cache-ram 65536cross-turn KV checkpointing + 64 GB resident reuse cache
--temp 0.7 --top-p 0.8 --top-k 20Qwen-Coder recommended sampling
--jinja --parallel 1 --metrics --no-mmapapply baked ChatML template ยท single slot ยท metrics ยท weights in RAM
NOTE // No --spec-* / --spec-type draft-mtp flags โ€” this arch has no MTP head (see ยง04). It's already fast on its own.
03 ยท AGENTIC CODING / TOOLS

Qwen3-Coder-Next is an agentic coder โ€” built to call tools, not narrate code. To wire it up:

  • Chat template: Qwen (ChatML) is baked into the GGUF โ€” just pass --jinja and your client applies it automatically.
  • Tool calling: enable the qwen3_coder tool-call parser in your client (e.g. the matching parser flag in llama-server / your agent harness). Without it, native tool calls won't be parsed and the model tends to narrate code instead of calling tools.
  • Sampling: temp 0.7, top-p 0.8, top-k 20 (Qwen-Coder recommended) โ€” already set in ยง02.
NOTE // The cross-turn reuse cache (--cache-reuse / --cache-ram) keeps long agentic sessions cheap โ€” the leading prompt isn't re-prefilled every turn.
04 ยท PERFORMANCE & QUALITY
DECODE ยท short context~54 t/s (Vulkan / Ryzen AI Max+ 395)
SPECULATIVE DECODEnone (no MTP head)
LONG CONTEXTcheap โ€” DeltaNet near-constant memory
QUANTIZATIONfast single-scale body + Q8 emb + Q6 head + code-weighted imatrix (measured win โ€” below)

This is the best speed/quality balance in ROCmFP4 โ€” by design, not the absolute fastest. On top of the imatrix + Q8 emb + Q6 head, we swept the body kernel against the Q8 source by KL divergence (the right fidelity metric). An all-dual-scale body did edge the fast single-scale body on KL, but the gain sat inside the measurement noise while costing decode speed โ€” so the fast single-scale body + Q8 embeddings + Q6 head is the right point, and the one file we ship.

This mirrors the fuller sweep on our Qwen3.6-27B sibling, where every higher-precision body lever (all-dual-scale, selective Q5/Q6 bumps) bought a KL improvement inside the noise at a real speed cost โ€” and where copying an entire dynamic-quant high-precision allocation onto ROCmFP4 still couldn't match a true dynamic K-quant, because FP4 is intrinsically less faithful than Q4_K's 4-bit. The same format limit applies here: within ROCmFP4, fast body + Q8 emb + Q6 head is the optimal balance; for maximum fidelity reach for a dynamic K-quant of the base (box below). (Directional internal measurements โ€” KL vs Q8 on held-out code; reproduce before citing.)

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Grab a Q6_K / Q8 dynamic GGUF of the base from Qwen/Qwen3-Coder-Next โ€” higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab.

Fast even without speculative decoding. 3B active params + linear Gated-DeltaNet attention โ†’ ~54 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0), and cheap long context. No MTP needed.

NOTE // NO MTP Qwen3-Coder-Next ships without an MTP head, and the ROCmFP4 fork currently wires MTP drafting only for the qwen35/qwen35moe archs, not qwen3next. So these are no-MTP (non-speculative) builds โ€” in practice it doesn't matter, it's fast on its own.

The imatrix โ€” code-weighted, and measured (a clean win here). Quantized with an importance matrix built from a code-weighted calibration mix (~2.6:1 code:general): real multi-language source + code-analysis prompts from eaddario/imatrix-calibration, plus Kalomaze's groups_merged (via froggeric/imatrix) for general.

KL-divergence + perplexity vs the Q8 reference on a held-out code slice (disjoint from calibration), imatrix vs no-imatrix:

Metric (vs Q8, held-out code) No-imatrix Imatrix Change
Median KLD0.005970.00478โˆ’20%
90th-pct KLD0.13420.1083โˆ’19%
RMS ฮ”p8.14%7.36%โˆ’10%
Same top token as Q891.01%91.49%+0.48 pp
Mean PPL3.45563.4686+0.013 (within ยฑ0.077 noise โ€” a wash)

So the imatrix measurably improves quantization fidelity to the full model on code (median KL โˆ’20%, the gold-standard metric), at zero cost (same size/speed). PPL is a statistical wash. Honest scope: this is a fidelity-vs-Q8 measurement on ~20 K tokens of held-out code, not an absolute coding benchmark.

NOTE // On "dual imatrix": a plain merge of two imatrices is mathematically identical to concatenating the corpora at the same ratio โ€” the only real lever is the code:general ratio, which is what's set here. True size-decoupled balancing would need normalized-merge tooling; not used.
05 ยท BUILD (REPRODUCIBLE)
# code-weighted imatrix on the Q8 (single pass; ratio = the real lever)
llama-imatrix -m Qwen3-Coder-Next-Q8_0.gguf -f code-weighted-calib.txt -o coder-next.imatrix -c 512 -ngl 999

# quant -> ROCmFP4 with the imatrix (Q8 embeddings) + Q6 output head โ€” the โ˜… file (ยง01)
# fast single-scale body; --output-tensor-type q6_K raises the output head to Q6_K
llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix coder-next.imatrix \
  Qwen3-Coder-Next-Q8_0.gguf  Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo โ€” hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 ยท LINEAGE & CREDITS
BASE MODELQwen/Qwen3-Coder-Next (Apache-2.0, Qwen team) ยท 80B-A3B MoE, arch qwen3next
CALIBRATIONeaddario/imatrix-calibration (code) ยท Kalomaze groups_merged via froggeric/imatrix (general)
FORMAT + RUNTIMEcharlie12345/rocmfp4-llama (based on llama.cpp, MIT)

Derivative quantization โ€” verify the base model's license before redistribution / use.

Downloads last month
1,410
GGUF
Model size
80B params
Architecture
qwen3next
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Quantized
(96)
this model