Instructions to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF",
	filename="Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
./llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Use Docker

docker model run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

LM Studio
Jan
Ollama
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Ollama:
```
ollama run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
```

Unsloth Studio

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
```

Lemonade

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Run and chat with the model

lemonade run user.Qwen3-Coder-Next-ROCmFP4-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3-CODER-NEXT
4-BIT ROCmFP4 · 80B-A3B MoE · CODE-WEIGHTED IMATRIX · AGENTIC CODER · SINGLE AMD APU

    
      FORMAT
ROCmFP4 4-BIT

      PRECISION
~4.5 BPW

      ARCH
QWEN3NEXT

      CONTEXT
262 K

    

      PARAMS
80B · A3B MoE

      DRAFT
NO MTP

      BACKEND
VULKAN0

      LICENSE
APACHE-2.0

    

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.

NOTE // Ignore HuggingFace's auto-detected "F16"/16-bit badge — its parser can't read ROCmFP4 and mislabels the file. These are ~4.5 bpw 4-bit ROCmFP4 files; pick by filename in Files and versions.

Experimental AMD Strix Halo (gfx1151) quant of Qwen3-Coder-Next — Qwen's agentic coding model (80B total / 3B active high-sparsity MoE, hybrid Gated-DeltaNet attention, arch qwen3next, 262K context) — in the custom ROCmFP4 4-bit format, imatrix-quantized with a code-weighted importance matrix.

01 · FILES

File	Output head	Pick if
`…-STRIX-embQ8-imatrix-headQ6.gguf` ★	Q6_K	the one build — best speed/quality balance: Q8 embeddings + Q6 output head on the fast single-scale body

One file — the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt — Q8 token embeddings (matching the Q8 source exactly) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + a code-weighted imatrix. Not the most faithful possible (see the fidelity link in §04) — it's the point where speed and quality meet best. The DeltaNet-specific tensors (ssm_conv1d, ssm_a, norms, router) stay F32; MoE experts + attention/SSM projections are 4-bit ROCmFP4.

NOTE // Q8 embeddings (not f16): the source is Q8_0, so Q8 matches its precision exactly — f16 would be fake-f16 bloat for zero gain (embeddings are a lookup, not a matmul).

02 · QUICK START

Run from the folder holding the .gguf (the Qwen ChatML template is baked in — just pass --jinja):

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \
  --alias coder-next \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ctk q8_0 \
  -ctv q8_0 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	allow use of the full 128 GB unified memory
`-dev Vulkan0`	run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
`-ngl 999 · -fa on`	offload all layers · flash attention
`-c 262144`	context length (256K)
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch · CPU threads
`-ctk q8_0 · -ctv q8_0`	q8_0 (8-bit) KV cache — how we run it; drop to `q4_0` to use less memory, or raise to `f16`
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing + 64 GB resident reuse cache
`--temp 0.7 --top-p 0.8 --top-k 20`	Qwen-Coder recommended sampling
`--jinja --parallel 1 --metrics --no-mmap`	apply baked ChatML template · single slot · metrics · weights in RAM

NOTE // No --spec-* / --spec-type draft-mtp flags — this arch has no MTP head (see §04). It's already fast on its own.

03 · AGENTIC CODING / TOOLS

Qwen3-Coder-Next is an agentic coder — built to call tools, not narrate code. To wire it up:

Chat template: Qwen (ChatML) is baked into the GGUF — just pass --jinja and your client applies it automatically.
Tool calling: enable the qwen3_coder tool-call parser in your client (e.g. the matching parser flag in llama-server / your agent harness). Without it, native tool calls won't be parsed and the model tends to narrate code instead of calling tools.
Sampling: temp 0.7, top-p 0.8, top-k 20 (Qwen-Coder recommended) — already set in §02.

NOTE // The cross-turn reuse cache (--cache-reuse / --cache-ram) keeps long agentic sessions cheap — the leading prompt isn't re-prefilled every turn.

04 · PERFORMANCE & QUALITY

DECODE · short context	~54 t/s (Vulkan / Ryzen AI Max+ 395)
SPECULATIVE DECODE	none (no MTP head)
LONG CONTEXT	cheap — DeltaNet near-constant memory
QUANTIZATION	fast single-scale body + Q8 emb + Q6 head + code-weighted imatrix (measured win — below)

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. On top of the imatrix + Q8 emb + Q6 head, we swept the body kernel against the Q8 source by KL divergence (the right fidelity metric). An all-dual-scale body did edge the fast single-scale body on KL, but the gain sat inside the measurement noise while costing decode speed — so the fast single-scale body + Q8 embeddings + Q6 head is the right point, and the one file we ship.

This mirrors the fuller sweep on our Qwen3.6-27B sibling, where every higher-precision body lever (all-dual-scale, selective Q5/Q6 bumps) bought a KL improvement inside the noise at a real speed cost — and where copying an entire dynamic-quant high-precision allocation onto ROCmFP4 still couldn't match a true dynamic K-quant, because FP4 is intrinsically less faithful than Q4_K's 4-bit. The same format limit applies here: within ROCmFP4, fast body + Q8 emb + Q6 head is the optimal balance; for maximum fidelity reach for a dynamic K-quant of the base (box below). (Directional internal measurements — KL vs Q8 on held-out code; reproduce before citing.)

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Grab a Q6_K / Q8 dynamic GGUF of the base from Qwen/Qwen3-Coder-Next — higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab.

Fast even without speculative decoding. 3B active params + linear Gated-DeltaNet attention → ~54 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0), and cheap long context. No MTP needed.

NOTE // NO MTP Qwen3-Coder-Next ships without an MTP head, and the ROCmFP4 fork currently wires MTP drafting only for the qwen35/qwen35moe archs, not qwen3next. So these are no-MTP (non-speculative) builds — in practice it doesn't matter, it's fast on its own.

The imatrix — code-weighted, and measured (a clean win here). Quantized with an importance matrix built from a code-weighted calibration mix (~2.6:1 code:general): real multi-language source + code-analysis prompts from eaddario/imatrix-calibration, plus Kalomaze's groups_merged (via froggeric/imatrix) for general.

KL-divergence + perplexity vs the Q8 reference on a held-out code slice (disjoint from calibration), imatrix vs no-imatrix:

Metric (vs Q8, held-out code)	No-imatrix	Imatrix	Change
Median KLD	0.00597	0.00478	−20%
90th-pct KLD	0.1342	0.1083	−19%
RMS Δp	8.14%	7.36%	−10%
Same top token as Q8	91.01%	91.49%	+0.48 pp
Mean PPL	3.4556	3.4686	+0.013 (within ±0.077 noise — a wash)

So the imatrix measurably improves quantization fidelity to the full model on code (median KL −20%, the gold-standard metric), at zero cost (same size/speed). PPL is a statistical wash. Honest scope: this is a fidelity-vs-Q8 measurement on ~20 K tokens of held-out code, not an absolute coding benchmark.

NOTE // On "dual imatrix": a plain merge of two imatrices is mathematically identical to concatenating the corpora at the same ratio — the only real lever is the code:general ratio, which is what's set here. True size-decoupled balancing would need normalized-merge tooling; not used.

05 · BUILD (REPRODUCIBLE)

# code-weighted imatrix on the Q8 (single pass; ratio = the real lever)
llama-imatrix -m Qwen3-Coder-Next-Q8_0.gguf -f code-weighted-calib.txt -o coder-next.imatrix -c 512 -ngl 999

# quant -> ROCmFP4 with the imatrix (Q8 embeddings) + Q6 output head — the ★ file (§01)
# fast single-scale body; --output-tensor-type q6_K raises the output head to Q6_K
llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix coder-next.imatrix \
  Qwen3-Coder-Next-Q8_0.gguf  Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 · LINEAGE & CREDITS

BASE MODEL	Qwen/Qwen3-Coder-Next (Apache-2.0, Qwen team) · 80B-A3B MoE, arch `qwen3next`
CALIBRATION	eaddario/imatrix-calibration (code) · Kalomaze `groups_merged` via froggeric/imatrix (general)
FORMAT + RUNTIME	charlie12345/rocmfp4-llama (based on llama.cpp, MIT)

Derivative quantization — verify the base model's license before redistribution / use.

Downloads last month: 1,410

GGUF

Model size

80B params

Architecture

qwen3next

Hardware compatibility

We're not able to determine the quantization variants.

View all variants

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Base model

Qwen/Qwen3-Coder-Next

Quantized

(96)

this model

FORMAT ROCmFP4 4-BIT	PRECISION ~4.5 BPW	ARCH QWEN3NEXT	CONTEXT 262 K
PARAMS 80B · A3B MoE	DRAFT NO MTP	BACKEND VULKAN0	LICENSE APACHE-2.0