Qwen3.6-27B — UD-MXFP8_K_XL (mlx-node)

MXFP8 (OCP micro-scaling FP8) quantization of Qwen/Qwen3.6-27B for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.

| | Original (BF16) | UD-Q8_K_XL (affine) | This Model |
|---|---|---|---|
| Size | ~66 GB | 30 GB | 29 GB |
| Format | SafeTensors | SafeTensors | SafeTensors |
| Precision | BF16 uniform | 8-bit affine + BF16 | MXFP8 (E8M0 scales) + BF16 |
| FFN group size | | 64 | 32 |
| Biases | | yes | no |

What is MXFP8?

MXFP8 is the Open Compute Project (OCP) micro-scaling FP8 format. Each group of 32 elements shares a single 8-bit E8M0 scale (a power-of-two exponent), and elements themselves are stored as E4M3 FP8 values. Compared to 8-bit affine:

  • Smaller scales: one uint8 E8M0 byte per scale, versus 2 bytes (fp16) or 4 bytes (fp32) for an affine scale
  • No biases: E4M3 values are signed, so no per-group zero-point is stored
  • Hardware-friendly: scale is just an exponent shift, no FP multiply on the scale path
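
The encoding described above can be decoded by hand. The following is a minimal TypeScript sketch, assuming the OCP MX convention in which an E8M0 byte e encodes 2^(e-127) (with 0xFF reserved for NaN) and E4M3 uses an exponent bias of 7 with no infinities. The function names are illustrative and do not reflect how mlx-node implements dequantization.

```typescript
// Decode an E8M0 shared scale: an unsigned 8-bit exponent with bias 127.
function decodeE8M0(byte: number): number {
  if (byte === 0xff) return NaN; // reserved NaN encoding
  return Math.pow(2, byte - 127);
}

// Decode one E4M3 element: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
function decodeE4M3(byte: number): number {
  const sign = (byte & 0x80) !== 0 ? -1 : 1;
  const exp = (byte >> 3) & 0x0f;
  const man = byte & 0x07;
  if (exp === 0x0f && man === 0x07) return NaN; // the only NaN pattern; no infinities
  if (exp === 0) return sign * (man / 8) * Math.pow(2, -6); // subnormal
  return sign * (1 + man / 8) * Math.pow(2, exp - 7); // normal
}

// Dequantize one block: every element is multiplied by the shared power-of-two scale.
function dequantizeBlock(scaleByte: number, elements: Uint8Array): Float32Array {
  const scale = decodeE8M0(scaleByte);
  const out = new Float32Array(elements.length);
  elements.forEach((b, i) => {
    out[i] = scale * decodeE4M3(b);
  });
  return out;
}
```

Because the scale is a pure power of two, multiplying by it only shifts the exponent of each decoded element, which is the property the "hardware-friendly" bullet refers to.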

For typical LLM weight distributions, MXFP8 retains quality on par with 8-bit affine while shrinking the metadata footprint.
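
To make the metadata saving concrete, here is a small sketch of the effective bits per weight for the two 8-bit layouts. The group sizes and the presence of biases come from the comparison table; the 16-bit width of each affine scale and bias is an assumption, as is the helper name `bitsPerWeight`.

```typescript
// Effective storage cost per weight: element bits plus per-group metadata
// amortized over the group.
function bitsPerWeight(
  elementBits: number,
  groupSize: number,
  metadataBitsPerGroup: number,
): number {
  return elementBits + metadataBitsPerGroup / groupSize;
}

// 8-bit affine, group size 64, one 16-bit scale + one 16-bit bias (assumed widths).
const affine8 = bitsPerWeight(8, 64, 16 + 16);

// MXFP8, group size 32, one 8-bit E8M0 scale and no bias.
const mxfp8 = bitsPerWeight(8, 32, 8);

console.log(affine8, mxfp8); // 8.5 8.25
```

Under these assumptions the metadata overhead drops from 0.5 to 0.25 bits per weight. The overall file sizes differ by less than this ratio suggests because several tensors stay in BF16 or 8-bit affine in both variants.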

All Variants

| Repo | GGUF Equivalent | Size | Decode (tok/s) |
|---|---|---|---|
| Brooooooklyn/Qwen3.6-27B-UD-Q2_K_XL-mlx | UD-Q2_K_XL | 15 GB | 20.0 |
| Brooooooklyn/Qwen3.6-27B-UD-Q3_K_XL-mlx | UD-Q3_K_XL | 18 GB | 16.2 |
| Brooooooklyn/Qwen3.6-27B-UD-NVFP4_K_XL-mlx | | 21 GB | 14.9 |
| Brooooooklyn/Qwen3.6-27B-UD-MXFP4_K_XL-mlx | | 21 GB | 15.9 |
| Brooooooklyn/Qwen3.6-27B-UD-Q4_K_XL-mlx | UD-Q4_K_XL | 21 GB | 15.3 |
| Brooooooklyn/Qwen3.6-27B-UD-Q5_K_XL-mlx | UD-Q5_K_XL | 25 GB | 13.4 |
| Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx | UD-Q6_K_XL | 27 GB | 12.4 |
| Brooooooklyn/Qwen3.6-27B-UD-MXFP8_K_XL-mlx (this model) | | 29 GB | 10.5 |
| Brooooooklyn/Qwen3.6-27B-UD-Q8_K_XL-mlx | UD-Q8_K_XL | 30 GB | 9.9 |

Benchmarked on Apple M3 Max 128GB via examples/lm.ts (best decode tok/s across turns 2–4, steady-state).

Performance

Steady-state decode: 10.5 tok/s on Apple M3 Max 128GB (best of turns 2–4, examples/lm.ts capitals chat with reasoningEffort: 'low'). Decode is memory-bandwidth bound on Apple Silicon, so fewer bytes read per token translate directly into higher throughput.
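
One rough way to check the bandwidth-bound claim is to multiply a variant's size by its decode rate: if every weight is read once per token, the product approximates the memory traffic sustained per second. The sketch below uses a few sizes and rates from the All Variants table. The one-read-per-token model is a simplification that ignores KV cache and activation traffic, and the `Variant` type and helper name are illustrative.

```typescript
interface Variant {
  name: string;
  sizeGB: number; // on-disk size from the All Variants table (rounded)
  tokPerSec: number; // steady-state decode rate from the same table
}

const variants: Variant[] = [
  { name: "UD-Q2_K_XL", sizeGB: 15, tokPerSec: 20.0 },
  { name: "UD-Q4_K_XL", sizeGB: 21, tokPerSec: 15.3 },
  { name: "UD-MXFP8_K_XL", sizeGB: 29, tokPerSec: 10.5 },
  { name: "UD-Q8_K_XL", sizeGB: 30, tokPerSec: 9.9 },
];

// Approximate sustained weight traffic, assuming each byte is read once per token.
function effectiveBandwidthGBs(v: Variant): number {
  return v.sizeGB * v.tokPerSec;
}

for (const v of variants) {
  console.log(`${v.name}: ~${effectiveBandwidthGBs(v).toFixed(0)} GB/s`);
}
```

Across the full table the products stay in a fairly narrow band (roughly 290 to 335 GB/s with the rounded sizes), which is the pattern a bandwidth-bound decode would produce.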

Per-Tensor Bit Assignments (N=8)

| Weight | Mode | Bits | Group | Rationale |
|---|---|---|---|---|
| embed_tokens | 8-bit affine | 8 | 64 | Loader is affine-only; mxfp upgrade skipped |
| lm_head | 8-bit affine | 8 | 64 | Loader is affine-only; mxfp upgrade skipped |
| self_attn.q/k/v_proj | mxfp8 + AWQ | 8 | 32 | KLD ~1.5–2.9, AWQ via input_layernorm |
| linear_attn.in_proj_qkv/z | mxfp8 + AWQ | 8 | 32 | KLD ~2.9, AWQ via input_layernorm |
| self_attn.o_proj | bf16 | | | NOT AWQ-correctable |
| linear_attn.out_proj | bf16 | | | KLD ~6.0; worst tensor, kept full-precision |
| down_proj | mxfp8 | 8 | 32 | "Slightly more sensitive" |
| gate_proj, up_proj | mxfp8 | 8 | 32 | base bits |
| GDN params (A_log, etc) | bf16 | | | State-space dynamics |

Quantization Strategy

Built on Unsloth Dynamic 2.0 per-tensor KLD analysis. At --q-bits 8 the unsloth recipe's per-layer bit offsets all snap to 8-bit, then --q-mxfp orthogonally promotes every 8-bit affine decision to MXFP8 (mode="mxfp8", bits=8, group_size=32) — except for keys whose dequantizers are affine-only (lm_head, embed_tokens).

imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead). AWQ-correctable projections (q/k/v, in_proj_qkv/z) get the AWQ pass; non-AWQ-correctable projections (o_proj, out_proj) stay bf16 — their inputs come from attention/GDN computation, not from a norm layer.
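
AWQ scales can be fused at zero inference cost because of an algebraic identity: multiplying input channel c of a weight matrix by s[c] and dividing the preceding norm's gain for that channel by s[c] leaves the layer output unchanged. Below is a toy sketch of that identity with plain arrays. The function names and the tiny matrices are illustrative only, and real AWQ derives s from activation statistics (here, the imatrix) rather than picking it by hand.

```typescript
type Matrix = number[][];

// y = W · x for a row-major matrix.
function matVec(W: Matrix, x: number[]): number[] {
  return W.map((row) => row.reduce((acc, w, c) => acc + w * (x[c] ?? 0), 0));
}

// Per-channel gain of a norm layer applied to an already-normalized input.
function applyGain(xNormed: number[], gain: number[]): number[] {
  return xNormed.map((v, c) => v * (gain[c] ?? 1));
}

// Scale weight columns by s and fold 1/s into the preceding norm's gain.
function fuseAwqScales(W: Matrix, gain: number[], s: number[]) {
  const scaledW = W.map((row) => row.map((w, c) => w * (s[c] ?? 1)));
  const fusedGain = gain.map((g, c) => g / (s[c] ?? 1));
  return { scaledW, fusedGain };
}

const W: Matrix = [
  [0.5, -2],
  [1.5, 0.25],
];
const gain = [1, 2];
const s = [4, 0.5]; // hand-picked per-channel scales, for illustration
const xNormed = [0.75, -1.25];

const before = matVec(W, applyGain(xNormed, gain));
const { scaledW, fusedGain } = fuseAwqScales(W, gain, s);
const after = matVec(scaledW, applyGain(xNormed, fusedGain));

console.log(before, after); // the two outputs match
```

In practice the quantizer is applied to the scaled weights, so the channels that were amplified lose relatively less to rounding. The identity also shows why o_proj and out_proj are excluded: with no norm layer directly in front of them, there is nowhere to fold the inverse scale.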

Architecture

| Parameter | Value |
|---|---|
| Total parameters | 27.4B (dense, all active) |
| Hidden size | 5,120 |
| Layers | 64 (48 linear + 16 full attention) |
| Attention heads | 24 (4 KV heads, GQA 6:1) |
| Head dimension | 256 |
| Intermediate size | 17,408 |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
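
The figures in this table allow a back-of-the-envelope estimate of KV cache growth. The sketch assumes that only the 16 full-attention layers keep a per-token key/value cache, that the cache is stored in 16-bit precision, and that the linear-attention layers hold a fixed-size state that does not grow with context. None of these storage details are stated in this card, so treat the result as an estimate.

```typescript
const fullAttentionLayers = 16; // from the Layers row
const kvHeads = 4; // from the Attention heads row
const headDim = 256; // from the Head dimension row
const bytesPerValue = 2; // assumption: 16-bit KV cache

// One key vector and one value vector per KV head, per full-attention layer.
function kvBytesPerToken(): number {
  return fullAttentionLayers * 2 * kvHeads * headDim * bytesPerValue;
}

// Total KV cache size in GiB for a given context length.
function kvCacheGiB(contextTokens: number): number {
  return (kvBytesPerToken() * contextTokens) / Math.pow(2, 30);
}

console.log(kvBytesPerToken()); // 65536 bytes, i.e. 64 KiB per token
console.log(kvCacheGiB(262144)); // 16 GiB at the maximum context
```

Under these assumptions a full 262,144-token context would add about 16 GiB on top of the weights.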

Usage

import { loadSession } from '@mlx-node/lm';

// Load the quantized model from a local directory.
const session = await loadSession('./Qwen3.6-27B-UD-MXFP8_K_XL-mlx');

// Stream the reply and print each text chunk as it arrives.
for await (const event of session.sendStream('Explain the hybrid attention mechanism in Qwen3.6.', {
  config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
  if (!event.done) process.stdout.write(event.text);
}

How It Was Made

mlx convert \
  -i Qwen3.6-27B \
  -o Qwen3.6-27B-UD-MXFP8_K_XL-mlx \
  -q --q-bits 8 --q-mxfp --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf

--q-mxfp is mlx-node's MXFP toggle: starting from affine baseline decisions (from the recipe), it promotes 8-bit → MXFP8 and 4-bit → MXFP4 at group_size=32, while leaving non-quantized layers (bf16) untouched. It is orthogonal to recipes — combine with any of unsloth, qwen3_5, mixed_* to inherit per-layer bit selection.
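
As a mental model of the promotion rule described above, here is a hypothetical TypeScript sketch. The `QuantDecision` type, the `promoteToMxfp` function, the `"mxfp4"` mode string, and the affine-only key list are invented for illustration and are not mlx-node's actual code or API. Leaving bit widths other than 4 and 8 unchanged is also an assumption.

```typescript
// Hypothetical per-tensor decision produced by a recipe.
type QuantDecision =
  | { mode: "bf16" }
  | { mode: "affine"; bits: number; groupSize: number }
  | { mode: "mxfp4" | "mxfp8"; bits: 4 | 8; groupSize: 32 };

// Keys the card says stay affine because their loader is affine-only.
const AFFINE_ONLY_KEYS = ["lm_head", "embed_tokens"];

function promoteToMxfp(key: string, d: QuantDecision): QuantDecision {
  if (d.mode !== "affine") return d; // bf16 (or already-MXFP) tensors are untouched
  if (AFFINE_ONLY_KEYS.some((k) => key.includes(k))) return d; // upgrade skipped
  if (d.bits === 8) return { mode: "mxfp8", bits: 8, groupSize: 32 };
  if (d.bits === 4) return { mode: "mxfp4", bits: 4, groupSize: 32 };
  return d; // assumption: other bit widths stay affine
}

console.log(promoteToMxfp("model.layers.0.mlp.down_proj", { mode: "affine", bits: 8, groupSize: 64 }));
console.log(promoteToMxfp("lm_head", { mode: "affine", bits: 8, groupSize: 64 }));
```

The point of the sketch is the ordering: the recipe decides bits per tensor first, and the MXFP toggle only rewrites the storage format of decisions that are already 4-bit or 8-bit affine.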

Acknowledgments

License

Apache 2.0 (inherited from base model).
