How to use from the MLX library
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("Brooooooklyn/Qwen3.6-35B-A3B-UD-MXFP8_K_XL-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Qwen3.6-35B-A3B — UD-MXFP8_K_XL (mlx-node)

MXFP8 (OCP micro-scaling FP8) quantization of Qwen/Qwen3.6-35B-A3B for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.

| | Original (BF16) | UD-Q8_K_XL (affine) | This Model |
|---|---|---|---|
| Size | ~66 GB | 36 GB | 35 GB |
| Format | SafeTensors | SafeTensors | SafeTensors |
| Precision | BF16 uniform | 8-bit affine + BF16 | MXFP8 (E8M0 scales) + 8-bit affine router gates + BF16 |
| FFN group size | | 64 | 32 |
| Biases | | yes | no (FFN); yes (router gates) |

What is MXFP8?

MXFP8 is the Open Compute Project (OCP) micro-scaling FP8 format. Each group of 32 elements shares a single 8-bit E8M0 scale (a power-of-two exponent), and elements themselves are stored as E4M3 FP8 values. Compared to 8-bit affine:

  • Half the scale storage: uint8 E8M0 vs. fp16/fp32 affine scales
  • No biases: zero-point implicit (FP8 covers ±range)
  • Hardware-friendly: scale is just an exponent shift, no FP multiply on the scale path

For typical LLM weight distributions, MXFP8 retains quality on par with 8-bit affine while shrinking the metadata footprint.
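
The group layout described above can be sketched in plain Python. This is a simplified illustration, not the MLX or mlx-node kernel: it assumes the shared exponent is chosen as floor(log2(max |x|)) minus the largest E4M3 exponent (8), round-to-nearest-even on the elements, and saturation at ±448. The function names are made up for this example.

```python
import math

E4M3_MAX = 448.0          # largest finite E4M3 magnitude
E4M3_EMAX = 8             # exponent of that maximum: 448 = 1.75 * 2**8
E4M3_MIN_NORMAL_EXP = -6  # below this, values use the fixed subnormal step
GROUP_SIZE = 32


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**ex with 0.5 <= m < 1, so floor(log2(a)) == ex - 1.
    exponent = max(math.frexp(a)[1] - 1, E4M3_MIN_NORMAL_EXP)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def quantize_group(values):
    """Return (scale_byte, elements) for one group of 32 floats."""
    assert len(values) == GROUP_SIZE
    amax = max(abs(v) for v in values)
    # Assumed scale rule: align the group's largest value with E4M3's top binade.
    shared_exp = 0 if amax == 0.0 else math.frexp(amax)[1] - 1 - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    elements = [round_to_e4m3(v / scale) for v in values]
    return shared_exp + 127, elements  # E8M0 stores a biased exponent, no mantissa


def dequantize_group(scale_byte, elements):
    """Undo quantize_group: multiply each element by the power-of-two scale."""
    scale = 2.0 ** (scale_byte - 127)
    return [e * scale for e in elements]


weights = [0.01 * (i - 16) for i in range(GROUP_SIZE)]
scale_byte, fp8 = quantize_group(weights)
restored = dequantize_group(scale_byte, fp8)
worst = max(abs(r - w) for r, w in zip(restored, weights))
print(f"scale byte = {scale_byte}, worst absolute error = {worst:.5f}")
```

Because the scale is a pure power of two, dequantization only shifts the exponent of each element, which is the "no FP multiply on the scale path" property listed above.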

Note on router gates: MoE router gates (mlp.gate, mlp.shared_expert_gate) stay 8-bit affine even under --q-mxfp. MXFP8 quantization noise on a 256-expert router induces ~10× higher relative error than affine, which flips top-K expert selection and destroys generation quality. mlx-lm hardcodes router gates to affine for the same reason.

All Variants

Benchmarked on Apple M3 Max 128GB via examples/lm.ts (best decode tok/s across turns 2–4, steady-state).

Performance

Steady-state decode: 47.6 tok/s on Apple M3 Max 128GB (best of turns 2–4, examples/lm.ts capitals chat with reasoningEffort: 'low').

Decode is memory-bandwidth bound on Apple Silicon — fewer bytes per token directly translates to higher throughput. The MoE architecture activates only 8 of 256 experts per token (~3B active out of 35.9B total), and the compiled C++ forward graph fuses the per-layer dispatch (post-PR ~20% MXFP8 speedup vs the prior Rust forward path).
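
As a rough check on the bandwidth argument, the arithmetic below estimates a decode ceiling from bytes read per token. It is a back-of-envelope sketch under stated assumptions: a nominal 400 GB/s memory bandwidth for this M3 Max configuration, every active weight read exactly once per token, fp16 scale and bias for the affine case, and no accounting for bf16 tensors, KV cache, or activations. The real ceiling is therefore lower than what this prints.

```python
def decode_ceiling_tok_s(active_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on decode speed if each active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token


ACTIVE_PARAMS = 3e9        # "~3B active" per token, from the text above
ASSUMED_BANDWIDTH = 400e9  # assumption: nominal peak bandwidth in bytes/s

# MXFP8: 8-bit element plus one 8-bit E8M0 scale shared by 32 elements.
MXFP8_BITS = 8 + 8 / 32
# 8-bit affine at group size 64, assuming an fp16 scale and an fp16 bias per group.
AFFINE8_BITS = 8 + (16 + 16) / 64

mxfp8_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, MXFP8_BITS, ASSUMED_BANDWIDTH)
affine_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, AFFINE8_BITS, ASSUMED_BANDWIDTH)
print(f"MXFP8 ceiling:  {mxfp8_ceiling:.1f} tok/s")
print(f"affine ceiling: {affine_ceiling:.1f} tok/s")
```

The gap between the two ceilings comes only from metadata: 0.25 extra bits per weight for MXFP8 versus 0.5 for affine under these assumptions.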

Per-Tensor Bit Assignments (N=8)

| Weight | Mode | Bits | Group | Rationale |
|---|---|---|---|---|
| embed_tokens | 8-bit affine | 8 | 64 | Loader is affine-only; mxfp upgrade skipped |
| lm_head | 8-bit affine | 8 | 64 | Loader is affine-only; mxfp upgrade skipped |
| self_attn.q/k/v_proj | mxfp8 + AWQ | 8 | 32 | KLD ~1.5–2.9, AWQ via input_layernorm |
| linear_attn.in_proj_qkv/z | mxfp8 + AWQ | 8 | 32 | KLD ~2.9, AWQ via input_layernorm |
| self_attn.o_proj | bf16 | | | NOT AWQ-correctable |
| linear_attn.out_proj | bf16 | | | KLD ~6.0 (worst tensor) |
| down_proj | mxfp8 | 8 | 32 | "Slightly more sensitive" |
| gate_proj, up_proj | mxfp8 | 8 | 32 | base bits |
| Router gates (mlp.gate, shared_expert_gate) | 8-bit affine | 8 | 64 | MoE routing accuracy; MXFP8 noise breaks top-K |
| GDN params (A_log, etc) | bf16 | | | State-space dynamics |

Quantization Strategy

Built on Unsloth Dynamic 2.0 per-tensor KLD analysis. At --q-bits 8 the unsloth recipe's per-layer bit offsets all snap to 8-bit, then --q-mxfp orthogonally promotes every 8-bit affine decision to MXFP8 (mode="mxfp8", bits=8, group_size=32) — except for keys whose dequantizers are affine-only (lm_head, embed_tokens) and MoE router gates (where MXFP8 quantization noise destroys routing accuracy).

imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead). AWQ-correctable projections (q/k/v, in_proj_qkv/z) get the AWQ pass; non-AWQ-correctable projections (o_proj, out_proj) stay bf16 — their inputs come from attention/GDN computation, not from a norm layer.
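
The "zero inference overhead" claim rests on a simple identity: scaling input channel j of a projection by s[j] and dividing the preceding norm weight by s[j] leaves the output unchanged before quantization. The sketch below checks that identity on toy numbers. The scale values are made up for illustration (real AWQ scales come from imatrix statistics), and the RMSNorm is a generic textbook form, not the model's exact layer.

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    """Generic RMSNorm: normalize x by its root mean square, then apply gamma."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]


def matvec(W, x):
    """Dense matrix-vector product with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def fuse_awq_scales(W, gamma, s):
    """Scale input channel j of W by s[j] and fold 1/s[j] into the norm weight."""
    W_scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
    gamma_fused = [g / sj for g, sj in zip(gamma, s)]
    return W_scaled, gamma_fused


x = [0.3, -1.2, 0.7, 2.0]
gamma = [1.0, 0.9, 1.1, 1.0]
W = [[0.5, -0.25, 0.1, 0.0],
     [0.2, 0.4, -0.3, 0.6]]
s = [2.0, 1.0, 4.0, 0.5]  # hypothetical per-channel importance scales

y_reference = matvec(W, rms_norm(x, gamma))
W_scaled, gamma_fused = fuse_awq_scales(W, gamma, s)
y_fused = matvec(W_scaled, rms_norm(x, gamma_fused))
print(y_reference, y_fused)
```

This also shows why o_proj and out_proj cannot use the trick: their inputs do not come from a norm layer, so there is no gamma to absorb the inverse scales.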

Architecture

| Parameter | Value |
|---|---|
| Total parameters | 35.9B (3B active per token) |
| Hidden size | 2,048 |
| Layers | 40 (30 linear + 10 full attention) |
| Attention heads | 16 (2 KV heads, GQA 8:1) |
| Head dimension | 256 |
| Experts | 256 per MoE layer, top-8 routing |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
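
These figures allow a quick estimate of KV cache growth. The sketch assumes a BF16 cache (2 bytes per element) and that only the 10 full-attention layers keep a per-token cache, with the linear-attention layers holding a fixed-size state instead. Neither assumption is stated in this card.

```python
FULL_ATTENTION_LAYERS = 10
KV_HEADS = 2
HEAD_DIM = 256
MAX_CONTEXT = 262_144
BYTES_PER_ELEMENT = 2  # assumption: BF16 cache entries

# One key vector and one value vector per KV head, per full-attention layer.
kv_bytes_per_token = FULL_ATTENTION_LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_ELEMENT
kv_gib_at_max_context = kv_bytes_per_token * MAX_CONTEXT / 2**30

print(f"{kv_bytes_per_token} bytes per token")
print(f"{kv_gib_at_max_context:.1f} GiB at {MAX_CONTEXT} tokens")
```

Under these assumptions the cache costs 20 KiB per token, or 5 GiB at the full 262,144-token context.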

Usage

import { loadSession } from '@mlx-node/lm';

const session = await loadSession('./Qwen3.6-35B-A3B-UD-MXFP8_K_XL-mlx');

for await (const event of session.sendStream('Explain MXFP8 vs 8-bit affine quantization.', {
  config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
  if (!event.done) process.stdout.write(event.text);
}

How It Was Made

mlx convert \
  -i Qwen3.6-35B-A3B \
  -o Qwen3.6-35B-A3B-UD-MXFP8_K_XL-mlx \
  -q --q-bits 8 --q-mxfp --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf

--q-mxfp is mlx-node's MXFP toggle: starting from affine baseline decisions (from the recipe), it promotes 8-bit → MXFP8 and 4-bit → MXFP4 at group_size=32, while leaving non-quantized layers (bf16) and MoE router gates untouched. It is orthogonal to recipes — combine with any of unsloth, qwen3_5, mixed_* to inherit per-layer bit selection.
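
The promotion rule can be pictured as a small function applied to each per-tensor decision. The sketch below is illustrative only: `promote_to_mxfp`, the dictionary layout, and the key names are hypothetical and do not reflect mlx-node's actual source. It encodes only the rules stated in this card.

```python
# Hypothetical module-path suffixes and names, chosen for this example.
ROUTER_GATE_SUFFIXES = ("mlp.gate", "mlp.shared_expert_gate")
AFFINE_ONLY_NAMES = ("lm_head", "embed_tokens")
MXFP_MODE_BY_BITS = {8: "mxfp8", 4: "mxfp4"}


def promote_to_mxfp(key, decision):
    """Map an affine recipe decision to MXFP; None means the tensor stays bf16."""
    if decision is None:
        return None  # non-quantized layers are left untouched
    if key.endswith(ROUTER_GATE_SUFFIXES):
        return decision  # router gates stay affine
    if any(name in key for name in AFFINE_ONLY_NAMES):
        return decision  # affine-only loader, upgrade skipped
    mode = MXFP_MODE_BY_BITS.get(decision["bits"])
    if mode is None or decision["mode"] != "affine":
        return decision  # no MXFP equivalent at this bit width
    return {"mode": mode, "bits": decision["bits"], "group_size": 32}


affine8 = {"mode": "affine", "bits": 8, "group_size": 64}
for key in ("model.layers.0.mlp.experts.down_proj",
            "model.layers.0.mlp.gate",
            "lm_head"):
    print(key, "->", promote_to_mxfp(key, affine8))
```

Matching on the full suffix keeps `mlp.gate_proj` (an FFN projection, promoted) distinct from `mlp.gate` (the router, kept affine).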

Acknowledgments

License

Apache 2.0 (inherited from base model).
