How to use from Pi
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Brooooooklyn/Qwen3.6-35B-A3B-UD-MXFP4_K_XL-mlx"
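Before configuring Pi, it can help to confirm that the server answers. The sketch below assumes mlx_lm.server is listening on its default port 8080 (the same port that appears in baseUrl in the config) and serves the OpenAI-style /v1/chat/completions route. The helper names buildChatRequest and askLocalServer are illustrative and not part of any library.

```typescript
// Sketch: sanity-check the local server before wiring it into Pi.
// ASSUMPTION: mlx_lm.server is running on localhost:8080 with the model below.
const BASE_URL = 'http://localhost:8080/v1';
const MODEL_ID = 'Brooooooklyn/Qwen3.6-35B-A3B-UD-MXFP4_K_XL-mlx';

interface ChatRequest {
  model: string;
  messages: { role: 'system' | 'user' | 'assistant'; content: string }[];
  max_tokens: number;
}

// Build the JSON body for a single-turn chat completion.
function buildChatRequest(prompt: string, maxTokens = 64): ChatRequest {
  return {
    model: MODEL_ID,
    messages: [{ role: 'user', content: prompt }],
    max_tokens: maxTokens,
  };
}

// Send the request and return the assistant's reply text.
async function askLocalServer(prompt: string): Promise<string> {
  const res = await fetch(`${BASE_URL}/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildChatRequest(prompt)),
  });
  if (!res.ok) throw new Error(`Server returned HTTP ${res.status}`);
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```

Once the server is up, askLocalServer('Say hello.').then(console.log) should print a short reply. A connection error usually means the server is still loading the model or is bound to a different port.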
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Brooooooklyn/Qwen3.6-35B-A3B-UD-MXFP4_K_XL-mlx"
        }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi

Qwen3.6-35B-A3B — UD-MXFP4_K_XL (mlx-node)

MXFP4 (OCP micro-scaling FP4) quantization of Qwen/Qwen3.6-35B-A3B for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.

| | Original (BF16) | UD-Q4_K_XL (affine) | This Model |
|---|---|---|---|
| Size | ~66 GB | 22 GB | 21 GB |
| Format | SafeTensors | SafeTensors | SafeTensors |
| Precision | BF16 uniform | 4-bit affine + BF16 | MXFP4 (E8M0 scales) + 8-bit affine router gates + BF16 |
| FFN group size | n/a | 64 | 32 |
| Biases | n/a | yes | no (FFN); yes (router gates) |

What is MXFP4?

MXFP4 is the Open Compute Project (OCP) micro-scaling FP4 format. Each group of 32 elements shares a single 8-bit E8M0 scale (a power-of-two exponent), and elements themselves are stored as E2M1 FP4 values. Compared to 4-bit affine:

  • Smaller scales: each group stores one uint8 E8M0 scale, half the size of an fp16 affine scale and a quarter of an fp32 one
  • No biases: the zero point is implicit because signed FP4 values are symmetric around zero
  • Hardware-friendly: scale is just an exponent shift, no FP multiply on the scale path
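The metadata savings in the list above can be checked with a little arithmetic. This sketch computes the effective bits per weight for both layouts. The affine figure assumes one fp16 scale and one fp16 bias per group, which is an assumption about storage and not something stated here.

```typescript
// Effective storage cost per weight:
//   element bits + (metadata bits per group / group size)
function bitsPerWeight(
  elementBits: number,
  groupSize: number,
  metadataBitsPerGroup: number,
): number {
  return elementBits + metadataBitsPerGroup / groupSize;
}

// MXFP4: 4-bit E2M1 elements, one 8-bit E8M0 scale per 32 elements, no bias.
const mxfp4 = bitsPerWeight(4, 32, 8);

// 4-bit affine at group size 64.
// ASSUMPTION: one fp16 scale (16 bits) plus one fp16 bias (16 bits) per group.
const affine4 = bitsPerWeight(4, 64, 16 + 16);

console.log(mxfp4);   // 4.25
console.log(affine4); // 4.5
```

Under these assumptions the two formats land within a quarter of a bit of each other, which is consistent with the 22 GB vs. 21 GB sizes in the comparison table.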

For typical LLM weight distributions, MXFP4 retains quality on par with 4-bit affine at a similar bit budget while shrinking the metadata footprint.
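To make the format concrete, here is a minimal reference-style sketch of encoding and decoding one 32-element MXFP4 group. It is not the MLX kernel. It keeps one code per byte for readability (real storage packs two codes per byte), picks the shared exponent as floor(log2(max|x|)) minus the E2M1 maximum exponent of 2, and resolves rounding ties toward the smaller magnitude, where real implementations typically round to nearest even. All names are illustrative.

```typescript
// The eight non-negative magnitudes an E2M1 value can take.
const E2M1_MAGNITUDES = [0, 0.5, 1, 1.5, 2, 3, 4, 6];
const E8M0_BIAS = 127; // E8M0 byte e encodes the scale 2^(e - 127)
const GROUP_SIZE = 32;

interface MxGroup {
  scaleExp: number;  // shared uint8 E8M0 scale
  codes: Uint8Array; // one 4-bit code per element: bit 3 = sign, bits 0-2 = magnitude index
}

// Map a scaled value to the nearest E2M1 code; values beyond +/-6 saturate at 6.
function nearestCode(x: number): number {
  const sign = x < 0 ? 8 : 0;
  const mag = Math.abs(x);
  let best = 0;
  for (let i = 1; i < E2M1_MAGNITUDES.length; i++) {
    if (Math.abs(E2M1_MAGNITUDES[i] - mag) < Math.abs(E2M1_MAGNITUDES[best] - mag)) best = i;
  }
  return sign | best;
}

function quantizeGroup(values: Float32Array): MxGroup {
  if (values.length !== GROUP_SIZE) throw new Error(`expected ${GROUP_SIZE} values`);
  let maxAbs = 0;
  for (let i = 0; i < GROUP_SIZE; i++) maxAbs = Math.max(maxAbs, Math.abs(values[i]));
  // Shared exponent: align the largest element with E2M1's top binade (exponent 2).
  const exp = maxAbs === 0 ? 0 : Math.floor(Math.log2(maxAbs)) - 2;
  const scaleExp = Math.min(254, Math.max(0, exp + E8M0_BIAS)); // 255 is reserved for NaN
  const scale = Math.pow(2, scaleExp - E8M0_BIAS);
  const codes = new Uint8Array(GROUP_SIZE);
  for (let i = 0; i < GROUP_SIZE; i++) codes[i] = nearestCode(values[i] / scale);
  return { scaleExp, codes };
}

function dequantizeGroup(group: MxGroup): Float32Array {
  const scale = Math.pow(2, group.scaleExp - E8M0_BIAS); // a pure exponent shift
  const out = new Float32Array(GROUP_SIZE);
  for (let i = 0; i < GROUP_SIZE; i++) {
    const code = group.codes[i];
    const mag = E2M1_MAGNITUDES[code & 7];
    out[i] = ((code & 8) !== 0 ? -mag : mag) * scale;
  }
  return out;
}
```

Because the scale is always a power of two, dequantization only adjusts the exponent of each element. That is the "no FP multiply on the scale path" property mentioned above.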

Note on router gates: MoE router gates (mlp.gate, mlp.shared_expert_gate) stay 8-bit affine even under --q-mxfp. MXFP4 quantization noise on a 256-expert router flips top-K expert selection and destroys generation quality. mlx-lm hardcodes router gates to affine for the same reason.
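The sensitivity of top-K routing can be illustrated with a toy example. The logits and the perturbation below are made up for illustration: six experts with top-2 routing stand in for the real 256 experts with top-8. The point is only that a small error on two closely ranked logits changes which expert runs.

```typescript
// Return the indices of the k largest logits, sorted ascending for easy comparison.
function topK(logits: number[], k: number): number[] {
  return logits
    .map((value, index) => ({ value, index }))
    .sort((a, b) => b.value - a.value)
    .slice(0, k)
    .map((entry) => entry.index)
    .sort((a, b) => a - b);
}

// Hypothetical router logits at full precision.
const exact = [2.10, 0.30, 1.52, 1.50, -0.40, 0.90];
// The same logits after a small, made-up perturbation standing in for quantization error.
const noisy = [2.08, 0.31, 1.49, 1.53, -0.41, 0.92];

console.log(topK(exact, 2)); // [ 0, 2 ]
console.log(topK(noisy, 2)); // [ 0, 3 ]
```

A perturbation of about 0.03 swaps expert 2 for expert 3. Unlike error in an FFN weight, which slightly shifts the output, a routing flip sends the token through a different expert entirely.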

All Variants

Benchmarked on Apple M3 Max 128GB via examples/lm.ts (best decode tok/s across turns 2–4, steady-state).

Performance

Steady-state decode: 54.4 tok/s on Apple M3 Max 128GB (best of turns 2–4, examples/lm.ts capitals chat with reasoningEffort: 'low').
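The "best of turns 2-4" metric can be sketched as follows. This is not the code in examples/lm.ts. The TurnStats shape and its field names are hypothetical, and the reading that turn 1 is skipped because it carries one-off warm-up cost is an assumption.

```typescript
// Hypothetical per-turn measurements; field names are illustrative only.
interface TurnStats {
  generatedTokens: number;
  decodeSeconds: number;
}

function decodeTokPerSec(turn: TurnStats): number {
  return turn.generatedTokens / turn.decodeSeconds;
}

// Best decode throughput over turns 2-4 (1-based), skipping the first turn.
function bestSteadyState(turns: TurnStats[]): number {
  const steady = turns.slice(1, 4);
  if (steady.length === 0) throw new Error('need at least two turns');
  return Math.max(...steady.map(decodeTokPerSec));
}
```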

Decode is memory-bandwidth bound on Apple Silicon, so fewer bytes read per token translate directly into higher throughput. The MoE architecture activates only 8 of 256 experts per token (~3B active out of 35.9B total parameters), and the compiled C++ forward graph fuses the per-layer dispatch (post-PR ~20% MXFP8 speedup vs the prior Rust forward path).
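A rough roofline makes the bandwidth argument concrete. The sketch below computes an upper bound on decode speed if each token required streaming every active weight from memory exactly once. The 400 GB/s bandwidth and the uniform bits-per-weight values are assumptions for illustration. The bound ignores bf16 and 8-bit tensors, attention over the KV cache, and all compute, so measured throughput lands well below it.

```typescript
// Upper bound on decode tok/s when weight streaming is the only cost.
function decodeCeilingTokPerSec(
  bandwidthGBps: number,        // ASSUMPTION: nominal memory bandwidth of the machine
  activeParamsBillions: number, // ~3 for this model, per the text above
  bitsPerWeight: number,        // average storage cost of an active weight
): number {
  const gigabytesPerToken = (activeParamsBillions * bitsPerWeight) / 8;
  return bandwidthGBps / gigabytesPerToken;
}

// Same hardware, same active parameter count, different weight formats.
const bf16Ceiling = decodeCeilingTokPerSec(400, 3, 16);
const mxfp4Ceiling = decodeCeilingTokPerSec(400, 3, 4.25);

// The ratio depends only on bytes per weight: 16 / 4.25, roughly 3.8x.
console.log((mxfp4Ceiling / bf16Ceiling).toFixed(2));
```

The absolute numbers are not predictions. The useful part is the ratio, which shows why shrinking bytes per weight raises the ceiling by the same factor.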

Per-Tensor Bit Assignments (N=4)

| Weight | Mode | Bits | Group | Rationale |
|---|---|---|---|---|
| embed_tokens | 8-bit affine | 8 | 64 | Loader is affine-only; mxfp upgrade skipped |
| lm_head | 8-bit affine | 8 | 64 | Loader is affine-only; mxfp upgrade skipped |
| self_attn.q/k/v_proj | mxfp4 + AWQ | 4 | 32 | AWQ via input_layernorm |
| linear_attn.in_proj_qkv/z | mxfp4 + AWQ | 4 | 32 | AWQ via input_layernorm |
| self_attn.o_proj | bf16 | - | - | NOT AWQ-correctable |
| linear_attn.out_proj | bf16 | - | - | KLD ~6.0: worst tensor, kept full-precision |
| down_proj | mxfp4 | 4 | 32 | Unsloth UD-Q4 base |
| gate_proj, up_proj | mxfp4 | 4 | 32 | Unsloth UD-Q4 base |
| Router gates (mlp.gate, shared_expert_gate) | 8-bit affine | 8 | 64 | MoE routing accuracy: MXFP4 noise breaks top-K |
| GDN params (A_log, etc) | bf16 | - | - | State-space dynamics |

Quantization Strategy

Built on Unsloth Dynamic 2.0 per-tensor KLD analysis. At --q-bits 4 the unsloth recipe's per-layer bit offsets all snap to 4-bit, then --q-mxfp orthogonally promotes every 4-bit affine decision to MXFP4 (mode="mxfp4", bits=4, group_size=32) — except for keys whose dequantizers are affine-only (lm_head, embed_tokens) and MoE router gates (where MXFP4 quantization noise destroys routing accuracy).

imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead). AWQ-correctable projections (q/k/v, in_proj_qkv/z) get the AWQ pass; non-AWQ-correctable projections (o_proj, out_proj) stay bf16 — their inputs come from attention/GDN computation, not from a norm layer.
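The "zero inference overhead" claim rests on a simple identity: scaling input channel j of a weight matrix by s[j] while dividing the preceding norm's gain by s[j] leaves the layer output unchanged. The toy check below assumes an RMSNorm-style norm with a per-channel gain and uses made-up numbers. It is a sketch of the idea, not mlx-node's implementation.

```typescript
// RMSNorm with a learned per-channel gain.
function rmsNorm(x: number[], gain: number[], eps = 1e-6): number[] {
  const meanSquare = x.reduce((acc, v) => acc + v * v, 0) / x.length;
  const inv = 1 / Math.sqrt(meanSquare + eps);
  return x.map((v, i) => v * inv * gain[i]);
}

// y = W h, with W stored as [outFeatures][inFeatures].
function matVec(w: number[][], h: number[]): number[] {
  return w.map((row) => row.reduce((acc, wij, j) => acc + wij * h[j], 0));
}

// Multiply input channel j of W by s[j] and fold 1/s[j] into the norm gain.
function fuseAwqScales(w: number[][], gain: number[], s: number[]) {
  return {
    w: w.map((row) => row.map((wij, j) => wij * s[j])),
    gain: gain.map((g, j) => g / s[j]),
  };
}

const x = [0.5, -1.0, 2.0];
const gain = [1.0, 1.0, 1.0];
const w = [
  [0.1, 0.2, 0.3],
  [-0.4, 0.5, 0.6],
];
const s = [2, 0.5, 4]; // made-up per-channel importance scales

const before = matVec(w, rmsNorm(x, gain));
const fused = fuseAwqScales(w, gain, s);
const after = matVec(fused.w, rmsNorm(x, fused.gain));

console.log(before, after); // the two outputs agree up to floating-point rounding
```

Quantization is then applied to the scaled weights, so the channels with large s[j] occupy more of the quantization grid. The same identity explains why o_proj and out_proj are excluded: their inputs do not pass through a norm whose gain could absorb the inverse scales.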

Architecture

| Parameter | Value |
|---|---|
| Total parameters | 35.9B (3B active per token) |
| Hidden size | 2,048 |
| Layers | 40 (30 linear + 10 full attention) |
| Attention heads | 16 (2 KV heads, GQA 8:1) |
| Head dimension | 256 |
| Experts | 256 per MoE layer, top-8 routing |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
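These figures allow a back-of-the-envelope KV cache estimate. The sketch assumes that only the 10 full-attention layers keep a per-token KV cache (the linear-attention layers are taken to hold a fixed-size state instead) and that the cache is stored in bf16. Both are assumptions, not statements from this card.

```typescript
const FULL_ATTENTION_LAYERS = 10;
const KV_HEADS = 2;
const HEAD_DIM = 256;
const MAX_CONTEXT = 262144;
const BYTES_PER_ELEMENT = 2; // ASSUMPTION: bf16 KV cache

// Keys and values (factor 2) for every full-attention layer and KV head.
const kvBytesPerToken =
  2 * FULL_ATTENTION_LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT;

const kvGiBAtMaxContext = (kvBytesPerToken * MAX_CONTEXT) / Math.pow(2, 30);

console.log(kvBytesPerToken);   // 20480
console.log(kvGiBAtMaxContext); // 5
```

Under these assumptions a full 262,144-token context costs about 5 GiB of cache on top of the weights. The small figure follows from the 8:1 GQA ratio and from only a quarter of the layers using full attention.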

Usage

import { loadSession } from '@mlx-node/lm';

const session = await loadSession('./Qwen3.6-35B-A3B-UD-MXFP4_K_XL-mlx');

for await (const event of session.sendStream('Explain MXFP4 vs 4-bit affine quantization.', {
  config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
  if (!event.done) process.stdout.write(event.text);
}
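When the full reply is needed as a string instead of streamed to stdout, a small helper can collect the chunks. The StreamEvent shape below is inferred from the snippet above (done and text fields); the actual type exported by @mlx-node/lm may carry more fields. fakeStream is a stand-in for trying the helper without loading a model.

```typescript
// Event shape inferred from the usage snippet; treat it as an assumption.
interface StreamEvent {
  done: boolean;
  text?: string;
}

// Concatenate the text of every non-final event in the stream.
async function collectText(stream: AsyncIterable<StreamEvent>): Promise<string> {
  let out = '';
  for await (const event of stream) {
    if (!event.done && event.text) out += event.text;
  }
  return out;
}

// Stand-in stream that mimics the assumed event shape.
async function* fakeStream(): AsyncGenerator<StreamEvent> {
  yield { done: false, text: 'Hello' };
  yield { done: false, text: ', world' };
  yield { done: true };
}

collectText(fakeStream()).then((reply) => console.log(reply)); // Hello, world
```

With a real session, the call would presumably look like await collectText(session.sendStream(prompt, { config })), using the same arguments as the streaming example.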

How It Was Made

mlx convert \
  -i Qwen3.6-35B-A3B \
  -o Qwen3.6-35B-A3B-UD-MXFP4_K_XL-mlx \
  -q --q-bits 4 --q-mxfp --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf

--q-mxfp is mlx-node's MXFP toggle: starting from affine baseline decisions (from the recipe), it promotes 8-bit → MXFP8 and 4-bit → MXFP4 at group_size=32, while leaving non-quantized layers (bf16) and MoE router gates untouched. It is orthogonal to recipes — combine with any of unsloth, qwen3_5, mixed_* to inherit per-layer bit selection.
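The promotion rule can be written down as a small decision function. This is a sketch of the behavior described in this card, not mlx-node's source. The Decision type, the helper names, and the convention that tensor names are passed without a .weight suffix are all hypothetical.

```typescript
type Decision =
  | { mode: 'affine'; bits: 4 | 8; groupSize: number }
  | { mode: 'mxfp4'; bits: 4; groupSize: 32 }
  | { mode: 'mxfp8'; bits: 8; groupSize: 32 }
  | { mode: 'bf16' };

// Keys whose dequantizers are affine-only.
function isAffineOnly(name: string): boolean {
  return name.endsWith('embed_tokens') || name.endsWith('lm_head');
}

// MoE router gates, kept affine to protect top-K routing.
function isRouterGate(name: string): boolean {
  return name.endsWith('mlp.gate') || name.endsWith('mlp.shared_expert_gate');
}

// Apply the MXFP toggle on top of a recipe's baseline decision.
function applyMxfpToggle(name: string, base: Decision): Decision {
  if (base.mode !== 'affine') return base; // bf16 layers are left untouched
  if (isAffineOnly(name) || isRouterGate(name)) return base;
  return base.bits === 4
    ? { mode: 'mxfp4', bits: 4, groupSize: 32 }
    : { mode: 'mxfp8', bits: 8, groupSize: 32 };
}
```

Matching with endsWith keeps mlp.gate_proj from being mistaken for the router gate mlp.gate. Because the function only rewrites the baseline decision, it composes with any recipe that produces one.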

Acknowledgments

License

Apache 2.0 (inherited from base model).
