Gemma-4-26B-A4B-IT — UD-Q3_K_XL (mlx-node)

3-bit affine quantization of google/gemma-4-26b-a4b-it for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.

| | Original (BF16) | UD-Q3_K_XL (this model) |
|---|---|---|
| Size | ~49 GB | 14 GB |
| Format | SafeTensors | SafeTensors |
| Precision | BF16 uniform | 3-bit affine + mixed bits + BF16 |
| FFN group size | | 64 |
| Biases | | yes |
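
As a rough sanity check on the sizes above, the effective bits per weight can be estimated from the size ratio alone. This is a minimal sketch, assuming both sizes cover the same set of tensors and that the original stores 16 bits per weight; `effectiveBitsPerWeight` is a hypothetical helper, not part of mlx-node.

```typescript
// Effective bits per weight implied by the two on-disk sizes.
// Assumption: both files contain the same tensors and the original uses 16 bits per weight.
function effectiveBitsPerWeight(
  originalGB: number,
  quantizedGB: number,
  originalBits = 16,
): number {
  return originalBits * (quantizedGB / originalGB);
}

const bpw = effectiveBitsPerWeight(49, 14);
console.log(bpw.toFixed(2)); // about 4.57
```

The result sits above the nominal 3 bits because several tensors use 4, 5 or 8 bits, self_attn.o_proj stays in BF16, and every group of 64 weights also stores its own scale and bias.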

All Variants

Benchmarked on Apple M3 Max 128GB via examples/lm.ts (best decode tok/s across turns 2–4, steady-state, capitals chat with reasoningEffort: 'low').

Note: No Q2 variant is published. Gemma-4-26B-A4B-IT has only ~4B active parameters per token, which leaves too little redundancy for 2-bit quantization to remain coherent. Both the unsloth and mixed_2_6 recipes produced gibberish at Q2 on this model.

Performance

Steady-state decode: 60.6 tok/s on Apple M3 Max 128GB (best of turns 2–4, examples/lm.ts capitals chat with reasoningEffort: 'low'). Decode is memory-bandwidth bound on Apple Silicon, so fewer bytes read per token translate directly into higher throughput. The MoE architecture activates only the top-K of 128 experts per token (~4B active out of ~26B total), and the compiled C++ forward graph fuses the per-layer dispatch.
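
The bandwidth argument can be made concrete with a back-of-the-envelope ceiling: if each decoded token has to stream the active weights from memory once, throughput cannot exceed bandwidth divided by bytes per token. The sketch below uses hypothetical inputs (400 GB/s of memory bandwidth and 4e9 active parameters); none of these numbers are measurements from this model card.

```typescript
// Upper bound on decode speed when every token streams the active weights once.
// All inputs are assumptions for illustration, not measured values.
function decodeCeilingTokPerSec(
  bandwidthGBps: number,
  activeParams: number,
  bitsPerWeight: number,
): number {
  const bytesPerToken = (activeParams * bitsPerWeight) / 8;
  return (bandwidthGBps * 1e9) / bytesPerToken;
}

// Hypothetical inputs: 400 GB/s bandwidth, 4e9 active parameters.
const at16 = decodeCeilingTokPerSec(400, 4e9, 16); // 50 tok/s
const at4 = decodeCeilingTokPerSec(400, 4e9, 4); // 200 tok/s
console.log(at16, at4);
```

A real run lands below such a ceiling because it also reads the KV cache, the BF16 tensors and the quantization scales, and spends time on attention and sampling.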

Per-Tensor Bit Assignments (N=3)

| Weight | Mode | Bits | Group | Rationale |
|---|---|---|---|---|
| embed_tokens | 5-bit affine | 5 | 64 | Tied with lm_head (Gemma4 shares weights); affine-only loader |
| self_attn.q_proj | 5-bit affine | 5 | 64 | AWQ-corrected via input_layernorm |
| self_attn.k_proj | 5-bit affine | 5 | 64 | AWQ-corrected via input_layernorm |
| self_attn.v_proj | 5-bit affine | 5 | 64 | AWQ-corrected via input_layernorm (only on full-attention layers) |
| mlp.gate_proj | 3-bit affine | 3 | 64 | Shared dense MLP (top-level default) |
| mlp.up_proj | 3-bit affine | 3 | 64 | Shared dense MLP (top-level default) |
| mlp.down_proj | 4-bit affine | 4 | 64 | Shared dense MLP; "slightly more sensitive" (unsloth base+1) |
| experts.switch_glu.gate_proj | 3-bit affine | 3 | 64 | MoE expert gate (per-expert across all 128); base bits (top-level default) |
| experts.switch_glu.up_proj | 3-bit affine | 3 | 64 | MoE expert up (per-expert across all 128); base bits (top-level default) |
| experts.switch_glu.down_proj | 4-bit affine | 4 | 64 | MoE expert down (per-expert across all 128 + routing); unsloth base+1 |
| router.proj | 8-bit affine | 8 | 64 | MoE routing: low-bit noise breaks top-K expert selection |
| self_attn.o_proj | bf16 | | | NOT AWQ-correctable; kept full-precision |
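
To illustrate what "N-bit affine" with a group size of 64 means in the table above, here is a minimal sketch of affine quantization for a single group: each group stores integer codes plus one scale and one bias, and weights are reconstructed as `code * scale + bias`. The helper names are hypothetical, and the real packing and rounding in MLX may differ.

```typescript
interface QuantizedGroup {
  q: number[]; // integer codes in [0, 2^bits - 1]
  scale: number;
  bias: number;
}

// Quantize one group of weights with an affine (scale + bias) code.
function quantizeGroup(w: number[], bits: number): QuantizedGroup {
  const levels = 2 ** bits - 1;
  const min = Math.min(...w);
  const max = Math.max(...w);
  // Fall back to scale 1 for a constant group to avoid dividing by zero.
  const scale = max > min ? (max - min) / levels : 1;
  const q = w.map((x) => Math.round((x - min) / scale));
  return { q, scale, bias: min };
}

// Reconstruct approximate weights from the codes.
function dequantizeGroup(g: QuantizedGroup): number[] {
  return g.q.map((v) => v * g.scale + g.bias);
}

// Example: 64 evenly spaced weights in [-1, 1], quantized at 3 and 5 bits.
const group = Array.from({ length: 64 }, (_, i) => -1 + (2 * i) / 63);
const maxErr = (bits: number): number => {
  const deq = dequantizeGroup(quantizeGroup(group, bits));
  return Math.max(...group.map((x, i) => Math.abs(x - deq[i])));
};
console.log(maxErr(3), maxErr(5)); // each error is bounded by scale / 2
```

Moving from 3 to 5 bits shrinks the step size from (max - min) / 7 to (max - min) / 31, which is why the more sensitive tensors in the table are given the wider codes.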

Quantization Strategy

Built on Unsloth Dynamic 2.0 per-tensor KLD analysis. At --q-bits 3 the unsloth recipe assigns base bits to the MLP gate/up projections (the bulk of the parameter budget) and base+1 to down_proj (slightly more sensitive). Attention q/k/v projections get base+2 (snapped to a valid bit width) plus AWQ pre-scaling, embed_tokens gets base+2, and the routing-critical paths get base+3 (capped/snapped). self_attn.o_proj stays in bf16 because it is AWQ-uncorrectable: its inputs come from the attention compute, not from a norm layer. The MoE router (router.proj) is forced to 8-bit affine to preserve top-K expert selection accuracy.
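
The assignment rule described above can be sketched as a small lookup. This is an illustration that reproduces the per-tensor table for base 3, not the actual recipe code: `assignBits`, `snapBits` and the `VALID_BITS` set are hypothetical, and snapping upward to the next supported width is an assumption.

```typescript
// Assumed set of supported bit widths (hypothetical for this sketch).
const VALID_BITS = [2, 3, 4, 5, 6, 8];

// Snap a requested width up to the nearest supported width, capped at the largest.
function snapBits(requested: number): number {
  return VALID_BITS.find((b) => b >= requested) ?? VALID_BITS[VALID_BITS.length - 1];
}

type Assignment = number | 'bf16';

// Map a tensor name to its precision for a given base bit width.
function assignBits(tensorName: string, base: number): Assignment {
  if (tensorName.endsWith('self_attn.o_proj')) return 'bf16'; // kept full-precision
  if (tensorName.endsWith('router.proj')) return 8; // forced to 8-bit
  if (/self_attn\.[qkv]_proj$/.test(tensorName)) return snapBits(base + 2);
  if (tensorName.endsWith('embed_tokens')) return snapBits(base + 2);
  if (tensorName.endsWith('down_proj')) return snapBits(base + 1);
  return snapBits(base); // gate/up projections use the base width
}

console.log(assignBits('mlp.gate_proj', 3)); // 3
console.log(assignBits('experts.switch_glu.down_proj', 3)); // 4
console.log(assignBits('self_attn.q_proj', 3)); // 5
```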

imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead).
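
The "zero inference overhead" claim rests on a simple identity: scaling input channel j of a weight matrix by s[j] while dividing the preceding norm's gain by s[j] leaves the layer output unchanged. The sketch below checks this on a tiny example. It assumes a plain elementwise norm gain, and `linearAfterGain` and `fuseScales` are hypothetical helpers rather than mlx-node functions.

```typescript
// y = W * (g * x), with g applied elementwise: a linear layer fed by a norm with per-channel gain g.
function linearAfterGain(W: number[][], g: number[], x: number[]): number[] {
  return W.map((row) => row.reduce((acc, w, j) => acc + w * g[j] * x[j], 0));
}

// Scale input channel j of W by s[j] and fold 1 / s[j] into the norm gain.
function fuseScales(W: number[][], g: number[], s: number[]) {
  return {
    W: W.map((row) => row.map((w, j) => w * s[j])),
    g: g.map((gj, j) => gj / s[j]),
  };
}

const W = [
  [1, 2],
  [3, 4],
];
const g = [1, 0.5];
const x = [2, -1];
const s = [2, 4]; // hypothetical per-channel importance scales

const before = linearAfterGain(W, g, x);
const fused = fuseScales(W, g, s);
const after = linearAfterGain(fused.W, fused.g, x);
console.log(before, after); // both are [1, 4]
```

The quantizer then operates on the scaled weights, so the amplified channels lose relatively less to rounding. This also suggests why self_attn.o_proj is treated differently: it has no preceding norm to absorb the inverse scales.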

Architecture

| Parameter | Value |
|---|---|
| Total parameters | 26B (4B active per token) |
| Hidden size | 2,816 |
| Layers | 30 (sliding-window attention) |
| Attention heads | 16 (8 KV heads, GQA 2:1) |
| Head dimension | 256 |
| Experts | 128 per MoE layer |
| MoE intermediate size | 704 |
| Vocab size | 262,144 |
| Max context | 262,144 tokens |
| Vision | yes (Gemma4ForConditionalGeneration) |
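
The figures above are enough to estimate where most of the parameters live. The following sketch assumes that every one of the 30 layers has an MoE block and that each expert consists of three projections (gate, up, down) between the hidden size and the MoE intermediate size; both are assumptions for this estimate, not statements from the model config.

```typescript
const hidden = 2816;
const moeIntermediate = 704;
const expertsPerLayer = 128;
const layers = 30;
const vocab = 262144;

// Assumption: each expert has gate, up and down projections of shape hidden x moeIntermediate.
const paramsPerExpert = 3 * hidden * moeIntermediate; // 5,947,392
const expertParamsPerLayer = paramsPerExpert * expertsPerLayer; // 761,266,176
const totalExpertParams = expertParamsPerLayer * layers; // 22,837,985,280

// Token embedding, shared with lm_head because the weights are tied.
const embedParams = vocab * hidden; // 738,197,504

console.log((totalExpertParams / 1e9).toFixed(1)); // 22.8
console.log((embedParams / 1e9).toFixed(2)); // 0.74
```

Under these assumptions the experts account for roughly 22.8B of the ~26B total, which is why the 3-bit and 4-bit expert projections dominate the final file size.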

Usage

import { loadSession } from '@mlx-node/lm';

const session = await loadSession('./Gemma-4-26B-A4B-IT-UD-Q3_K_XL-mlx');

for await (const event of session.sendStream('Explain the MoE architecture in Gemma-4.', {
  config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
  if (!event.done) process.stdout.write(event.text);
}
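
When the full reply is needed as a string rather than printed incrementally, the stream can be drained into a buffer. This is a minimal sketch: `collectText` is a hypothetical helper, and the `StreamEvent` shape is inferred from the snippet above (a `done` flag and a `text` field), so the real event type may carry more fields.

```typescript
// Event shape inferred from the usage snippet; an assumption, not the library's declared type.
interface StreamEvent {
  done: boolean;
  text?: string;
}

// Drain an async stream of events and concatenate the text chunks.
async function collectText(stream: AsyncIterable<StreamEvent>): Promise<string> {
  let out = '';
  for await (const event of stream) {
    if (!event.done && event.text) out += event.text;
  }
  return out;
}
```

With the session from the snippet above, this would be called as `await collectText(session.sendStream(prompt, { config }))`.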

How It Was Made

mlx convert \
  -i gemma-4-26b-a4b-it \
  -o Gemma-4-26B-A4B-IT-UD-Q3_K_XL-mlx \
  -q --q-bits 3 --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf

Acknowledgments

License

Gemma Terms of Use (inherited from base model).
