gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 (v1 — asymmetric MoE recipe)

MLX mixed-precision conversion of coder3101/gemma-4-26B-A4B-it-heretic.

v1 in the iterative quantization series. Applies an asymmetric MoE recipe: 8-bit on the always-on hot path (dense MLP + router), 4-bit on sparse routed experts. Recovers most of the perplexity gap vs. the v0 standard 4-bit baseline at the cost of only ~1 GB extra disk and ~10% generation speed.

Quantization Recipe

Component Bits Group size Why
*.mlp.gate_proj (dense) 8 64 always-on hot path, every token routes through it
*.mlp.up_proj (dense) 8 64 same
*.mlp.down_proj (dense) 8 64 same
*.router.proj 8 64 routing decisions are 1×N, error compounds
*.experts.switch_glu.* 4 64 sparse top-8 / 128, error averages out
Attention (q/k/v/o) 4 64 default mlx-lm
embed / norms default mlx-lm leaves these unquantized

Effective bpw: 4.587 (vs. v0's ~4.5). 30 layers × 4 overrides = 120 per-layer 8-bit specs.

Implemented via quant_predicate callback (source):

def gemma4_moe_predicate(path, _module):
    if any(s in path for s in (".mlp.gate_proj", ".mlp.up_proj", ".mlp.down_proj")):
        return {"group_size": 64, "bits": 8}
    if path.endswith("router.proj"):
        return {"group_size": 64, "bits": 8}
    return True  # base 4-bit

Benchmarks (Apple M4 Pro 48GB, mlx-lm 0.31.2)

Quality

Metric v0 (standard 4-bit) v1 (mixed 4/8) Δ
Perplexity 156.93 ± 2.77 119.87 ± 2.09 −23.6%
Eval time 226 s 184 s −19%
Eval throughput (tok/s) 579 710 +23%

Dataset: allenai/tulu-3-sft-mixture, 256 samples × 512 tok = 131,072 tokens, batch 8.

Reference: mlx-community/...-4bit reports PPL ~109.4 on the same eval. v1 closes the gap from 43% (v0) to 9.6% of mlx-community.

Generation Speed

Metric v0 v1 Δ
Prefill (tok/s) 769 729 −5.2%
Generation (tok/s) 75.1 67.6 −10%
Inference peak memory (GB) 14.7 15.0 +0.3 GB

Test config: prompt_tokens=512, generation_tokens=128, batch_size=1, 5 trials averaged.

Disk Footprint

Variant Size
Original (bf16) ~52 GB
v0 (standard 4-bit) 13 GB
v1 (mixed 4/8) 14 GB

Quality vs. Speed Trade-off

v0 v1 Verdict
PPL 156.93 119.87 v1 +23.6%
Gen TPS 75.1 67.6 v0 +11%

For most use cases, v1 is the better default — the perplexity improvement is large and visible in generation quality, while the speed cost is small.

Usage

from mlx_lm import load, generate

model, tokenizer = load("BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100, verbose=True)

CLI:

mlx_lm.generate --model BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 \
  --prompt "Explain quantization in one paragraph." --max-tokens 200

Variant Index

Version Repo Recipe PPL Gen TPS Disk Status
v0 gemma-4-26B-A4B-it-heretic-mlx-4bit Standard 4-bit 156.93 75.1 13 GB baseline
v1 (this) gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 8-bit dense MLP + router, 4-bit experts 119.87 67.6 14 GB recommended default
v2 gemma-4-26B-A4B-it-heretic-mlx-awq-mixed-4-8 v1 + AWQ calibration TBD TBD TBD planned
v3 gemma-4-26B-A4B-it-heretic-mlx-dwq-mixed-4-8 v1/v2 + DWQ distillation TBD TBD TBD planned

Hardware & Software

  • Hardware: Apple M4 Pro, 48 GB unified memory, 20 GPU cores
  • Software: macOS 15, mlx 0.31.1, mlx-lm 0.31.2, Python 3.12.9

Known Risks

  • Metal kernel bug (ml-explore/mlx#3393): Gemma-4 26B-A4B (128 experts top-8) produces garbage on base M4 (10 GPU cores). This v1 was converted on M4 Pro (20 cores) and produces coherent output, but untested on lower-end M4.

Acknowledgements

License

Inherits from base model: Gemma Terms of Use.

Downloads last month
84
Safetensors
Model size
25B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8

Quantized
(17)
this model

Papers for BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8