---
license: gemma
base_model: coder3101/gemma-4-26B-A4B-it-heretic
library_name: mlx
pipeline_tag: text-generation
tags:
- mlx
- quantized
- apple-silicon
- gemma-4
- moe
- heretic
- uncensored
- mixed-precision
---

# gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 (v1: asymmetric MoE recipe)

MLX mixed-precision conversion of [`coder3101/gemma-4-26B-A4B-it-heretic`](https://huggingface.co/coder3101/gemma-4-26B-A4B-it-heretic).

**v1** in the iterative quantization series. Applies an **asymmetric MoE recipe**: 8-bit on the always-on hot path (dense MLP + router), 4-bit on the sparse routed experts. Relative to the v0 standard 4-bit baseline, this recovers most of the perplexity gap to the mlx-community 4-bit reference, at a cost of about 1 GB extra disk and about 10% generation speed.

## Quantization Recipe

| Component | Bits | Group size | Why |
|-----------|------|------------|-----|
| `*.mlp.gate_proj` (dense) | **8** | 64 | always-on hot path, every token routes through it |
| `*.mlp.up_proj` (dense) | **8** | 64 | same |
| `*.mlp.down_proj` (dense) | **8** | 64 | same |
| `*.router.proj` | **8** | 64 | routing decisions are 1×N, error compounds |
| `*.experts.switch_glu.*` | 4 | 64 | sparse top-8 / 128, error averages out |
| Attention (q/k/v/o) | 4 | 64 | default mlx-lm |
| embed / norms | default | - | mlx-lm leaves these unquantized |

**Effective bpw**: 4.587 (vs. v0's ~4.5). 30 layers × 4 overrides (three dense MLP projections + router) = 120 per-layer 8-bit specs.

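
The effective bpw figure can be sanity-checked with a little arithmetic. The sketch below is an estimate under stated assumptions, not the way mlx-lm computes the number: it assumes affine quantization that stores one 16-bit scale and one 16-bit bias per group of 64 weights, and it ignores tensors left unquantized. The helper names `bits_per_weight` and `fraction_at_8bit` are hypothetical and not part of mlx-lm.

```python
# Rough estimate only. Assumes each quantization group carries a 16-bit scale
# and a 16-bit bias (32 overhead bits), and ignores unquantized tensors.

def bits_per_weight(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Storage cost per weight: payload bits plus amortized group overhead."""
    return bits + overhead_bits / group_size


def fraction_at_8bit(effective_bpw: float, group_size: int = 64) -> float:
    """Share of weights that must be 8-bit to reach a given effective bpw,
    when the rest are 4-bit with the same group size."""
    low = bits_per_weight(4, group_size)
    high = bits_per_weight(8, group_size)
    return (effective_bpw - low) / (high - low)


bpw_4 = bits_per_weight(4)  # 4.5, consistent with v0's ~4.5
bpw_8 = bits_per_weight(8)  # 8.5
share_8bit = fraction_at_8bit(4.587)

# 30 layers, each with three dense MLP projections plus one router projection.
overrides = 30 * (3 + 1)

print(f"4-bit: {bpw_4} bpw, 8-bit: {bpw_8} bpw")
print(f"implied 8-bit share of weights: {share_8bit:.2%}")
print(f"per-layer 8-bit overrides: {overrides}")
```

Under these assumptions only about 2% of the weights sit at 8-bit, which is consistent with the small increase in disk size.
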
Implemented via a `quant_predicate` callback ([source](https://github.com/BRlin/mlx-model-lab/blob/main/scripts/convert_gemma4_moe_custom.py)):

```python
def gemma4_moe_predicate(path, _module):
    # Dense (always-on) MLP projections: 8-bit.
    if any(s in path for s in (".mlp.gate_proj", ".mlp.up_proj", ".mlp.down_proj")):
        return {"group_size": 64, "bits": 8}
    # Router projection: 8-bit.
    if path.endswith("router.proj"):
        return {"group_size": 64, "bits": 8}
    # Everything else falls back to the base 4-bit settings.
    return True
```

## Benchmarks (Apple M4 Pro 48GB, mlx-lm 0.31.2)

### Quality

| Metric | v0 (standard 4-bit) | **v1 (mixed 4/8)** | Δ |
|--------|---------------------|---------------------|---|
| Perplexity (lower is better) | 156.93 ± 2.77 | **119.87 ± 2.09** | **−23.6%** ✅ |
| Eval time | 226 s | 184 s | −19% |
| Eval throughput (tok/s) | 579 | 710 | +23% |

Dataset: `allenai/tulu-3-sft-mixture`, 256 samples × 512 tok = 131,072 tokens, batch 8.

**Reference**: `mlx-community/...-4bit` reports PPL ~109.4 on the same eval. v1 narrows the gap to that reference from 43% (v0) to **9.6%**.

### Generation Speed

| Metric | v0 | **v1** | Δ |
|--------|-----|--------|---|
| Prefill (tok/s) | 769 | 729 | −5.2% |
| Generation (tok/s) | 75.1 | 67.6 | −10% |
| Inference peak memory (GB) | 14.7 | 15.0 | +0.3 GB |

Test config: `prompt_tokens=512, generation_tokens=128, batch_size=1`, 5 trials averaged.

### Disk Footprint

| Variant | Size |
|---------|------|
| Original (bf16) | ~52 GB |
| v0 (standard 4-bit) | 13 GB |
| **v1 (mixed 4/8)** | **14 GB** |

## Quality vs. Speed Trade-off

| | v0 | v1 | Verdict |
|---|---|---|---------|
| PPL | 156.93 | 119.87 | **v1 23.6% lower (better)** |
| Gen TPS | 75.1 | 67.6 | v0 11% faster |

For most use cases, v1 is the better default: the perplexity improvement is large and visible in generation quality, while the speed cost is small.

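
The percentage figures in the benchmark tables follow directly from the raw values. The sketch below recomputes them as a consistency check; `pct_change` is an illustrative helper introduced here, and the numbers are copied from the tables in this card.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Values copied from the benchmark tables in this card.
ppl_v0, ppl_v1, ppl_ref = 156.93, 119.87, 109.4
gen_v0, gen_v1 = 75.1, 67.6
prefill_v0, prefill_v1 = 769, 729

ppl_delta = pct_change(ppl_v0, ppl_v1)              # v1 perplexity vs. v0
gap_v0 = pct_change(ppl_ref, ppl_v0)                # v0 above the reference
gap_v1 = pct_change(ppl_ref, ppl_v1)                # v1 above the reference
gen_delta = pct_change(gen_v0, gen_v1)              # generation speed cost
prefill_delta = pct_change(prefill_v0, prefill_v1)  # prefill speed cost

print(f"PPL change v0 -> v1:      {ppl_delta:+.1f}%")
print(f"Gap to reference (v0):    {gap_v0:+.1f}%")
print(f"Gap to reference (v1):    {gap_v1:+.1f}%")
print(f"Generation speed change:  {gen_delta:+.1f}%")
print(f"Prefill speed change:     {prefill_delta:+.1f}%")
```

Note that the "v0 11% faster" verdict uses v1 as the baseline (75.1 / 67.6), whereas the −10% in the speed table uses v0 as the baseline.
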
## Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100, verbose=True)
```

CLI:

```bash
mlx_lm.generate --model BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 \
  --prompt "Explain quantization in one paragraph." --max-tokens 200
```

## Variant Index

| Version | Repo | Recipe | PPL | Gen TPS | Disk | Status |
|---------|------|--------|-----|---------|------|--------|
| v0 | [`gemma-4-26B-A4B-it-heretic-mlx-4bit`](https://huggingface.co/BRlin/gemma-4-26B-A4B-it-heretic-mlx-4bit) | Standard 4-bit | 156.93 | 75.1 | 13 GB | baseline |
| **v1** (this) | `gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8` | 8-bit dense MLP + router, 4-bit experts | **119.87** | 67.6 | 14 GB | **recommended default** |
| v2 | `gemma-4-26B-A4B-it-heretic-mlx-awq-mixed-4-8` | v1 + AWQ calibration | TBD | TBD | TBD | planned |
| v3 | `gemma-4-26B-A4B-it-heretic-mlx-dwq-mixed-4-8` | v1/v2 + DWQ distillation | TBD | TBD | TBD | planned |

## Hardware & Software

- **Hardware**: Apple M4 Pro, 48 GB unified memory, 20 GPU cores
- **Software**: macOS 15, mlx 0.31.1, mlx-lm 0.31.2, Python 3.12.9

## Known Risks

- **Metal kernel bug ([ml-explore/mlx#3393](https://github.com/ml-explore/mlx/issues/3393))**: Gemma-4 26B-A4B (128 experts top-8) produces garbage on base M4 (10 GPU cores). This v1 was converted on M4 Pro (20 cores) and produces coherent output, but **untested on lower-end M4**.

## Acknowledgements

- [coder3101](https://huggingface.co/coder3101): original Heretic-aligned weights
- [mlx-community](https://huggingface.co/mlx-community): reference recipe inspiration
- [Alex Barron](https://github.com/barronalex): `quant_predicate` API contribution to mlx-lm
- [APEX (Hu et al. 2025)](https://arxiv.org/abs/2506.04450), [QuantMoE-Bench (Liu et al. 2024)](https://arxiv.org/abs/2406.08155): empirical validation of asymmetric MoE quantization

## License

Inherits from base model: [Gemma Terms of Use](https://ai.google.dev/gemma/terms).