---
license: gemma
base_model: coder3101/gemma-4-26B-A4B-it-heretic
library_name: mlx
pipeline_tag: text-generation
tags:
  - mlx
  - quantized
  - apple-silicon
  - gemma-4
  - moe
  - heretic
  - uncensored
  - mixed-precision
---

# gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 (v1 — asymmetric MoE recipe)

MLX mixed-precision conversion of [`coder3101/gemma-4-26B-A4B-it-heretic`](https://huggingface.co/coder3101/gemma-4-26B-A4B-it-heretic).

**v1** in the iterative quantization series. Applies an **asymmetric MoE recipe**:
8-bit on the always-on hot path (dense MLP + router), 4-bit on sparse routed
experts. Recovers most of the perplexity gap vs. the v0 standard 4-bit baseline
at the cost of only ~1 GB extra disk and ~10% generation speed.

## Quantization Recipe

| Component | Bits | Group size | Why |
|-----------|------|------------|-----|
| `*.mlp.gate_proj` (dense) | **8** | 64 | always-on hot path, every token routes through it |
| `*.mlp.up_proj` (dense) | **8** | 64 | same |
| `*.mlp.down_proj` (dense) | **8** | 64 | same |
| `*.router.proj` | **8** | 64 | routing decisions are 1×N, error compounds |
| `*.experts.switch_glu.*` | 4 | 64 | sparse top-8 / 128, error averages out |
| Attention (q/k/v/o) | 4 | 64 | default mlx-lm |
| embed / norms | default | — | mlx-lm leaves these unquantized |

**Effective bpw**: 4.587 (vs. v0's ~4.5). 30 layers × 4 overrides = 120 per-layer 8-bit specs.

Implemented via `quant_predicate` callback ([source](https://github.com/BRlin/mlx-model-lab/blob/main/scripts/convert_gemma4_moe_custom.py)):

```python
def gemma4_moe_predicate(path, _module):
    if any(s in path for s in (".mlp.gate_proj", ".mlp.up_proj", ".mlp.down_proj")):
        return {"group_size": 64, "bits": 8}
    if path.endswith("router.proj"):
        return {"group_size": 64, "bits": 8}
    return True  # base 4-bit
```

## Benchmarks (Apple M4 Pro 48GB, mlx-lm 0.31.2)

### Quality

| Metric | v0 (standard 4-bit) | **v1 (mixed 4/8)** | Δ |
|--------|---------------------|---------------------|---|
| Perplexity | 156.93 ± 2.77 | **119.87 ± 2.09** | **−23.6%** ✅ |
| Eval time | 226 s | 184 s | −19% |
| Eval throughput (tok/s) | 579 | 710 | +23% |

Dataset: `allenai/tulu-3-sft-mixture`, 256 samples × 512 tok = 131,072 tokens, batch 8.

**Reference**: `mlx-community/...-4bit` reports PPL ~109.4 on the same eval. v1 closes the gap from 43% (v0) to **9.6%** of mlx-community.

### Generation Speed

| Metric | v0 | **v1** | Δ |
|--------|-----|--------|---|
| Prefill (tok/s) | 769 | 729 | −5.2% |
| Generation (tok/s) | 75.1 | 67.6 | −10% |
| Inference peak memory (GB) | 14.7 | 15.0 | +0.3 GB |

Test config: `prompt_tokens=512, generation_tokens=128, batch_size=1`, 5 trials averaged.

### Disk Footprint

| Variant | Size |
|---------|------|
| Original (bf16) | ~52 GB |
| v0 (standard 4-bit) | 13 GB |
| **v1 (mixed 4/8)** | **14 GB** |

## Quality vs. Speed Trade-off

| | v0 | v1 | Verdict |
|---|---|---|---------|
| PPL | 156.93 | 119.87 | **v1 +23.6%** |
| Gen TPS | 75.1 | 67.6 | v0 +11% |

For most use cases, v1 is the better default — the perplexity improvement is
large and visible in generation quality, while the speed cost is small.

## Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100, verbose=True)
```

CLI:

```bash
mlx_lm.generate --model BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 \
  --prompt "Explain quantization in one paragraph." --max-tokens 200
```

## Variant Index

| Version | Repo | Recipe | PPL | Gen TPS | Disk | Status |
|---------|------|--------|-----|---------|------|--------|
| v0 | [`gemma-4-26B-A4B-it-heretic-mlx-4bit`](https://huggingface.co/BRlin/gemma-4-26B-A4B-it-heretic-mlx-4bit) | Standard 4-bit | 156.93 | 75.1 | 13 GB | baseline |
| **v1** (this) | `gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8` | 8-bit dense MLP + router, 4-bit experts | **119.87** | 67.6 | 14 GB | **recommended default** |
| v2 | `gemma-4-26B-A4B-it-heretic-mlx-awq-mixed-4-8` | v1 + AWQ calibration | TBD | TBD | TBD | planned |
| v3 | `gemma-4-26B-A4B-it-heretic-mlx-dwq-mixed-4-8` | v1/v2 + DWQ distillation | TBD | TBD | TBD | planned |

## Hardware & Software

- **Hardware**: Apple M4 Pro, 48 GB unified memory, 20 GPU cores
- **Software**: macOS 15, mlx 0.31.1, mlx-lm 0.31.2, Python 3.12.9

## Known Risks

- **Metal kernel bug ([ml-explore/mlx#3393](https://github.com/ml-explore/mlx/issues/3393))**:
  Gemma-4 26B-A4B (128 experts top-8) produces garbage on base M4 (10 GPU cores).
  This v1 was converted on M4 Pro (20 cores) and produces coherent output, but
  **untested on lower-end M4**.

## Acknowledgements

- [coder3101](https://huggingface.co/coder3101) — original Heretic-aligned weights
- [mlx-community](https://huggingface.co/mlx-community) — reference recipe inspiration
- [Alex Barron](https://github.com/barronalex) — `quant_predicate` API contribution to mlx-lm
- [APEX (Hu et al. 2025)](https://arxiv.org/abs/2506.04450), [QuantMoE-Bench (Liu et al. 2024)](https://arxiv.org/abs/2406.08155) — empirical validation of asymmetric MoE quantization

## License

Inherits from base model: [Gemma Terms of Use](https://ai.google.dev/gemma/terms).