Instructions for using BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 with libraries and local apps.

Libraries

How to use BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8

Run Hermes

hermes

MLX LM

How to use BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
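
The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server listens on port 8080, the port used in the Pi and Hermes configurations above. The helper names `build_chat_request` and `chat`, and the `max_tokens` value, are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8"


def build_chat_request(user_message, base_url="http://localhost:8080/v1", max_tokens=200):
    """Build a POST request for the OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,  # illustrative limit, adjust as needed
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message, **kwargs):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(user_message, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. If the server was started with a different `--port`, pass a matching `base_url`.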

gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 (v1 — asymmetric MoE recipe)

MLX mixed-precision conversion of coder3101/gemma-4-26B-A4B-it-heretic.

v1 in the iterative quantization series. It applies an asymmetric MoE recipe: 8-bit on the always-on hot path (dense MLP + router) and 4-bit on the sparse routed experts. Relative to the v0 standard 4-bit baseline, perplexity drops by 23.6% (156.93 to 119.87), at a cost of about 1 GB of extra disk and about 10% generation speed.

Quantization Recipe

Component	Bits	Group size	Why
`*.mlp.gate_proj` (dense)	8	64	always-on hot path, every token routes through it
`*.mlp.up_proj` (dense)	8	64	same
`*.mlp.down_proj` (dense)	8	64	same
`*.router.proj`	8	64	routing decisions are 1×N, error compounds
`*.experts.switch_glu.*`	4	64	sparse top-8 / 128, error averages out
Attention (q/k/v/o)	4	64	default mlx-lm
embed / norms	default	—	mlx-lm leaves these unquantized

Effective bpw: 4.587 (vs. v0's ~4.5). 30 layers × 4 overrides = 120 per-layer 8-bit specs.
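
As a rough cross-check of these figures, the sketch below works out bits per weight under two assumptions that are not stated in this card: each quantization group stores a 16-bit scale and a 16-bit bias (32 bits of overhead per group), and the effective figure is a weighted average over weights stored at either 4 or 8 bits. Under those assumptions, 4-bit with group size 64 costs 4.5 bpw, which matches the v0 figure, and an effective 4.587 bpw corresponds to roughly 2.2% of the quantized weights being held at 8-bit.

```python
def bits_per_weight(bits, group_size=64, overhead_bits=32):
    """Storage cost per weight: the quantized value plus its share of the
    per-group scale and bias (assumed to be 16 bits each)."""
    return bits + overhead_bits / group_size


def eight_bit_fraction(effective_bpw, low_bits=4, high_bits=8, group_size=64):
    """Fraction of weights stored at high_bits that yields effective_bpw,
    assuming every counted weight is at either low_bits or high_bits."""
    low = bits_per_weight(low_bits, group_size)
    high = bits_per_weight(high_bits, group_size)
    return (effective_bpw - low) / (high - low)


print(bits_per_weight(4))  # 4.5
print(bits_per_weight(8))  # 8.5
print(f"{eight_bit_fraction(4.587):.2%} of quantized weights at 8-bit")
```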

Implemented via quant_predicate callback (source):

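def gemma4_moe_predicate(path, _module):
    """Return per-layer quantization settings for the asymmetric MoE recipe."""
    # Dense MLP projections sit on the always-on hot path: keep them at 8-bit
    if any(s in path for s in (".mlp.gate_proj", ".mlp.up_proj", ".mlp.down_proj")):
        return {"group_size": 64, "bits": 8}
    # Router projection: 8-bit, since routing errors compound
    if path.endswith("router.proj"):
        return {"group_size": 64, "bits": 8}
    # Everything else (routed experts, attention) uses the base 4-bit settings
    return True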
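
To see which layers the predicate promotes, it can be dry-run against a few module paths before starting a full conversion. The sketch below repeats the predicate so that it runs on its own. The paths in `EXAMPLE_PATHS` are hypothetical, written to resemble the patterns in the recipe table rather than copied from the model. The `convert_with_recipe` helper is likewise an assumption about how the predicate could be passed to `mlx_lm.convert`; it is not the author's conversion script.

```python
def gemma4_moe_predicate(path, _module):
    if any(s in path for s in (".mlp.gate_proj", ".mlp.up_proj", ".mlp.down_proj")):
        return {"group_size": 64, "bits": 8}
    if path.endswith("router.proj"):
        return {"group_size": 64, "bits": 8}
    return True  # base 4-bit


# Hypothetical module paths, for illustration only.
EXAMPLE_PATHS = [
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.router.proj",
    "model.layers.0.experts.switch_glu.gate_proj",
    "model.layers.0.self_attn.q_proj",
]


def resolved_bits(path, base_bits=4):
    """Bit width the predicate selects for a path (True means the base setting)."""
    decision = gemma4_moe_predicate(path, None)
    return decision["bits"] if isinstance(decision, dict) else base_bits


for path in EXAMPLE_PATHS:
    print(f"{path}: {resolved_bits(path)}-bit")


def convert_with_recipe(hf_path, mlx_path):
    """Assumed usage: run mlx-lm's converter with the predicate attached."""
    # Imported lazily so the dry run above works without mlx-lm installed.
    from mlx_lm import convert

    convert(
        hf_path,
        mlx_path=mlx_path,
        quantize=True,
        q_bits=4,
        q_group_size=64,
        quant_predicate=gemma4_moe_predicate,
    )
```

Note that the expert path does not match `.mlp.gate_proj` because its parent segment is `switch_glu`, so it falls through to the 4-bit default.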

Benchmarks (Apple M4 Pro 48GB, mlx-lm 0.31.2)

Quality

Metric	v0 (standard 4-bit)	v1 (mixed 4/8)	Δ
Perplexity	156.93 ± 2.77	119.87 ± 2.09	−23.6% ✅
Eval time	226 s	184 s	−19%
Eval throughput (tok/s)	579	710	+23%

Dataset: allenai/tulu-3-sft-mixture, 256 samples × 512 tok = 131,072 tokens, batch 8.
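
For readers unfamiliar with the metric, the sketch below shows the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood. It is assumed here that the reported numbers follow this definition; the exact evaluation code in mlx-lm may differ in details such as batching or masking. The `perplexity` helper and the toy inputs are illustrative.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Token budget of the evaluation described above.
num_tokens = 256 * 512
print(num_tokens)  # 131072

# Toy check: a model that assigns probability 1/2 to every token
# has a perplexity of approximately 2.
toy_nlls = [math.log(2.0)] * 8
print(perplexity(toy_nlls))
```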

Reference: mlx-community/...-4bit reports PPL ~109.4 on the same eval. Relative to that reference, v1 narrows the perplexity gap from 43% (v0) to 9.6%.
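
The gap figures follow directly from the perplexities in the table. A short worked check, using only numbers quoted above (the helper names are illustrative):

```python
REFERENCE_PPL = 109.4  # mlx-community 4-bit reference, as quoted above
V0_PPL = 156.93
V1_PPL = 119.87


def relative_gap(ppl, reference_ppl=REFERENCE_PPL):
    """How far a perplexity sits above the reference, as a fraction."""
    return (ppl - reference_ppl) / reference_ppl


def gap_closed(before, after, reference_ppl=REFERENCE_PPL):
    """Share of the before-to-reference gap removed by moving to `after`."""
    return (before - after) / (before - reference_ppl)


print(f"v0 gap: {relative_gap(V0_PPL):.1%}")  # v0 gap: 43.4%
print(f"v1 gap: {relative_gap(V1_PPL):.1%}")  # v1 gap: 9.6%
print(f"gap closed by v1: {gap_closed(V0_PPL, V1_PPL):.0%}")  # gap closed by v1: 78%
```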

Generation Speed

Metric	v0	v1	Δ
Prefill (tok/s)	769	729	−5.2%
Generation (tok/s)	75.1	67.6	−10%
Inference peak memory (GB)	14.7	15.0	+0.3 GB

Test config: prompt_tokens=512, generation_tokens=128, batch_size=1, 5 trials averaged.
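
To put the throughput numbers in terms of wall-clock time, the sketch below estimates the duration of one request at the test configuration as prefill time plus generation time. This is a simplified model that ignores model loading and other fixed overheads, so treat the result as a rough estimate rather than a measurement. The `request_seconds` helper is illustrative.

```python
def request_seconds(prefill_tps, generation_tps, prompt_tokens=512, generation_tokens=128):
    """Estimated time for one request: prompt processing plus token generation."""
    return prompt_tokens / prefill_tps + generation_tokens / generation_tps


v0_seconds = request_seconds(prefill_tps=769, generation_tps=75.1)
v1_seconds = request_seconds(prefill_tps=729, generation_tps=67.6)

print(f"v0: {v0_seconds:.2f} s")
print(f"v1: {v1_seconds:.2f} s")
print(f"v1 is {v1_seconds / v0_seconds - 1:.1%} slower end to end")
```

Under this model, both variants complete the 512 + 128 token request in under three seconds, with v1 a little under 10% slower.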

Disk Footprint

Variant	Size
Original (bf16)	~52 GB
v0 (standard 4-bit)	13 GB
v1 (mixed 4/8)	14 GB

Quality vs. Speed Trade-off

Metric	v0	v1	Verdict
PPL (lower is better)	156.93	119.87	v1 lower by 23.6%
Gen TPS (higher is better)	75.1	67.6	v0 faster by 11%

For most use cases, v1 is the better default: perplexity is 23.6% lower, which is visible in generation quality, while generation is only about 10% slower.

Usage

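from mlx_lm import load, generate

model, tokenizer = load("BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8")

# Wrap the prompt in the chat template expected by the instruction-tuned model
messages = [{"role": "user", "content": "Hello"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100, verbose=True)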

CLI:

mlx_lm.generate --model BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 \
  --prompt "Explain quantization in one paragraph." --max-tokens 200

Variant Index

Version	Repo	Recipe	PPL	Gen TPS	Disk	Status
v0	`gemma-4-26B-A4B-it-heretic-mlx-4bit`	Standard 4-bit	156.93	75.1	13 GB	baseline
v1 (this)	`gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8`	8-bit dense MLP + router, 4-bit experts	119.87	67.6	14 GB	recommended default
v2	`gemma-4-26B-A4B-it-heretic-mlx-awq-mixed-4-8`	v1 + AWQ calibration	TBD	TBD	TBD	planned
v3	`gemma-4-26B-A4B-it-heretic-mlx-dwq-mixed-4-8`	v1/v2 + DWQ distillation	TBD	TBD	TBD	planned

Hardware & Software

Hardware: Apple M4 Pro, 48 GB unified memory, 20 GPU cores
Software: macOS 15, mlx 0.31.1, mlx-lm 0.31.2, Python 3.12.9

Known Risks

Metal kernel bug (ml-explore/mlx#3393): Gemma-4 26B-A4B (128 experts, top-8) produces garbage output on the base M4 (10 GPU cores). This v1 was converted on an M4 Pro (20 GPU cores) and produces coherent output there, but it has not been tested on lower-end M4 chips.

Acknowledgements

coder3101 — original Heretic-aligned weights
mlx-community — reference recipe inspiration
Alex Barron — quant_predicate API contribution to mlx-lm
APEX (Hu et al. 2025), QuantMoE-Bench (Liu et al. 2024) — empirical validation of asymmetric MoE quantization