gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 (v1 — asymmetric MoE recipe)
MLX mixed-precision conversion of coder3101/gemma-4-26B-A4B-it-heretic.
v1 in the iterative quantization series. It applies an asymmetric MoE recipe: 8-bit on the always-on hot path (dense MLP + router) and 4-bit on the sparse routed experts. Relative to the v0 standard 4-bit baseline, perplexity drops from 156.93 to 119.87 (−23.6%) at the cost of about 1 GB extra disk and about 10% generation speed.
Quantization Recipe
| Component | Bits | Group size | Why |
|---|---|---|---|
| `*.mlp.gate_proj` (dense) | 8 | 64 | always-on hot path, every token routes through it |
| `*.mlp.up_proj` (dense) | 8 | 64 | same |
| `*.mlp.down_proj` (dense) | 8 | 64 | same |
| `*.router.proj` | 8 | 64 | routing decisions are 1×N, error compounds |
| `*.experts.switch_glu.*` | 4 | 64 | sparse top-8 / 128, error averages out |
| Attention (q/k/v/o) | 4 | 64 | default mlx-lm |
| embed / norms | default | — | mlx-lm leaves these unquantized |
Effective bpw: 4.587 (vs. v0's ~4.5). 30 layers × 4 overrides = 120 per-layer 8-bit specs.
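The effective bits-per-weight figure can be sanity-checked with a little arithmetic. The sketch below assumes that affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights (0.5 bpw of overhead), and it ignores unquantized tensors. The 8-bit fraction is back-derived from the reported 4.587, not measured, and all function names are illustrative.

```python
GROUP_SIZE = 64
# Assumption: each group carries a 16-bit scale and a 16-bit bias.
OVERHEAD_BITS_PER_GROUP = 32


def bits_per_weight(bits, group_size=GROUP_SIZE):
    """Storage cost per weight, including per-group scale/bias overhead."""
    return bits + OVERHEAD_BITS_PER_GROUP / group_size


def mixed_bpw(frac_8bit):
    """Average bpw when a fraction of weights is 8-bit and the rest 4-bit."""
    return frac_8bit * bits_per_weight(8) + (1 - frac_8bit) * bits_per_weight(4)


def implied_8bit_fraction(effective_bpw):
    """Invert mixed_bpw: which 8-bit fraction yields this average?"""
    low, high = bits_per_weight(4), bits_per_weight(8)
    return (effective_bpw - low) / (high - low)


frac = implied_8bit_fraction(4.587)
n_overrides = 30 * 4  # 30 layers x (gate_proj, up_proj, down_proj, router.proj)

print(f"4-bit cost: {bits_per_weight(4):.3f} bpw")
print(f"implied 8-bit fraction: {frac:.2%}")
print(f"per-layer 8-bit specs: {n_overrides}")
```

Under these assumptions only about 2% of the weights sit at 8 bits, which would be consistent with the routed experts holding the bulk of the parameters.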
Implemented via a `quant_predicate` callback (source):

    def gemma4_moe_predicate(path, _module):
        # Dense (always-on) MLP projections: 8-bit
        if any(s in path for s in (".mlp.gate_proj", ".mlp.up_proj", ".mlp.down_proj")):
            return {"group_size": 64, "bits": 8}
        # Router projection: 8-bit
        if path.endswith("router.proj"):
            return {"group_size": 64, "bits": 8}
        return True  # base 4-bit
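To see which layers the predicate overrides, it can be run against a few module paths. The paths below are hypothetical examples shaped after the patterns in the recipe table; the real parameter names in the converted model may differ. In mlx-lm a callback like this is typically passed as the `quant_predicate` argument of `convert`, but that wiring is an assumption here, so check the signature in your installed version.

```python
def gemma4_moe_predicate(path, _module):
    # Dense (always-on) MLP projections: 8-bit
    if any(s in path for s in (".mlp.gate_proj", ".mlp.up_proj", ".mlp.down_proj")):
        return {"group_size": 64, "bits": 8}
    # Router projection: 8-bit
    if path.endswith("router.proj"):
        return {"group_size": 64, "bits": 8}
    return True  # fall through to the base 4-bit setting


# Hypothetical module paths, for illustration only.
sample_paths = [
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.router.proj",
    "model.layers.0.experts.switch_glu.up_proj",
    "model.layers.0.self_attn.q_proj",
]

for p in sample_paths:
    # The module argument is unused by this predicate, so None is enough here.
    print(p, "->", gemma4_moe_predicate(p, None))
```

Returning a dict overrides the quantization parameters for that layer, while returning `True` leaves it on the default setting. Note that the expert projections do not match the `.mlp.*` substrings, so they stay at 4 bits.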
Benchmarks (Apple M4 Pro 48GB, mlx-lm 0.31.2)
Quality
| Metric | v0 (standard 4-bit) | v1 (mixed 4/8) | Δ |
|---|---|---|---|
| Perplexity | 156.93 ± 2.77 | 119.87 ± 2.09 | −23.6% ✅ |
| Eval time | 226 s | 184 s | −19% |
| Eval throughput (tok/s) | 579 | 710 | +23% |
Dataset: allenai/tulu-3-sft-mixture, 256 samples × 512 tok = 131,072 tokens, batch 8.
Reference: mlx-community/...-4bit reports PPL ~109.4 on the same eval. Measured against that reference, v1 narrows the perplexity gap from 43% (v0) to 9.6%.
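The percentages in this section follow directly from the reported numbers. A minimal sketch of the arithmetic, using only the values quoted above (the helper name `relative_gap` is illustrative):

```python
def relative_gap(value, reference):
    """Fractional difference of value relative to reference."""
    return (value - reference) / reference


PPL_REF = 109.4            # mlx-community reference, as reported above
PPL_V0, PPL_V1 = 156.93, 119.87
TOKENS = 256 * 512         # samples x tokens per sample

gap_v0 = relative_gap(PPL_V0, PPL_REF)
gap_v1 = relative_gap(PPL_V1, PPL_REF)
ppl_change = relative_gap(PPL_V1, PPL_V0)

# Throughput recomputed from the rounded eval times in the table.
tput_v0 = TOKENS / 226
tput_v1 = TOKENS / 184

print(f"gap to reference: v0 {gap_v0:.1%}, v1 {gap_v1:.1%}")
print(f"v1 vs v0 perplexity: {ppl_change:.1%}")
print(f"eval throughput: v0 {tput_v0:.0f} tok/s, v1 {tput_v1:.0f} tok/s")
```

The recomputed throughputs land within a few tok/s of the table values; the small difference is expected because the eval times are rounded to whole seconds.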
Generation Speed
| Metric | v0 | v1 | Δ |
|---|---|---|---|
| Prefill (tok/s) | 769 | 729 | −5.2% |
| Generation (tok/s) | 75.1 | 67.6 | −10% |
| Inference peak memory (GB) | 14.7 | 15.0 | +0.3 GB |
Test config: prompt_tokens=512, generation_tokens=128, batch_size=1, 5 trials averaged.
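For a feel of what these rates mean per request, the sketch below estimates end-to-end latency for the benchmark shape (512 prompt tokens, 128 generated tokens). It assumes prefill and decode run back to back at the constant rates from the table, which is a simplification; `request_latency_s` is an illustrative helper, not part of mlx-lm.

```python
def request_latency_s(prompt_tokens, gen_tokens, prefill_tps, gen_tps):
    """Prefill time plus decode time, assuming constant throughput in each phase."""
    return prompt_tokens / prefill_tps + gen_tokens / gen_tps


latency_v0 = request_latency_s(512, 128, prefill_tps=769, gen_tps=75.1)
latency_v1 = request_latency_s(512, 128, prefill_tps=729, gen_tps=67.6)
slowdown = latency_v1 / latency_v0 - 1

print(f"v0: {latency_v0:.2f} s, v1: {latency_v1:.2f} s, slowdown: {slowdown:.1%}")
```

With this request shape the total is dominated by decoding, so the end-to-end slowdown stays close to the generation-speed delta rather than the smaller prefill delta.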
Disk Footprint
| Variant | Size |
|---|---|
| Original (bf16) | ~52 GB |
| v0 (standard 4-bit) | 13 GB |
| v1 (mixed 4/8) | 14 GB |
Quality vs. Speed Trade-off
| Metric | v0 | v1 | Verdict |
|---|---|---|---|
| PPL (lower is better) | 156.93 | 119.87 | v1 better by 23.6% |
| Gen TPS (higher is better) | 75.1 | 67.6 | v0 faster by 11% |
For most use cases, v1 is the better default: the 23.6% perplexity improvement is large and visible in generation quality, while the speed cost (about 10% on generation) is small.
Usage
from mlx_lm import load, generate
model, tokenizer = load("BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100, verbose=True)
CLI:
mlx_lm.generate --model BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 \
--prompt "Explain quantization in one paragraph." --max-tokens 200
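Instruction-tuned checkpoints generally expect chat-formatted input, and the Python call above passes a bare string. A minimal sketch that applies the tokenizer's chat template first is shown below; `to_messages` and `generate_reply` are hypothetical helper names, and `mlx_lm` is imported lazily so the helpers can be defined without it installed.

```python
MODEL_ID = "BRlin/gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8"


def to_messages(user_prompt):
    """Wrap a single user turn in the chat-message format."""
    return [{"role": "user", "content": user_prompt}]


def generate_reply(user_prompt, max_tokens=200):
    """Load the model, apply the chat template, and generate a reply."""
    # Imported lazily so this file can be imported without mlx-lm installed.
    from mlx_lm import load, generate

    model, tokenizer = load(MODEL_ID)
    prompt = tokenizer.apply_chat_template(
        to_messages(user_prompt), add_generation_prompt=True
    )
    return generate(
        model, tokenizer, prompt=prompt, max_tokens=max_tokens, verbose=True
    )
```

Calling `generate_reply("Explain quantization in one paragraph.")` on Apple silicon with mlx-lm installed would download the weights on first use and print the generated text as it streams.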
Variant Index
| Version | Repo | Recipe | PPL | Gen TPS | Disk | Status |
|---|---|---|---|---|---|---|
| v0 | gemma-4-26B-A4B-it-heretic-mlx-4bit | Standard 4-bit | 156.93 | 75.1 | 13 GB | baseline |
| v1 (this) | gemma-4-26B-A4B-it-heretic-mlx-mixed-4-8 | 8-bit dense MLP + router, 4-bit experts | 119.87 | 67.6 | 14 GB | recommended default |
| v2 | gemma-4-26B-A4B-it-heretic-mlx-awq-mixed-4-8 | v1 + AWQ calibration | TBD | TBD | TBD | planned |
| v3 | gemma-4-26B-A4B-it-heretic-mlx-dwq-mixed-4-8 | v1/v2 + DWQ distillation | TBD | TBD | TBD | planned |
Hardware & Software
- Hardware: Apple M4 Pro, 48 GB unified memory, 20 GPU cores
- Software: macOS 15, mlx 0.31.1, mlx-lm 0.31.2, Python 3.12.9
Known Risks
- Metal kernel bug (ml-explore/mlx#3393): Gemma-4 26B-A4B (128 experts, top-8) produces garbage output on the base M4 (10 GPU cores). This v1 was converted on an M4 Pro (20 GPU cores) and produces coherent output there, but it has not been tested on lower-end M4 chips.
Acknowledgements
- coder3101: original Heretic-aligned weights
- mlx-community: reference recipe inspiration
- Alex Barron: `quant_predicate` API contribution to mlx-lm
- APEX (Hu et al. 2025), QuantMoE-Bench (Liu et al. 2024): empirical validation of asymmetric MoE quantization
License
Inherits from base model: Gemma Terms of Use.