---
library_name: mlx
license: gemma
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: image-text-to-text
tags:
  - mlx
  - safetensors
  - gemma4
  - moe
  - pruning
  - reap
  - expert-pruning
  - 4-bit
  - quantized
  - apple-silicon
  - multimodal
  - vision
  - cerebras
  - turboquant
  - kv-cache-compression
  - long-context
  - ravenx
  - tool-calling
  - function-calling
  - ollama
base_model: 0xSero/gemma-4-21b-a4b-it-REAP
base_model_relation: quantized
language:
  - en
---

# Gemma 4 21B REAP — Tool Calling ✅ | 103 Experts | MLX 4-bit | Apple Silicon

> **A large open Gemma 4 MoE for Apple Silicon. 21B total parameters after REAP pruning, 4 active experts per token, native tool calling, 12 GB in MLX 4-bit. TurboQuant ready.**


**[0xSero/gemma-4-21b-a4b-it-REAP](https://huggingface.co/0xSero/gemma-4-21b-a4b-it-REAP)** converted to **MLX 4-bit** (affine, group_size=64) for native Apple Silicon inference.

This is **Gemma 4 27B MoE** pruned with [REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999) to **103 experts, of which 4 are active per token**. The result has ~21B total parameters but activates only a fraction of them per forward pass, combining large capacity with fast inference.

> 🖤 **12 GB MLX 4-bit** — runs on any M-series Mac with 24GB+ unified memory.  
> Multimodal: text + vision. 131K context window.

## What is REAP?

REAP (Router-weighted Expert Activation Pruning) is a technique from Cerebras that prunes MoE experts using routing statistics. It scores each expert by how much the router actually relies on it and removes the lowest-scoring experts, while the number of experts routed per token stays the same. For this model that means:

- **A smaller expert pool** (103 experts remain, 4 active per token)
- **Smaller memory footprint**, and faster inference where memory bandwidth is the bottleneck
- **Reported scores after pruning**: BoolQ accuracy 76%, HellaSwag 46% (see evals below)
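
To make the pruning idea concrete, here is a minimal pure-Python sketch. It assumes a simplified criterion based on the REAP paper's description (router gate values combined with expert activation norms) and softmax gating over the selected top-k experts. The function names and the toy numbers are hypothetical, and this is not the authors' implementation.

```python
import math


def top_k_route(logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def saliency_scores(records, num_experts):
    """Average gate * output-norm per expert.

    `records` holds one (expert_index, gate_weight, output_norm) tuple for
    every (token, expert) pair the router selected on calibration data.
    """
    sums = [0.0] * num_experts
    counts = [0] * num_experts
    for idx, gate, norm in records:
        sums[idx] += gate * norm
        counts[idx] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def experts_to_keep(scores, keep):
    """Indices of the `keep` most salient experts; the rest would be pruned."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: 3 experts, keep the 2 the router relies on most.
scores = saliency_scores([(0, 0.5, 2.0), (0, 0.5, 4.0), (2, 1.0, 0.1)], 3)
print(experts_to_keep(scores, 2))
```

Note that pruning shrinks the pool that `top_k_route` chooses from; `k` itself is untouched.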

## Model Details

| Property | Value |
|----------|-------|
| **Base model** | 0xSero/gemma-4-21b-a4b-it-REAP |
| **Original base** | google/gemma-4-27b-it (MoE) |
| **Architecture** | Gemma4ForConditionalGeneration (MoE) |
| **Total parameters** | ~21B |
| **Total experts** | 103 (after REAP pruning) |
| **Active experts/token** | 4 |
| **Modalities** | Text · Vision |
| **Quantization** | 4-bit affine, group_size=64, ~4.8 bits/weight |
| **File size** | **12 GB** (down from ~40 GB bf16) |
| **Context window** | 131,072 tokens |
| **Vocab size** | 262,144 |
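
The size figures in the table can be sanity-checked with a few lines of arithmetic. This sketch assumes affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights. The gap between the resulting 4.5 bits and the reported ~4.8 bits/weight average is presumably due to tensors kept at higher precision, which is an assumption rather than something stated by the conversion tool.

```python
def affine_bits_per_weight(bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Effective bits per weight for group-wise affine quantization."""
    return bits + (scale_bits + bias_bits) / group_size


def size_gb(num_params, bits_per_weight):
    """Approximate on-disk size in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


print(affine_bits_per_weight())        # bits/weight for the quantized tensors alone
print(round(size_gb(21e9, 4.8), 1))    # ~21B params at the reported 4.8 bits/weight
print(round(size_gb(21e9, 16), 1))     # the same params in bf16
```

Both results land close to the 12 GB and ~40 GB figures listed above.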

## Evaluation (from source model)

| Benchmark | Score |
|-----------|------:|
| BoolQ | 76% |
| HellaSwag | 46% |
| ARC-Challenge | 28% |

## Performance (Apple Silicon)

| Chip | RAM | Tokens/sec (estimated) |
|------|-----|------------------------|
| M4 Max | 128 GB | ~20–30 tok/s |
| M3 Ultra | 192 GB | ~25–35 tok/s |
| M2 Ultra | 192 GB | ~18–25 tok/s |

> Requires at least **24GB unified memory**. 32GB+ recommended for comfortable operation.

## Quickstart

### Install

```bash
pip install mlx-lm mlx-vlm
```

### Text generation

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

messages = [{"role": "user", "content": "Explain mixture-of-experts models simply."}]
prompt = apply_chat_template(processor, model.config, messages)
response = generate(model, processor, prompt=prompt, max_tokens=512, verbose=True)
```

### Vision (image + text)

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

# Images go to generate() separately; the chat template only needs their count.
images = ["https://example.com/photo.jpg"]  # URL or local file path
prompt = apply_chat_template(
    processor, model.config, "Describe this image.", num_images=len(images)
)
response = generate(model, processor, prompt=prompt, image=images, max_tokens=512)
```

### CLI

```bash
mlx_vlm.generate \
  --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit \
  --prompt "What are the key differences between MoE and dense transformer models?" \
  --max-tokens 512
```


## ⚡ TurboQuant-MLX — 4.6x KV Cache Compression

Pair this model with **[TurboQuant-MLX](https://github.com/DeadByDawn101/turboquant-mlx)**, RavenX AI's KV cache compression for Apple Silicon. It compresses the KV cache with PolarQuant plus QJL residuals, so roughly **4.6x more context** fits in the same cache memory, with near-zero accuracy loss reported by the project.

```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

# Patch mlx-lm so every layer gets a TurboQuant-compressed KV cache
cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

# Load through mlx_lm so generation uses the patched cache factory;
# the patch targets mlx_lm's cache module, which mlx_vlm may not go through
from mlx_lm import load, generate
model, tokenizer = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
```

| Without TurboQuant | With TurboQuant |
|---|---|
| 8K context @ 12 GB | 36K context @ ~12 GB |
| KV cache grows linearly with context | KV cache still grows linearly, at roughly 1/4.6 the rate |
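
The relationship between compression ratio and usable context can be sketched with the standard KV cache size formula. The layer count, KV head count, and head dimension below are hypothetical placeholders, not this model's actual configuration, and the sketch ignores sliding-window layers, which would lower the real numbers.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Uncompressed KV cache size: keys + values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def tokens_in_budget(budget_bytes, layers, kv_heads, head_dim,
                     bytes_per_value=2, compression=1.0):
    """How many tokens fit in `budget_bytes` at a given compression ratio."""
    per_token = kv_cache_bytes(1, layers, kv_heads, head_dim, bytes_per_value)
    return int(budget_bytes // (per_token / compression))


# Hypothetical dimensions, chosen only to make the arithmetic visible.
dims = dict(layers=32, kv_heads=8, head_dim=128)
budget = kv_cache_bytes(8192, **dims)  # memory used by an 8K fp16 cache

print(tokens_in_budget(budget, **dims))                   # uncompressed
print(tokens_in_budget(budget, **dims, compression=4.6))  # with 4.6x compression
```

With the same memory budget, the compressed cache holds about 4.6 times as many tokens, which is where the jump from 8K to roughly 36K comes from.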

→ [TurboQuant-MLX on GitHub](https://github.com/DeadByDawn101/turboquant-mlx) · [Release v2.0](https://github.com/DeadByDawn101/turboquant-mlx/releases/tag/v2.0.0)

## 🧠 Opus Reasoning + Claude Code LoRA

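Supercharge this model with the **[Opus Reasoning + Claude Code LoRA](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora)**, trained on Claude Opus 4.6 reasoning traces and Claude Code tool-use patterns. The adapter's name indicates it was trained against Gemma 4 E4B, so check that its tensor shapes match this 21B MoE before relying on it.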

Apply it to get structured `<think>`-tag chain-of-thought reasoning and agentic tool-use behavior:

```python
from mlx_vlm import load, generate

model, processor = load(
    "deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit",
    adapter_path="deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora",
)
```

| What it adds | Detail |
|---|---|
| **Reasoning style** | `<think>` tag chain-of-thought before every answer |
| **Training data** | Claude Opus 4.6 reasoning traces (2,054 examples) |
| **Tool-use patterns** | 140 Claude Code agentic pattern files |
| **Size** | 658 MB adapter on top of base model |

→ [View the adapter repo](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora)


## 💻 Gemini CLI — Coding Agent + Tool Orchestration

We use **[RavenX AI's Gemini CLI fork](https://github.com/DeadByDawn101/gemini-cli)** as the coding agent and tool orchestration layer on top of these models. The CLI is the component that executes the tool calls the model emits.

Gemini CLI gives you a full agentic loop in the terminal: file read/write, shell execution, web fetching, and MCP server support. Google Search grounding and the 1M-token context window are features of the Gemini API backend; against the local server, context is bounded by this model's 131,072-token window.

```bash
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against Gemini API (free tier: 60 req/min)
gemini
```

### What Gemini CLI + these models unlock together

| Capability | How |
|---|---|
| **Code generation** | Gemini CLI reads your codebase, model reasons with `<think>` tags |
| **Tool calling** | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| **Long context** | 131K-token model context; TurboQuant 4.6x KV compression lets more of it fit in memory |
| **MCP servers** | Connect any MCP server — databases, APIs, custom tools |
| **Search grounding** | Google Search built in — model gets live data |

```bash
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```

→ [DeadByDawn101/gemini-cli on GitHub](https://github.com/DeadByDawn101/gemini-cli) — Apache 2.0, free tier, MCP-compatible

## 🛠️ Tool Calling (Function Calling)

**Gemma 4 has native tool calling built into its chat template**, using the `<|tool>`, `<|tool_call>`, and `<|tool_response>` special tokens, so no custom prompt format is needed.

### Define tools and call them

```python
from mlx_lm import load, generate
import json

model, tokenizer = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

messages = [{"role": "user", "content": "What's the weather in San Jose, CA?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
# Model responds with a structured tool_call in <|tool_call>...<tool_call|> format
```

### Parse tool calls and feed results back

```python
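# Continues from the block above (reuses model, tokenizer, tools, messages).
# After running the tool yourself, append the call and its result, then generate again.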
messages += [
    {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": {"location": "San Jose, CA"}}}]},
    {"role": "tool", "tool_responses": [{"name": "get_weather", "response": {"temp": 72, "condition": "sunny"}}]}
]
prompt = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
final = generate(model, tokenizer, prompt=prompt, max_tokens=256)
```
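
Extracting the call from the raw model output can be sketched as below. It assumes the output follows the `<|tool_call>call:name{args}<tool_call|>` shape listed in the token table of this card. The argument syntax inside the braces is not specified here, so the sketch returns it as a raw string instead of guessing at a parser, and the sample text is made up for illustration.

```python
import re

# Matches <|tool_call>call:NAME{ARGS}<tool_call|>; ARGS is captured verbatim.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call>call:([\w.\-]+)\{(.*?)\}<tool_call\|>", re.DOTALL
)


def extract_tool_calls(text):
    """Return (tool_name, raw_argument_string) pairs found in `text`."""
    return [(m.group(1), m.group(2).strip()) for m in TOOL_CALL_RE.finditer(text)]


# Hypothetical model output, for illustration only.
sample = 'Let me check. <|tool_call>call:get_weather{location: "San Jose, CA"}<tool_call|>'
print(extract_tool_calls(sample))
```

Once the name is known, dispatch to your own function, then pass its result back as a `tool` message as in the example above.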

### With mlx_vlm (multimodal + tools)

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
# `messages` and `tools` are the ones defined in the mlx_lm example above
prompt = apply_chat_template(
    processor, model.config, messages,
    tools=tools, add_generation_prompt=True
)
```

### Tool token format (native)

| Token | Purpose |
|-------|---------|
| `<\|tool>...<tool\|>` | Tool definition block |
| `<\|tool_call>call:name{args}<tool_call\|>` | Model calls a tool |
| `<\|tool_response>...<tool_response\|>` | Result returned to model |


## 🦙 Ollama — One-Command Setup

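### Instant run (no manual download needed)

Ollama's `hf.co/...` shortcut pulls GGUF weights, so this path needs a GGUF build of the model. The MLX safetensors in this repo are served by `mlx_lm.server`, covered at the end of this section.
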
```bash
ollama run hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit
```

### With a custom system prompt + tool support
Create a `Modelfile`:
```
FROM hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit

SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."

PARAMETER temperature 0.7
PARAMETER num_ctx 8192
```

```bash
ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4
```

### OpenAI-compatible endpoint
```bash
# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Run with mlx_lm server (native MLX on Apple Silicon)
```bash
# mlx_lm.server runs the MLX weights in this repo natively, with no GGUF conversion
mlx_lm.server --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --port 8080

# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
```
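
The same request can be issued from Python using only the standard library. This is a minimal sketch that assumes the server from the command above is listening on port 8080. The helper names `build_chat_request` and `chat` are hypothetical, and whether a `tools` field is forwarded to the chat template depends on the installed mlx_lm version.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, tools=None, max_tokens=256):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    if tools:
        payload["tools"] = tools  # only included when tools are supplied
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(request, timeout=120):
    """Send the request and return the first choice's message dict."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]


request = build_chat_request(
    "http://localhost:8080",
    "deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit",
    [{"role": "user", "content": "Hello!"}],
)
# print(chat(request)["content"])  # uncomment once the server is running
```

Pointing `base_url` at `http://localhost:11434` targets Ollama's endpoint instead, since both expose the same route.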

## Conversion Details

- **Source:** `0xSero/gemma-4-21b-a4b-it-REAP` (bfloat16, ~40 GB)
- **Tool:** `mlx_vlm.convert` with `--q-bits 4 --q-group-size 64 --q-mode affine`
- **Result:** ~4.8 bits/weight average, 12 GB output
- **Platform:** Apple M4 Max 128GB

## Related Models

| Model | Size | Description |
|-------|------|-------------|
| [deadbydawn101/gemma-4-E4B-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) | 4.86 GB | Standard 4B dense MLX |
| [deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit) | 3.34 GB | 2B abliterated MLX |
| [0xSero/gemma-4-21b-a4b-it-REAP](https://huggingface.co/0xSero/gemma-4-21b-a4b-it-REAP) | 40 GB | Source bf16 |

## License

[Gemma Terms of Use](https://ai.google.dev/gemma/docs/gemma_4_license): see the linked terms for the conditions that apply to research and commercial use.

---

*Converted by [deadbydawn101](https://huggingface.co/deadbydawn101) · RavenX AI*


## TriAttention KV Compression

> **[2026-04-09] Our MLX port was merged into [TriAttention](https://github.com/WeianMao/triattention) (MIT + NVIDIA) — PR #1 by [@DeadByDawn101](https://github.com/DeadByDawn101) (RavenX AI).**

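TriAttention reports **10.7x KV memory reduction** and **2.5x throughput**. It compresses the KV cache, which is separate from this model's 4-bit affine weight quantization. Stacked with TurboQuant's 4.6x KV cache compression, the combined KV reduction is roughly 10.7 × 4.6 ≈ 49x (~50x) relative to an uncompressed fp16 cache: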

```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```

## RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

```bash
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --triattention
```