---
library_name: mlx
license: gemma
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: image-text-to-text
tags:
- mlx
- safetensors
- gemma4
- moe
- pruning
- reap
- expert-pruning
- 4-bit
- quantized
- apple-silicon
- multimodal
- vision
- cerebras
- turboquant
- kv-cache-compression
- long-context
- ravenx
- tool-calling
- function-calling
- ollama
base_model: 0xSero/gemma-4-21b-a4b-it-REAP
base_model_relation: quantized
language:
- en
---

# Gemma 4 21B REAP: Tool Calling ✅ | 103 Experts | MLX 4-bit | Apple Silicon

> **The biggest open Gemma 4 on Apple Silicon. 21B MoE, REAP-pruned, 4 active experts per token, native tool calling, 12 GB MLX 4-bit. TurboQuant ready.**

**[0xSero/gemma-4-21b-a4b-it-REAP](https://huggingface.co/0xSero/gemma-4-21b-a4b-it-REAP)** converted to **MLX 4-bit** (affine, group_size=64) for native Apple Silicon inference.

This is **Gemma 4 27B MoE** pruned with [REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999) to 103 experts, with **4 active experts per token**. The result has ~21B total parameters but activates only a fraction of them per forward pass, combining large capacity with fast inference.

> 🖤 **12 GB MLX 4-bit**: runs on any M-series Mac with 24GB+ unified memory.
> Multimodal: text + vision. 131K context window.

## What is REAP?

REAP (Router-weighted Expert Activation Pruning) is a technique from Cerebras that prunes MoE experts by analyzing routing patterns.
Instead of activating many experts per token, REAP identifies which experts are actually essential and prunes the rest, resulting in:

- **Fewer experts activated per token** (4 active out of 103 total)
- **Faster inference** due to reduced compute per forward pass
- **Minimal quality loss**: BoolQ accuracy 76%, HellaSwag 46% (see evals below)

## Model Details

| Property | Value |
|----------|-------|
| **Base model** | 0xSero/gemma-4-21b-a4b-it-REAP |
| **Original base** | google/gemma-4-27b-it (MoE) |
| **Architecture** | Gemma4ForConditionalGeneration (MoE) |
| **Total parameters** | ~21B |
| **Total experts** | 103 |
| **Active experts/token** | 4 (REAP-pruned) |
| **Modalities** | Text · Vision |
| **Quantization** | 4-bit affine, group_size=64, ~4.8 bits/weight |
| **File size** | **12 GB** (down from ~40 GB bf16) |
| **Context window** | 131,072 tokens |
| **Vocab size** | 262,144 |

## Evaluation (from source model)

| Benchmark | Score |
|-----------|------:|
| BoolQ | 76% |
| HellaSwag | 46% |
| ARC-Challenge | 28% |

## Performance (Apple Silicon)

| Chip | RAM | Tok/sec (est) |
|------|-----|--------------|
| M4 Max 128GB | 128GB | ~20–30 tok/s |
| M3 Ultra 192GB | 192GB | ~25–35 tok/s |
| M2 Ultra 192GB | 192GB | ~18–25 tok/s |

> Requires at least **24GB unified memory**. 32GB+ recommended for comfortable operation.
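The file size in the Model Details table can be sanity-checked with a little arithmetic. The sketch below assumes that each group of 64 weights stores one 16-bit scale and one 16-bit bias alongside its 4-bit values; that layout is an assumption about group-wise affine quantization, not something stated in this card, and the helper names are purely illustrative.

```python
def affine_bits_per_weight(q_bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Effective bits per weight for group-wise affine quantization.

    Assumes one scale and one bias of `meta_bits` bits each per group.
    """
    return q_bits + 2 * meta_bits / group_size


def size_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage in decimal gigabytes for `n_params` weights."""
    return n_params * bits_per_weight / 8 / 1e9


# Bits per weight for a layer quantized with the settings from this card.
quantized_layers = affine_bits_per_weight(q_bits=4, group_size=64)
print(f"quantized layers: {quantized_layers} bits/weight")

# Rough sizes for ~21B parameters at the card's average and at bf16.
print(f"~21B params at 4.8 bits/weight: {size_gb(21e9, 4.8):.1f} GB")
print(f"~21B params at 16 bits/weight:  {size_gb(21e9, 16):.1f} GB")
```

Under these assumptions a fully quantized layer costs 4.5 bits per weight, slightly below the ~4.8 average quoted above, which would be consistent with some tensors being kept at higher precision. Because the ~21B parameter count is itself approximate, the estimates land near the 12 GB and ~40 GB figures rather than exactly on them.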
## Quickstart

### Install

```bash
pip install mlx-lm mlx-vlm
```

### Text generation

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

messages = [{"role": "user", "content": "Explain mixture-of-experts models simply."}]
prompt = apply_chat_template(processor, model.config, messages)

response = generate(model, processor, prompt=prompt, max_tokens=512, verbose=True)
```

### Vision (image + text)

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/photo.jpg"},
        {"type": "text", "text": "Describe this image."}
    ]
}]
prompt = apply_chat_template(processor, model.config, messages, add_generation_prompt=True)

response = generate(model, processor, prompt=prompt, max_tokens=512)
```

### CLI

```bash
mlx_vlm.generate \
  --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit \
  --prompt "What are the key differences between MoE and dense transformer models?" \
  --max-tokens 512
```

## ⚡ TurboQuant-MLX: 4.6x KV Cache Compression

Pair this model with **[TurboQuant-MLX](https://github.com/DeadByDawn101/turboquant-mlx)**, RavenX AI's KV cache compression for Apple Silicon. Run **4.6x longer contexts** with near-zero accuracy loss by compressing the KV cache using PolarQuant + QJL residuals.
```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

# Patch mlx-lm to use TurboQuant compression: one compressed cache per layer
cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

# Now load and run as normal; context is compressed automatically
from mlx_vlm import load, generate

model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
```

| Without TurboQuant | With TurboQuant |
|---|---|
| 8K context @ 12 GB | 36K context @ ~12 GB |
| KV cache grows linearly | KV cache stays compressed |

→ [TurboQuant-MLX on GitHub](https://github.com/DeadByDawn101/turboquant-mlx) · [Release v2.0](https://github.com/DeadByDawn101/turboquant-mlx/releases/tag/v2.0.0)

## 🧠 Opus Reasoning + Claude Code LoRA

Supercharge this model with the **[Opus Reasoning + Claude Code LoRA](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora)**, trained on Claude Opus 4.6 reasoning traces and Claude Code tool-use patterns. Apply it to get structured, tag-delimited chain-of-thought reasoning and agentic tool-use behavior:

```python
from mlx_vlm import load, generate

model, processor = load(
    "deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit",
    adapter_path="deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora",
)
```

| What it adds | Detail |
|---|---|
| **Reasoning style** | Tag-delimited chain-of-thought before every answer |
| **Training data** | Claude Opus 4.6 reasoning traces (2,054 examples) |
| **Tool-use patterns** | 140 Claude Code agentic pattern files |
| **Size** | 658 MB adapter on top of base model |

→ [View the adapter repo](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora)

## 💻 Gemini CLI: Coding Agent + Tool Orchestration

We use **[RavenX AI's Gemini CLI fork](https://github.com/DeadByDawn101/gemini-cli)** as the coding agent and tool orchestration layer on top of these models.
This is what makes the tool-calling capability real in production. Gemini CLI gives you a full agentic loop in the terminal: Google Search grounding, file read/write, shell execution, web fetching, and MCP server support, all wired to a 1M token context window.

```bash
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against Gemini API (free tier: 60 req/min)
gemini
```

### What Gemini CLI + these models unlock together

| Capability | How |
|---|---|
| **Code generation** | Gemini CLI reads your codebase, model reasons in tag-delimited chain-of-thought |
| **Tool calling** | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| **Long context** | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| **MCP servers** | Connect any MCP server: databases, APIs, custom tools |
| **Search grounding** | Google Search built in, so the model gets live data |

```bash
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```

→ [DeadByDawn101/gemini-cli on GitHub](https://github.com/DeadByDawn101/gemini-cli): Apache 2.0, free tier, MCP-compatible

## 🛠️ Tool Calling (Function Calling)

**Gemma 4 has native tool calling built into its chat template.** Many models on Hugging Face lack this; Gemma 4 supports it through the `<|tool>`, `<|tool_call>`, and `<|tool_response>` special tokens.
### Define tools and call them

```python
from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

# JSON-schema description of the tools the model may call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

messages = [{"role": "user", "content": "What's the weather in San Jose, CA?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
# Model responds with a structured tool_call in <|tool_call>... format
```

### Parse tool calls and feed results back

```python
# After tool execution, feed the result back
messages += [
    {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": {"location": "San Jose, CA"}}}]},
    {"role": "tool", "tool_responses": [{"name": "get_weather", "response": {"temp": 72, "condition": "sunny"}}]}
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
final = generate(model, tokenizer, prompt=prompt, max_tokens=256)
```

### With mlx_vlm (multimodal + tools)

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

# Reuses the `messages` and `tools` defined in the examples above
prompt = apply_chat_template(
    processor, model.config, messages,
    tools=tools,
    add_generation_prompt=True
)
```

### Tool token format (native)

| Token | Purpose |
|-------|---------|
| `<\|tool>...` | Tool definition block |
| `<\|tool_call>call:name{args}` | Model calls a tool |
| `<\|tool_response>...` | Result returned to model |

## 🦙 Ollama: One-Command Setup

### Instant run (no manual model download)

```bash
ollama run hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit
```

### With a custom system prompt + tool support

Create a `Modelfile`:

```
FROM hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit

SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."

PARAMETER temperature 0.7
PARAMETER num_ctx 8192
```

```bash
ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4
```

### OpenAI-compatible endpoint

```bash
# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Run with mlx_lm server (native, faster on Apple Silicon)

```bash
# mlx_lm server is faster than Ollama on Apple Silicon: it runs the MLX weights directly on the Metal GPU
mlx_lm.server --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --port 8080

# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
```

## Conversion Details

- **Source:** `0xSero/gemma-4-21b-a4b-it-REAP` (bfloat16, ~40 GB)
- **Tool:** `mlx_vlm.convert` with `--q-bits 4 --q-group-size 64 --q-mode affine`
- **Result:** ~4.8 bits/weight average, 12 GB output
- **Platform:** Apple M4 Max 128GB

## Related Models

| Model | Size | Description |
|-------|------|-------------|
| [deadbydawn101/gemma-4-E4B-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) | 4.86 GB | Standard 4B dense MLX |
| [deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit) | 3.34 GB | 2B abliterated MLX |
| [0xSero/gemma-4-21b-a4b-it-REAP](https://huggingface.co/0xSero/gemma-4-21b-a4b-it-REAP) | 40 GB | Source bf16 |

## License

[Gemma Terms of Use](https://ai.google.dev/gemma/docs/gemma_4_license): free for research and commercial use with attribution.

---

*Converted by [deadbydawn101](https://huggingface.co/deadbydawn101) · RavenX AI*

## TriAttention KV Compression

> **[2026-04-09] Our MLX port was merged into [TriAttention](https://github.com/WeianMao/triattention) (MIT + NVIDIA) as PR #1 by [@DeadByDawn101](https://github.com/DeadByDawn101) (RavenX AI).**

Apply **10.7x KV memory reduction** and **2.5x throughput** to this 4-bit model. Stacked with TurboQuant's 4.6x KV cache compression, that is ~50x combined KV compression vs full fp16:

```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```

## RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

```bash
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --triattention
```
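
The local servers described in this card expose an OpenAI-compatible chat-completions route, so any HTTP client can talk to them. Below is a minimal Python sketch that uses only the standard library. It assumes a server is listening on `http://localhost:8080` (the port from the `mlx_lm.server` example; the port used by the harness's `serve.py` is not stated here), and `build_chat_request` and `chat` are illustrative helper names, not part of any library.

```python
import json
import urllib.request

MODEL = "deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit"
BASE_URL = "http://localhost:8080"  # assumption: port from the mlx_lm.server example


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("Hello!"))
```

Keeping request construction separate from sending makes it easy to inspect the payload before a server is up, and to point the same code at Ollama by passing `base_url="http://localhost:11434"`.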