Gemma 4 31B Agent v6 - MLX

The first local model to match Claude Opus on real-world autonomous tasks

🏆 10/10 Bioinformatics · 🏆 10/10 DevOps · 📊 6/10 Data Engineering
⚡ $0 cost · 16 GB MLX q4 · Runs on Apple Silicon

What This Is

A fine-tuned Gemma 4 31B (q4 MLX) trained on a curated dataset emphasizing:

  • Multi-turn agent trajectories (43%): 10-20 step sequences with error recovery
  • Error recovery patterns (11%): "command fails → adapt → retry"
  • Bash/CLI (17%): real shell commands
  • Tool-calling format (24%): JSON tool use
  • Reasoning (10%): OODA loop, first principles

Why This Exists: The Benchmark Trap

We found that a 95% score on BFCL (Berkeley Function Calling Leaderboard) can coexist with 0% real agent capability.

A fine-tuned E4B model scored 95.50% on BFCL but 0/10 on our autonomous Docker challenge: it entered infinite loops and could not recover from errors. The unfine-tuned base scored 6/10.

Standard benchmarks test format. Real work tests reasoning.
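
The infinite-loop failure described above can be flagged mechanically by watching for repeated identical commands. The following is a minimal sketch, assuming a trajectory is available as a list of issued command strings; `is_stuck`, `window`, and `max_repeats` are hypothetical names for illustration and are not part of the benchmark harness.

```python
from collections import Counter


def is_stuck(commands, window=6, max_repeats=3):
    """Return True if any single command appears at least `max_repeats`
    times within the last `window` commands of the trajectory."""
    recent = commands[-window:]
    if not recent:
        return False
    # Count of the most frequent command in the recent window
    most_common_count = Counter(recent).most_common(1)[0][1]
    return most_common_count >= max_repeats


# An agent that keeps retrying the same failing command
looping = ["docker ps", "apt install curl", "apt install curl", "apt install curl"]
# An agent that adapts after the first failure
recovering = ["apt install curl", "pip install requests", "python fetch.py"]

print(is_stuck(looping))     # True
print(is_stuck(recovering))  # False
```

A check like this only catches exact repeats; a model that loops with small variations would need a fuzzier comparison.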

Real-World Agent Benchmark Results

4 challenges × 30 min each, autonomous inside Docker containers:

| Challenge | Claude Opus 4.6 (Cloud, $4.72) | This Model (Local, $0) | Comments |
|---|---|---|---|
| 1. Bioinformatics: download P53 from UniProt, parse JSON, extract structure + mutations, HTML report | 9/10 · 13 turns · ~21K input · 13.5K output · $1.33 | 10/10 · 9 turns · ~5K input · 1.2K output · FREE | Both fail at the same JSON parsing bug. 31B produces a larger report (121 KB). |
| 2. Security CTF: deploy DVWA, exploit SQLi + XSS + command injection | 0/10 · 64 turns · $1.59 | 1/10 · 30 turns · FREE | Container lacks sudo. Both fail. Test issue, not model. |
| 3. Data Engineering: NYC taxi pipeline (download, clean, analytics, Chart.js dashboard) | 9/10 · 19 turns · 12 charts, dark theme · $1.17 | 6/10 · 7 turns · 3 charts, basic CSS · FREE | Both have correct data. Opus wins on presentation quality. |
| 4. DevOps: Flask + Nginx + Prometheus + health check + status page | 10/10 · 26 turns · 3 services running · $0.63 | 10/10 · 19 turns · 1 service running · FREE | Opus: full infra live. 31B: configs correct, execution limited by permissions. |
| TOTAL | 28/40 · 122 turns · $4.72 | 27/40 · 65 turns · $0.00 | 31B uses 75% fewer tokens; 5x slower (local GPU) |
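
The totals can be re-derived from the per-challenge figures. This is a small sketch that only re-adds the numbers reported above; the tuple layout and variable names are illustrative.

```python
# (challenge, opus_score, opus_turns, opus_cost_usd, local_score, local_turns)
results = [
    ("Bioinformatics",    9, 13, 1.33, 10,  9),
    ("Security CTF",      0, 64, 1.59,  1, 30),
    ("Data Engineering",  9, 19, 1.17,  6,  7),
    ("DevOps",           10, 26, 0.63, 10, 19),
]

opus_score = sum(r[1] for r in results)
opus_turns = sum(r[2] for r in results)
opus_cost = round(sum(r[3] for r in results), 2)  # round away float noise
local_score = sum(r[4] for r in results)
local_turns = sum(r[5] for r in results)

max_score = 10 * len(results)
print(f"Opus:  {opus_score}/{max_score}, {opus_turns} turns, ${opus_cost:.2f}")
print(f"Local: {local_score}/{max_score}, {local_turns} turns, $0.00")
```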

SWE-bench Lite (30 problems, simplified patch matching)

| Model | Accuracy | Time | Cost |
|---|---|---|---|
| Claude Opus 4.6 | 10% (3/30) | 43 min | ~$15 |
| This model | 10% (3/30) | 105 min | $0 |

Same accuracy. This model was not trained on coding tasks; the result is included as a reference baseline.

The Journey: From 0/10 to 10/10

| Model | Size | BFCL | Agent Score | What Happened |
|---|---|---|---|---|
| E4B v3 (BFCL fine-tune) | 4.5 GB | 95.50% | 0/10 | Infinite loop. "The Benchmark Trap" |
| E4B Base | 3.8 GB | 80.25% | 6/10 | Works but shallow attention |
| E4B v5 (reasoning) | 4 GB | TBD | 7/10 | Better reasoning but stops early |
| E4B v6 (multi-turn) | 4 GB | TBD | 0/10 | 42 layers can't sustain attention |
| 31B Base q4 | 16 GB | 92.25% | 9/10 | Already capable; 60 layers help |
| 31B v6 (this model) | 16 GB | TBD | 10/10 | Fine-tune improves quality, not just capability |

Key insight: 4.5B params (42 layers, 8 heads) can't sustain multi-turn agent reasoning. 31B (60 layers, 16 heads) can. The fine-tune adds error recovery and persistence, but the base architecture must be large enough.

Usage

With MLX (Apple Silicon)

from mlx_lm import load, generate

model, tokenizer = load("KikoCis/gemma-4-31b-agent-v6-MLX")

prompt = """You are an autonomous agent. To run commands: {"name": "bash", "arguments": "command"}

TASK: Download and analyze protein P53 from UniProt.
Begin."""

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)

As Agent (with agent_runner.py)

# Start MLX server
python3 -m mlx_lm.server --model KikoCis/gemma-4-31b-agent-v6-MLX --port 8095

# Run agent
python3 agent_runner.py \
  --api-url http://localhost:8095/v1 \
  --model gemma4-31b-v6 \
  --prompt "Your task here..."
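
The internals of agent_runner.py are not shown here. As a rough illustration of what one step of such a loop might involve, the sketch below parses a tool call in the `{"name": "bash", "arguments": "command"}` shape used in the prompt above, runs it, and returns the output to feed into the next turn. The function names `parse_tool_call`, `run_shell`, and `agent_step` are hypothetical, and the real runner may work differently.

```python
import json
import subprocess


def parse_tool_call(text):
    """Return the first JSON object in `text` that has 'name' and
    'arguments' keys, or None if the model emitted no tool call."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text[start:])
        except json.JSONDecodeError:
            continue  # not valid JSON at this brace; keep scanning
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            return obj
    return None


def run_shell(command, timeout=60):
    """Run a command in the system shell and return exit code plus output.
    Assumption: the agent runs in a sandboxed container."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"


def agent_step(model_output):
    """Execute one tool call; return None when the model is done."""
    call = parse_tool_call(model_output)
    if call is None or call["name"] != "bash":
        return None
    return run_shell(call["arguments"])


observation = agent_step(
    'I will check the shell. {"name": "bash", "arguments": "echo hello"}'
)
print(observation)
```

Returning the exit code alongside stdout and stderr gives the model the signal it needs for the "command fails → adapt → retry" pattern the dataset emphasizes.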

Training Details

  • Base: Gemma 4 31B IT (q4 MLX quantization)
  • Method: LoRA rank 8, all 60 layers, mask_prompt
  • Dataset: 17,396 examples (proprietary, not published)
    • 43% multi-turn agent trajectories
    • 17% bash/CLI
    • 11% error recovery
    • 10% reasoning (OODA, first principles)
    • 24% tool-calling format
  • Training: 1000 iterations, batch_size=2, lr=3e-5, grad_checkpoint
  • Hardware: consumer hardware
  • Val loss: 2.263 → 0.378
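
To put the training figures above in perspective, the sketch below works out how much of the dataset the run covered. It assumes each iteration processes exactly one batch of `batch_size` examples with no gradient accumulation, which the source does not state explicitly.

```python
dataset_size = 17_396   # examples, from the list above
iterations = 1_000
batch_size = 2

# Assumption: one batch per iteration, no gradient accumulation
examples_seen = iterations * batch_size
epoch_fraction = examples_seen / dataset_size

val_loss_start, val_loss_end = 2.263, 0.378
loss_reduction = (val_loss_start - val_loss_end) / val_loss_start

print(f"Examples seen: {examples_seen} ({epoch_fraction:.1%} of one epoch)")
print(f"Validation loss reduction: {loss_reduction:.1%}")
```

Under that assumption the run covers well under one full pass over the data.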

Limitations

  • Not trained on coding: SWE-bench at 10% is a reference baseline, not an optimized result
  • Slow on local GPU: ~30s per response on M4 Max (vs ~2s for Claude API)
  • No web search: can't look up documentation when stuck (planned for v7)
  • Permissions issues: non-root containers limit what the agent can install/configure
  • MLX format only: GGUF conversion pending (Gemma 4 PLE architecture has converter quirks)

Citation

@misc{cisneros2026benchmarktrap,
  title={The Benchmark Trap: How 95\% BFCL Produces a 0\% Agent},
  author={Cisneros, Kiko and Claude Opus 4.6},
  year={2026},
  publisher={Utopia IA},
  url={https://huggingface.co/KikoCis/gemma-4-31b-agent-v6-MLX}
}
