Gemma 4 31B Agent v6 — MLX

The first local model to match Claude Opus on real-world autonomous tasks

🏆 10/10 Bioinformatics	🏆 10/10 DevOps	📊 6/10 Data Engineering
⚡ $0 cost · 16 GB MLX q4 · Runs on Apple Silicon

What This Is

A fine-tuned Gemma 4 31B (q4 MLX) trained on a curated dataset emphasizing:

Multi-turn agent trajectories (43%) — 10-20 step sequences with error recovery
Error recovery patterns (11%) — "command fails → adapt → retry"
Bash/CLI (17%) — real shell commands
Tool-calling format (24%) — JSON tool use
Reasoning (10%) — OODA loop, first principles

Why This Exists: The Benchmark Trap

We discovered that 95% BFCL (Berkeley Function Calling Leaderboard) = 0% real agent capability.

A fine-tuned E4B model scored 95.50% on BFCL but 0/10 on our autonomous Docker challenge — entering infinite loops, unable to recover from errors. The unfine-tuned base scored 6/10.

Standard benchmarks test format. Real work tests reasoning.

Real-World Agent Benchmark Results

4 challenges × 30 min each, autonomous inside Docker containers:

Challenge	Claude Opus 4.6 (Cloud, $4.72)	This Model (Local, $0)	Comments
1. Bioinformatics Download P53 from UniProt, parse JSON, extract structure + mutations, HTML report	9/10 · 13 turns ~21K input · 13.5K output $1.33	10/10 · 9 turns ~5K input · 1.2K output FREE	Both fail at same JSON parsing bug. 31B produces larger report (121 KB)
2. Security CTF Deploy DVWA, exploit SQLi + XSS + command injection	0/10 · 64 turns $1.59	1/10 · 30 turns FREE	Container lacks sudo. Both fail. Test issue, not model.
3. Data Engineering NYC taxi pipeline: download, clean, analytics, Chart.js dashboard	9/10 · 19 turns 12 charts, dark theme $1.17	6/10 · 7 turns 3 charts, basic CSS FREE	Both have correct data. Opus wins on presentation quality.
4. DevOps Flask + Nginx + Prometheus + health check + status page	10/10 · 26 turns 3 services running $0.63	10/10 · 19 turns 1 service running FREE	Opus: full infra live. 31B: configs correct, exec limited by permissions.
TOTAL	28/40 122 turns · $4.72	27/40 65 turns · $0.00	31B uses 75% fewer tokens 5x slower (local GPU)

SWE-bench Lite (30 problems, simplified patch matching)

Model	Accuracy	Time	Cost
Claude Opus 4.6	10% (3/30)	43 min	~$15
This model	10% (3/30)	105 min	$0

Same accuracy. Not trained on coding tasks — included as reference baseline.

The Journey: From 0/10 to 10/10

Model	Size	BFCL	Agent Score	What Happened
E4B v3 (BFCL fine-tune)	4.5 GB	95.50%	0/10	Infinite loop. "The Benchmark Trap"
E4B Base	3.8 GB	80.25%	6/10	Works but shallow attention
E4B v5 (reasoning)	4 GB	TBD	7/10	Better reasoning but stops early
E4B v6 (multi-turn)	4 GB	TBD	0/10	42 layers can't sustain attention
31B Base q4	16 GB	92.25%	9/10	Already capable — 60 layers help
31B v6 (this model)	16 GB	TBD	10/10	Fine-tune improves quality, not just capability

Key insight: 4.5B params (42 layers, 8 heads) can't sustain multi-turn agent reasoning. 31B (60 layers, 16 heads) can. The fine-tune adds error recovery and persistence, but the base architecture must be large enough.

Usage

With MLX (Apple Silicon)

from mlx_lm import load, generate

model, tokenizer = load("KikoCis/gemma-4-31b-agent-v6-MLX")

prompt = """You are an autonomous agent. To run commands: {"name": "bash", "arguments": "command"}

TASK: Download and analyze protein P53 from UniProt.
Begin."""

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)

As Agent (with agent_runner.py)

# Start MLX server
python3 -m mlx_lm.server --model KikoCis/gemma-4-31b-agent-v6-MLX --port 8095

# Run agent
python3 agent_runner.py \
  --api-url http://localhost:8095/v1 \
  --model gemma4-31b-v6 \
  --prompt "Your task here..."

Training Details

Base: Gemma 4 31B IT (q4 MLX quantization)
Method: LoRA rank 8, all 60 layers, mask_prompt
Dataset: 17,396 examples (proprietary, not published)
- 43% multi-turn agent trajectories
- 17% bash/CLI
- 11% error recovery
- 10% reasoning (OODA, first principles)
- 24% tool-calling format
Training: 1000 iterations, batch_size=2, lr=3e-5, grad_checkpoint
Hardware: consumer hardware
Val loss: 2.263 → 0.378

Limitations

Not trained on coding — SWE-bench at 10% is a reference baseline, not optimized
Slow on local GPU — ~30s per response on M4 Max (vs ~2s for Claude API)
No web search — can't look up documentation when stuck (planned for v7)
Permissions issues — non-root containers limit what the agent can install/configure
MLX format only — GGUF conversion pending (Gemma 4 PLE architecture has converter quirks)

Citation

@misc{cisneros2026benchmarktrap,
  title={The Benchmark Trap: How 95\% BFCL Produces a 0\% Agent},
  author={Cisneros, Kiko and Claude Opus 4.6},
  year={2026},
  publisher={Utopia IA},
  url={https://huggingface.co/KikoCis/gemma-4-31b-agent-v6-MLX}
}

KikoCis
/

gemma-4-31b-agent-v6-MLX