Instructions to use KikoCis/gemma-4-31b-agent-v6-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use KikoCis/gemma-4-31b-agent-v6-MLX with MLX:
# Make sure mlx-lm is installed # pip install --upgrade mlx-lm # if on a CUDA device, also pip install mlx[cuda] # Generate text with mlx-lm from mlx_lm import load, generate model, tokenizer = load("KikoCis/gemma-4-31b-agent-v6-MLX") prompt = "Once upon a time in" text = generate(model, tokenizer, prompt=prompt, verbose=True) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- MLX LM
How to use KikoCis/gemma-4-31b-agent-v6-MLX with MLX LM:
Generate or start a chat session
# Install MLX LM uv tool install mlx-lm # Generate some text mlx_lm.generate --model "KikoCis/gemma-4-31b-agent-v6-MLX" --prompt "Once upon a time"
Gemma 4 31B Agent v6 β MLX
The first local model to match Claude Opus on real-world autonomous tasks
| π 10/10 Bioinformatics | π 10/10 DevOps | π 6/10 Data Engineering |
| β‘ $0 cost Β· 16 GB MLX q4 Β· Runs on Apple Silicon | ||
What This Is
A fine-tuned Gemma 4 31B (q4 MLX) trained on a curated dataset emphasizing:
- Multi-turn agent trajectories (43%) β 10-20 step sequences with error recovery
- Error recovery patterns (11%) β "command fails β adapt β retry"
- Bash/CLI (17%) β real shell commands
- Tool-calling format (24%) β JSON tool use
- Reasoning (10%) β OODA loop, first principles
Why This Exists: The Benchmark Trap
We discovered that 95% BFCL (Berkeley Function Calling Leaderboard) = 0% real agent capability.
A fine-tuned E4B model scored 95.50% on BFCL but 0/10 on our autonomous Docker challenge β entering infinite loops, unable to recover from errors. The unfine-tuned base scored 6/10.
Standard benchmarks test format. Real work tests reasoning.
Real-World Agent Benchmark Results
4 challenges Γ 30 min each, autonomous inside Docker containers:
| Challenge | Claude Opus 4.6 (Cloud, $4.72) |
This Model (Local, $0) |
Comments |
|---|---|---|---|
| 1. Bioinformatics Download P53 from UniProt, parse JSON, extract structure + mutations, HTML report |
9/10 Β· 13 turns ~21K input Β· 13.5K output $1.33 |
10/10 Β· 9 turns ~5K input Β· 1.2K output FREE |
Both fail at same JSON parsing bug. 31B produces larger report (121 KB) |
| 2. Security CTF Deploy DVWA, exploit SQLi + XSS + command injection |
0/10 Β· 64 turns $1.59 |
1/10 Β· 30 turns FREE |
Container lacks sudo. Both fail. Test issue, not model. |
| 3. Data Engineering NYC taxi pipeline: download, clean, analytics, Chart.js dashboard |
9/10 Β· 19 turns 12 charts, dark theme $1.17 |
6/10 Β· 7 turns 3 charts, basic CSS FREE |
Both have correct data. Opus wins on presentation quality. |
| 4. DevOps Flask + Nginx + Prometheus + health check + status page |
10/10 Β· 26 turns 3 services running $0.63 |
10/10 Β· 19 turns 1 service running FREE |
Opus: full infra live. 31B: configs correct, exec limited by permissions. |
| TOTAL | 28/40 122 turns Β· $4.72 |
27/40 65 turns Β· $0.00 |
31B uses 75% fewer tokens 5x slower (local GPU) |
SWE-bench Lite (30 problems, simplified patch matching)
| Model | Accuracy | Time | Cost |
|---|---|---|---|
| Claude Opus 4.6 | 10% (3/30) | 43 min | ~$15 |
| This model | 10% (3/30) | 105 min | $0 |
Same accuracy. Not trained on coding tasks β included as reference baseline.
The Journey: From 0/10 to 10/10
| Model | Size | BFCL | Agent Score | What Happened |
|---|---|---|---|---|
| E4B v3 (BFCL fine-tune) | 4.5 GB | 95.50% | 0/10 | Infinite loop. "The Benchmark Trap" |
| E4B Base | 3.8 GB | 80.25% | 6/10 | Works but shallow attention |
| E4B v5 (reasoning) | 4 GB | TBD | 7/10 | Better reasoning but stops early |
| E4B v6 (multi-turn) | 4 GB | TBD | 0/10 | 42 layers can't sustain attention |
| 31B Base q4 | 16 GB | 92.25% | 9/10 | Already capable β 60 layers help |
| 31B v6 (this model) | 16 GB | TBD | 10/10 | Fine-tune improves quality, not just capability |
Key insight: 4.5B params (42 layers, 8 heads) can't sustain multi-turn agent reasoning. 31B (60 layers, 16 heads) can. The fine-tune adds error recovery and persistence, but the base architecture must be large enough.
Usage
With MLX (Apple Silicon)
from mlx_lm import load, generate
model, tokenizer = load("KikoCis/gemma-4-31b-agent-v6-MLX")
prompt = """You are an autonomous agent. To run commands: {"name": "bash", "arguments": "command"}
TASK: Download and analyze protein P53 from UniProt.
Begin."""
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
As Agent (with agent_runner.py)
# Start MLX server
python3 -m mlx_lm.server --model KikoCis/gemma-4-31b-agent-v6-MLX --port 8095
# Run agent
python3 agent_runner.py \
--api-url http://localhost:8095/v1 \
--model gemma4-31b-v6 \
--prompt "Your task here..."
Training Details
- Base: Gemma 4 31B IT (q4 MLX quantization)
- Method: LoRA rank 8, all 60 layers, mask_prompt
- Dataset: 17,396 examples (proprietary, not published)
- 43% multi-turn agent trajectories
- 17% bash/CLI
- 11% error recovery
- 10% reasoning (OODA, first principles)
- 24% tool-calling format
- Training: 1000 iterations, batch_size=2, lr=3e-5, grad_checkpoint
- Hardware: consumer hardware
- Val loss: 2.263 β 0.378
Limitations
- Not trained on coding β SWE-bench at 10% is a reference baseline, not optimized
- Slow on local GPU β ~30s per response on M4 Max (vs ~2s for Claude API)
- No web search β can't look up documentation when stuck (planned for v7)
- Permissions issues β non-root containers limit what the agent can install/configure
- MLX format only β GGUF conversion pending (Gemma 4 PLE architecture has converter quirks)
Citation
@misc{cisneros2026benchmarktrap,
title={The Benchmark Trap: How 95\% BFCL Produces a 0\% Agent},
author={Cisneros, Kiko and Claude Opus 4.6},
year={2026},
publisher={Utopia IA},
url={https://huggingface.co/KikoCis/gemma-4-31b-agent-v6-MLX}
}
Links
Quantized