How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

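# Download the v6.1 templatefix GGUF from the Hub and load it.
llm = Llama.from_pretrained(
	repo_id="deucebucket/Gemma-4-26B-A4B-it-Cerebellum-v6-GGUF",
	filename="gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix-Q3_K_M.gguf",
)
response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# The reply text is in the first choice of the OpenAI-style response.
print(response["choices"][0]["message"]["content"])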


Gemma 4 26B-A4B-it Cerebellum GGUF

This repository contains GGUF builds derived from google/gemma-4-26B-A4B-it.

2026-05-22 Update

Added:

gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix-Q3_K_M.gguf
sha256: d24229facdef8360a7ffa8b37a50e1de636b9139a5eba0efe899828e45ae7989

gemma-4-26b-a4b-it.mmproj.gguf
sha256: b762c43119ebdc3e3c36d929d958e827fac35b03278dda9203f87131aee1f185
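
The published digests can be checked after download. The following is a minimal sketch, assuming the files keep the names listed above; the helper names `sha256_of` and `verify` are illustrative and not part of any tooling in this repository.

```python
import hashlib
from pathlib import Path

# Digests copied from the 2026-05-22 file list in this card.
EXPECTED_SHA256 = {
    "gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix-Q3_K_M.gguf":
        "d24229facdef8360a7ffa8b37a50e1de636b9139a5eba0efe899828e45ae7989",
    "gemma-4-26b-a4b-it.mmproj.gguf":
        "b762c43119ebdc3e3c36d929d958e827fac35b03278dda9203f87131aee1f185",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a multi-gigabyte GGUF is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True only if the file name is known and its digest matches."""
    path = Path(path)
    expected = EXPECTED_SHA256.get(path.name)
    return expected is not None and sha256_of(path) == expected
```

Calling `verify("gemma-4-26b-a4b-it.mmproj.gguf")` in the download directory returns `True` only when the local file matches the digest listed above.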

The v6.1 file keeps the v6 tensor allocation and updates GGUF/runtime-facing metadata for Gemma 4 chat-template use. The update was tested with llama-server --jinja --reasoning auto and request-level no-thinking controls.

Older files in this repository are retained for reproducibility.

Measured launch (RTX 3090, llama.cpp)

Measured 2026-06-13 on a single RTX 3090 (24 GB), one llama-server, KV cache q8_0:

| Metric | Measured |
| --- | --- |
| Decode speed | 123 tok/s |
| Peak VRAM (4-slot serving) | 15.1 GB |
| Max measured context (q8_0 KV) | 131,072 tokens |

llama-server -m gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix-Q3_K_M.gguf \
  -ngl 99 --parallel 4 -c 24576 --jinja --reasoning-budget 0

These figures were measured on this rig only; no quality claims are made beyond them.

Tested Runtime

Runtime used for the 2026-05-22 templatefix checks:

llama.cpp fork: https://github.com/deucebucket/llama.cpp
branch: cerebellum/gemma4-runtime-fixes
fork commit: ded491334 fix: harden Gemma 4 server budgets
base build: b8930-59fa0b455

Server shape used locally:

llama-server \
  --model gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix-Q3_K_M.gguf \
  --mmproj gemma-4-26b-a4b-it.mmproj.gguf \
  --n-gpu-layers 99 \
  --ctx-size 65536 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja \
  --reasoning auto \
  --media-path /tmp/

Normal no-thinking requests used:

{
  "chat_template_kwargs": {"enable_thinking": false},
  "thinking_budget_tokens": 0
}

Bounded-thinking smoke requests used thinking_budget_tokens: 128.
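
The request fields above can be attached to an ordinary chat-completions call. The sketch below assumes llama-server is listening on its default address (`http://127.0.0.1:8080`) and exposes the OpenAI-compatible `/v1/chat/completions` route; `build_chat_request` and `send` are hypothetical helper names. Whether the bounded-thinking runs also set `enable_thinking` to true is not stated in this card, so the flag is left to the caller.

```python
import json
import urllib.request


def build_chat_request(prompt, enable_thinking=False, thinking_budget_tokens=0):
    """Build a request body carrying the thinking controls described above."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
        "thinking_budget_tokens": thinking_budget_tokens,
    }


def send(body, base_url="http://127.0.0.1:8080"):
    """POST the body to a running llama-server (address is an assumption)."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Default arguments reproduce the normal no-thinking request.
no_thinking = build_chat_request("Summarize this note in one sentence.")
```

With a server running, `send(no_thinking)` returns the parsed JSON response; passing `thinking_budget_tokens=128` to `build_chat_request` gives the budget used in the bounded-thinking smoke requests.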

2026-05-22 Templatefix Test Artifacts

Creative-writing smoke files:

creative_eval_20260522/regular_v6_1_templatefix_creative_summary.json
creative_eval_20260522/regular_v6_1_templatefix_creative_rerun_longcaps_summary.json

Non-coding tool-use files:

agentic_eval_20260522/README.md
agentic_eval_20260522/regular_v6_1_noncoding_agentic_tools_strict_summary.json

Observed 2026-05-22 results from those artifacts:

| Area | Harness | Observed result |
| --- | --- | --- |
| No-thinking output channel | six creative prompts | reasoning_len=0 in recorded outputs |
| Template leakage markers | six creative prompts | no <think> marker or template marker recorded by checker |
| Creative long-cap rerun | four prompts rerun after initial length caps | four stop finishes in rerun summary |
| Non-coding tool workflow | three strict OpenAI-style tool tasks | schedule_strict, release_notes_strict, creative_brief_strict listed in pass_cases |
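
A mechanical check of the kind summarized in the first three rows could look like the sketch below. It is not the project's checker. It assumes an OpenAI-style response in which llama-server places any reasoning text in `message.reasoning_content`, and the marker list beyond `<think>` is illustrative only.

```python
# Illustrative markers; only "<think>" is named in this card.
LEAK_MARKERS = ("<think>", "</think>", "<start_of_turn>", "<end_of_turn>")


def check_response(response):
    """Summarize one chat-completions response for the no-thinking checks."""
    choice = response["choices"][0]
    message = choice["message"]
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    return {
        "reasoning_len": len(reasoning),
        "leaked_markers": [m for m in LEAK_MARKERS if m in content],
        "finish_reason": choice.get("finish_reason"),
    }


def passes_no_thinking(report):
    """Pass when no reasoning text, no leaked marker, and a normal stop."""
    return (
        report["reasoning_len"] == 0
        and not report["leaked_markers"]
        and report["finish_reason"] == "stop"
    )
```

A response that ends with `finish_reason` equal to `length` fails this check, which corresponds to the prompts that were rerun with longer caps.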

The non-coding tool harness used mock tools named list_calendar, create_calendar_hold, search_notes, save_note, and add_task. It did not test code editing.
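
To illustrate what a mocked tool harness involves, the sketch below dispatches OpenAI-style tool calls to canned Python functions. Only the five tool names come from this card; the argument names, return values, and the `dispatch` helper are hypothetical and do not reproduce the actual harness.

```python
import json

# Canned responses keyed by the tool names listed above.
# Argument keys such as "title" and "task" are assumptions for illustration.
MOCK_TOOLS = {
    "list_calendar": lambda args: {"events": []},
    "create_calendar_hold": lambda args: {"created": True, "title": args.get("title")},
    "search_notes": lambda args: {"matches": []},
    "save_note": lambda args: {"saved": True},
    "add_task": lambda args: {"added": True, "task": args.get("task")},
}


def dispatch(tool_call):
    """Run one OpenAI-style tool call against the mocks; reject malformed calls."""
    function = tool_call.get("function") or {}
    name = function.get("name")
    if name not in MOCK_TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        # OpenAI-style calls carry arguments as a JSON-encoded string.
        args = json.loads(function.get("arguments") or "{}")
    except (json.JSONDecodeError, TypeError):
        return {"error": "arguments are not valid JSON"}
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}
    return MOCK_TOOLS[name](args)
```

Because every tool returns fixed data, a run can be scored purely on whether the model emitted well-formed calls to the expected tools.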

Evaluation

Benchmark results for the Cerebellum v6 tensor allocation, measured directly on the GGUF with llama.cpp llama-server on an RTX 3090. The v6.1 templatefix file keeps the v6 tensor allocation with zero tensor changes (metadata-only update), so these measurements describe the same weights. Summary JSONs are in benchmark_results/ in this repository.

| Benchmark | Cerebellum v6 (11 GB) | Local Q3_K_M baseline |
| --- | --- | --- |
| ARC-Challenge | 95.56% (1172 q) | 95.22% |
| HellaSwag | 84.55% (10042 q) | 86.57% |
| MMLU-Redux | 71.33% (2400 q) | 73.67% |
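
The gaps between the two columns, and the number of correct answers implied by each v6 score, follow directly from the table. The sketch below is plain arithmetic on the figures above and introduces no new measurements; the function names are illustrative.

```python
# (v6 accuracy %, baseline accuracy %, question count) copied from the table.
RESULTS = {
    "ARC-Challenge": (95.56, 95.22, 1172),
    "HellaSwag": (84.55, 86.57, 10042),
    "MMLU-Redux": (71.33, 73.67, 2400),
}


def deltas(results):
    """Percentage-point difference, v6 minus baseline, per benchmark."""
    return {name: round(v6 - base, 2) for name, (v6, base, _) in results.items()}


def implied_correct(results):
    """Number of correct v6 answers implied by accuracy and question count."""
    return {name: round(count * v6 / 100) for name, (v6, _, count) in results.items()}


for name, delta in deltas(RESULTS).items():
    print(f"{name}: {delta:+.2f} percentage points")
```

On these figures v6 is 0.34 points above the baseline on ARC-Challenge and 2.02 and 2.34 points below it on HellaSwag and MMLU-Redux.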

Protocol: multiple-choice benchmarks run against a local llama-server with the project benchmark harness at temperature 0. HumanEval is not listed in the metadata because the retained v6 HumanEval artifacts are marked for audit in local notes. For Gemma 4, the current HumanEval/EvalPlus protocol uses the chat-completions harness (scripts/benchmark_evalplus_chat.py) with enable_thinking: false, thinking_budget_tokens: 0, and BENCH_WORKERS=1, not raw completions.

Historical Same-Repo Benchmark Artifacts

The following benchmark artifacts are from the earlier v6 line and the local Q3_K_M baseline. They are included as historical same-project measurements, not as new v6.1 measurements.

| Artifact set | ARC-Challenge | HellaSwag | MMLU-Redux | HumanEval note |
| --- | --- | --- | --- | --- |
| q3km_baseline_* | 95.2218 | 86.5664 | 73.6667 | q3km_baseline_humaneval_results.json: 62.2 pass@1 |
| cerebellum_v6_* | 95.5631 | 84.55 | 71.3333 | v6 HumanEval artifacts are retained but marked for audit in local notes |

For Gemma 4 HumanEval/EvalPlus, the local protocol now uses chat completions, not raw completions:

llama-server --jinja --reasoning auto
chat_template_kwargs: {"enable_thinking": false}
thinking_budget_tokens: 0
BENCH_WORKERS=1

Files and Provenance

Main v6.1 GGUF:

source base: google/gemma-4-26B-A4B-it
quantization family: mixed-precision GGUF
recipe lineage: Cerebellum v6 tensor allocation
base quant lineage: Q3_K_M with bartowski imatrix

Matching mmproj:

gemma-4-26b-a4b-it.mmproj.gguf

Notes

  • The 2026-05-22 tests were run on local llama-server.
  • The opencode coding-agent test is not used as a model-card result. In one internal White and Black project run, the model connected through the harness and ran a Godot test, then produced malformed edit-tool calls.
  • The creative-writing checks are smoke tests plus mechanical checks, not a human preference benchmark.
  • The non-coding tool checks use mocked tools and fixed task definitions.

Credits

  • Base model: Google Gemma Team, google/gemma-4-26B-A4B-it
  • Imatrix source used in the v6 lineage: bartowski, bartowski/google_gemma-4-26B-A4B-it-GGUF
  • GGUF/runtime: llama.cpp
  • Method and quantization workflow: deucebucket/osmosis Cerebellum pipeline
  • Local test artifacts: deucebucket Cerebellum workflow