docs: update templatefix test notes
README.md
CHANGED
|
@@ -3,11 +3,11 @@ license: gemma
|
|
| 3 |
library_name: gguf
|
| 4 |
base_model: google/gemma-4-26B-A4B-it
|
| 5 |
base_model_relation: quantized
|
| 6 |
-
model_name: Gemma-4-26B-A4B-it-Cerebellum-v6-GGUF
|
| 7 |
model_creator: google
|
| 8 |
model_type: gemma4
|
| 9 |
quantized_by: deucebucket
|
| 10 |
-
pipeline_tag:
|
| 11 |
tags:
|
| 12 |
- GGUF
|
| 13 |
- gemma4
|
|
@@ -18,239 +18,149 @@ tags:
|
|
| 18 |
- imatrix
|
| 19 |
- moe
|
| 20 |
- 3-bit
|
| 21 |
-
-
|
| 22 |
-
- multimodal
|
| 23 |
-
- vision
|
| 24 |
---
|
| 25 |
|
| 26 |
-
# Gemma 4 26B-A4B-it
|
| 27 |
|
| 28 |
-
|
|
|
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
-
|
| 35 |
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|---|---|
|
| 40 |
-
| **File** | `gemma-4-26B-A4B-it-cerebellum-v6.gguf` |
|
| 41 |
-
| **mmproj** | `mmproj-google_gemma-4-26B-A4B-it-f16.gguf` |
|
| 42 |
-
| **Size** | 11.7 GB (backbone) + 1.14 GB (mmproj) |
|
| 43 |
-
| **Base model** | `google/gemma-4-26B-A4B-it` |
|
| 44 |
-
| **Base quant** | Q3_K_M with [bartowski's imatrix](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF) |
|
| 45 |
-
| **Format** | GGUF, mixed precision |
|
| 46 |
-
| **Test hardware** | RTX 3090, llama.cpp |
|
| 47 |
-
|
| 48 |
-
## Benchmarks
|
| 49 |
-
|
| 50 |
-
| Benchmark | Result |
|
| 51 |
-
|-----------|:------:|
|
| 52 |
-
| WikiText PPL | 12,054 |
|
| 53 |
-
| HumanEval pass@1 | 72.0% |
|
| 54 |
-
| ARC-Challenge | 95.6% |
|
| 55 |
-
| HellaSwag | 84.7% |
|
| 56 |
-
| MMLU-Redux | 71.2% |
|
| 57 |
-
|
| 58 |
-
All results measured locally on an RTX 3090 with llama.cpp. PPL was measured on the WikiText-2 test set with 2048 context and 128 chunks.
|
| 59 |
-
|
| 60 |
-
PPL is high in absolute terms for this model. This appears consistent across Gemma 4 26B quant levels tested locally and may reflect the model's MoE routing behavior on WikiText specifically.
|
| 61 |
-
|
| 62 |
-
## What Changed: v1 Through v6
|
| 63 |
-
|
| 64 |
-
Each version added a new layer of ablation data. The method is always the same: change one thing, measure PPL, keep it only if it helps.
|
| 65 |
-
|
| 66 |
-
| Version | PPL | HumanEval | What Changed |
|
| 67 |
-
|---------|-----|-----------|-------------|
|
| 68 |
-
| v1 | 20,614 | 65.2% | Group-level ablation: 5 tensor groups tested at Q2_K |
|
| 69 |
-
| v2 | 19,826 | 65.9% | + attn_q per-layer ablation (30 layers tested, 9 promoted to Q5_K) |
|
| 70 |
-
| v3 | 19,826 | 67.1% | + PLE protection (norms/scales forced to F32) |
|
| 71 |
-
| v4 | 12,614 | 69.5% | + ffn_up per-layer ablation + precision rebalance |
|
| 72 |
-
| v5 | 12,356 | 71.3% | + attn_k reverse ablation (30 layers tested, 7 promoted to Q3_K) |
|
| 73 |
-
| **v6** | **12,054** | **72.0%** | + MoE router surgery: layer 8 ffn_gate_inp F32→Q8_0 |
|
| 74 |
-
|
| 75 |
-
## How Cerebellum Works
|
| 76 |
-
|
| 77 |
-
Cerebellum assigns quantization precision per tensor based on measured impact. Each tensor group and individual layer is tested by changing its precision and measuring perplexity. Only changes that improve or maintain quality are kept.
|
| 78 |
-
|
| 79 |
-
### Group Ablation
|
| 80 |
-
|
| 81 |
-
Each tensor category was tested at Q2_K and measured by PPL impact:
|
| 82 |
-
|
| 83 |
-
| Group | Tensors | PPL Delta | Action |
|
| 84 |
-
|-------|---------|-----------|--------|
|
| 85 |
-
| attn_q | 30 | +13.4% | Per-layer testing (9 layers need Q5_K) |
|
| 86 |
-
| ffn_gate | 30 | -1.2% | Left at Q3_K |
|
| 87 |
-
| expert_gate_up | 30 | -5.5% | Set to Q2_K |
|
| 88 |
-
| attn_k | 30 | -12.1% | Per-layer testing (7 layers benefit from Q3_K) |
|
| 89 |
-
| ffn_up | 30 | -18.2% | Set to Q2_K |
|
| 90 |
-
|
| 91 |
-
Four of the five tested groups had lower PPL at Q2_K (ffn_gate only marginally, at -1.2%), meaning Q3_K_M was spending bits on tensors that don't need them.
|
| 92 |
-
|
| 93 |
-
### Layer Ablation
|
| 94 |
-
|
| 95 |
-
Groups with mixed results were tested per layer:
|
| 96 |
-
|
| 97 |
-
- **attn_q**: All 30 layers tested individually at Q2_K. 9 layers exceeded the sensitivity threshold and stay at Q5_K. The other 21 tolerate Q2_K.
|
| 98 |
-
- **attn_k**: All 30 layers tested individually. 7 layers showed PPL improvement when promoted from Q2_K to Q3_K (layer 23: -3.8%, layer 18: -2.8%). 4 layers (5, 11, 16, 29) were confirmed better at Q2_K.
|
| 99 |
-
|
| 100 |
-
### MoE Router Surgery (New in v6)
|
| 101 |
-
|
| 102 |
-
llama-quantize ignores `--tensor-type-file` overrides for `ffn_gate_inp.weight` (MoE router) tensors. We built [gguf_tensor_surgery.py](https://github.com/deucebucket/osmosis/blob/master/scripts/gguf_tensor_surgery.py) to recast individual tensors directly in the GGUF file.
|
| 103 |
-
|
| 104 |
-
All 30 router layers were tested individually at Q8_0 (F32→Q8_0):
|
| 105 |
-
|
| 106 |
-
| Layer | PPL | Delta | Category |
|
| 107 |
-
|-------|------|-------|----------|
|
| 108 |
-
| 8 | 12,054 | -2.4% | Best universal candidate |
|
| 109 |
-
| 10 | 11,872 | -3.9% | Best PPL but regresses HumanEval (-9.7%) |
|
| 110 |
-
| 6 | 11,988 | -3.0% | Win (not stacked — routing compensation) |
|
| 111 |
-
| 9 | 12,044 | -2.5% | Win (not stacked) |
|
| 112 |
-
| 12 | 12,041 | -2.5% | Win (not stacked) |
|
| 113 |
-
| 23 | 12,052 | -2.5% | Win (not stacked) |
|
| 114 |
-
| 0 | 12,974 | +5.0% | Sensitive |
|
| 115 |
-
| 1 | 13,525 | +9.5% | Very sensitive |
|
| 116 |
-
| 2 | 13,239 | +7.1% | Sensitive |
|
| 117 |
-
| 4 | 13,047 | +5.6% | Sensitive |
|
| 118 |
-
|
| 119 |
-
**Why layer 8 and not layer 10:** Layer 10 had the best PPL improvement (-3.9%), but full HumanEval testing showed it regresses code generation from 71.3% to 61.6% (-9.7 percentage points). Layer 10's router appears to control routing to code-relevant experts, so degrading it hurts coding while helping general perplexity. Layer 8 improves both PPL (-2.4%) and HumanEval (+0.7 percentage points, 71.3% to 72.0%) with no regressions on any benchmark.
|
| 120 |
-
|
| 121 |
-
**Router stacking doesn't work:** Combined demotion of even the top 3 layers worsens PPL vs baseline. The model compensates for one degraded router but not multiple simultaneously. This is a routing compensation effect specific to MoE architectures.
|
| 122 |
-
|
| 123 |
-
**Precision curve for layer 8's router:**
|
| 124 |
-
|
| 125 |
-
| Precision | PPL | Delta |
|
| 126 |
-
|-----------|------|-------|
|
| 127 |
-
| F32 (default) | 12,356 | — |
|
| 128 |
-
| Q8_0 | 12,054 | -2.4% |
|
| 129 |
-
| Q4_0 | 12,355 | ~0% |
|
| 130 |
-
| Q6_K | 14,317 | +15.9% |
|
| 131 |
-
| Q2_K | 14,482 | +17.2% |
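The Delta column in the precision curve is the relative PPL change against the F32 router baseline (12,356). A minimal sketch that recomputes it from the values in the table; the function name `ppl_delta_pct` is illustrative and not part of the Cerebellum tooling:

```python
def ppl_delta_pct(ppl: float, baseline: float) -> float:
    """Relative perplexity change versus a baseline, in percent."""
    return (ppl - baseline) / baseline * 100.0


# Layer 8 router precision curve, values copied from the table above.
BASELINE_F32 = 12_356
CURVE = {"Q8_0": 12_054, "Q4_0": 12_355, "Q6_K": 14_317, "Q2_K": 14_482}

for precision, ppl in CURVE.items():
    # Negative values mean lower (better) perplexity than the F32 router.
    print(f"{precision}: {ppl_delta_pct(ppl, BASELINE_F32):+.1f}%")
```

The same formula applies to the per-layer router table, where each row is compared against the v5 PPL of 12,356.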
|
| 132 |
-
|
| 133 |
-
Q8_0 is the only precision that improves PPL. K-quant formats (Q6_K, Q2_K) use 256-element super-blocks with sub-block scales — this structure disrupts the router's fine-grained expert selection. Q8_0's simpler per-block rounding acts as beneficial regularization.
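To make the "simpler per-block rounding" concrete, here is a simplified pure-Python sketch of Q8_0-style block quantization: each block of 32 weights shares one scale, and every weight is rounded to a signed 8-bit code. It is an illustration only. It ignores that the real format stores the scale as fp16, the helper names are hypothetical, and it is not the code in `gguf_tensor_surgery.py`.

```python
import math

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32 values


def round_half_away(x: float) -> int:
    """Round to the nearest integer, ties away from zero (like C's roundf)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_q8_0_block(values):
    """Return (scale, codes) for one block of BLOCK_SIZE floats."""
    assert len(values) == BLOCK_SIZE
    amax = max(abs(v) for v in values)
    scale = amax / 127.0                      # one scale per block
    inv = 1.0 / scale if scale else 0.0       # all-zero block maps to zeros
    codes = [round_half_away(v * inv) for v in values]
    return scale, codes


def dequantize_q8_0_block(scale, codes):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * c for c in codes]
```

With a single scale per 32 values, the reconstruction error of each weight is bounded by half the block's scale. K-quants add a second level of sub-block scales inside a 256-element super-block, which this sketch does not model.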
|
| 134 |
-
|
| 135 |
-
### Final Precision Map (v6)
|
| 136 |
-
|
| 137 |
-
| Tensor Type | Precision | Count | Rationale |
|
| 138 |
-
|-------------|-----------|-------|-----------|
|
| 139 |
-
| attn_q (9 sensitive layers) | Q5_K | 9 | Layer-validated critical |
|
| 140 |
-
| attn_q (remaining) | Q2_K | 21 | Group-level demotable |
|
| 141 |
-
| attn_k (7 promoted layers) | Q3_K | 7 | Reverse ablation: improve when promoted |
|
| 142 |
-
| attn_k (remaining) | Q2_K | 23 | Group-level demotable |
|
| 143 |
-
| ffn_up | Q2_K | 30 | Group PPL delta: -18.2% |
|
| 144 |
-
| expert_gate_up | Q2_K | 30 | Group PPL delta: -5.5% |
|
| 145 |
-
| ffn_gate | Q3_K | 30 | Tolerant (-1.2%) |
|
| 146 |
-
| ffn_gate_inp layer 8 (router) | Q8_0 | 1 | Per-layer surgery: -2.4% PPL, +0.7% HumanEval |
|
| 147 |
-
| ffn_gate_inp (router, other) | F32 | 29 | Group PPL delta: +30.7% when crushed |
|
| 148 |
-
| Norms, scales | F32 | 392 | Structural — always full precision |
|
| 149 |
|
| 150 |
-
|
| 151 |
|
| 152 |
-
|
| 153 |
|
| 154 |
-
|
| 155 |
|
| 156 |
-
|
| 157 |
-
- `mmproj-google_gemma-4-26B-A4B-it-f16.gguf` — the vision encoder + projector (1.14 GB)
|
| 158 |
|
| 159 |
-
|
| 160 |
|
| 161 |
-
|
| 162 |
|
| 163 |
```bash
|
| 164 |
llama-server \
|
| 165 |
-
-
|
| 166 |
-
--mmproj
|
| 167 |
--jinja \
|
| 168 |
-
--reasoning
|
| 169 |
-
--
|
| 170 |
-
-ngl 99 \
|
| 171 |
-
-c 4096
|
| 172 |
```
|
| 173 |
|
| 174 |
-
|
| 175 |
-
- `--mmproj` — loads the vision encoder. The mmproj filename starts with `mmproj-` so it also works with `--mmproj-auto` auto-download if placed in the same directory.
|
| 176 |
-
- `--jinja` — enables the Gemma 4 chat template (embedded in the GGUF; required for correct formatting)
|
| 177 |
-
- `--reasoning off --reasoning-budget 0` — disables thinking mode which can cause infinite loops without dedicated reasoning tokens
|
| 178 |
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
-d '{
|
| 185 |
-
"model": "gemma4-cerebellum",
|
| 186 |
-
"messages": [
|
| 187 |
-
{
|
| 188 |
-
"role": "user",
|
| 189 |
-
"content": [
|
| 190 |
-
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
|
| 191 |
-
{"type": "text", "text": "What is shown in this image?"}
|
| 192 |
-
]
|
| 193 |
-
}
|
| 194 |
-
]
|
| 195 |
-
}'
|
| 196 |
```
|
| 197 |
|
| 198 |
-
|
| 199 |
|
| 200 |
-
|
| 201 |
|
| 202 |
-
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
|
|
|
|
| 206 |
```
|
| 207 |
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
|
| 211 |
-
|
|
|
|
| 212 |
```
|
| 213 |
|
| 214 |
-
|
| 215 |
|
| 216 |
-
|
| 217 |
-
-
|
| 218 |
-
-
|
| 219 |
-
|
| 220 |
|
| 221 |
-
|
| 222 |
|
| 223 |
-
##
|
| 224 |
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
|
| 228 |
|
| 229 |
-
|
| 230 |
-
|
| 231 |
```
|
| 232 |
|
| 233 |
-
##
|
| 234 |
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
|
| 238 |
```
|
| 239 |
|
| 240 |
-
|
| 241 |
|
| 242 |
-
##
|
| 243 |
|
| 244 |
-
-
|
| 245 |
-
-
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
-
|
| 249 |
-
|
|
|
|
| 250 |
|
| 251 |
## Credits
|
| 252 |
|
| 253 |
-
-
|
| 254 |
-
-
|
| 255 |
-
-
|
| 256 |
-
- **Method & quantization**: [deucebucket/osmosis](https://github.com/deucebucket/osmosis) — Cerebellum pipeline
|
|
|
|
| 3 |
library_name: gguf
|
| 4 |
base_model: google/gemma-4-26B-A4B-it
|
| 5 |
base_model_relation: quantized
|
| 6 |
+
model_name: Gemma-4-26B-A4B-it-Cerebellum-v6.1-templatefix-GGUF
|
| 7 |
model_creator: google
|
| 8 |
model_type: gemma4
|
| 9 |
quantized_by: deucebucket
|
| 10 |
+
pipeline_tag: text-generation
|
| 11 |
tags:
|
| 12 |
- GGUF
|
| 13 |
- gemma4
|
|
|
|
| 18 |
- imatrix
|
| 19 |
- moe
|
| 20 |
- 3-bit
|
| 21 |
+
- templatefix
|
| 22 |
---
|
| 23 |
|
| 24 |
+
# Gemma 4 26B-A4B-it Cerebellum GGUF
|
| 25 |
|
| 26 |
+
This repository contains GGUF builds derived from
|
| 27 |
+
`google/gemma-4-26B-A4B-it`.
|
| 28 |
|
| 29 |
+
## 2026-05-22 Update
|
| 30 |
|
| 31 |
+
Added:
|
| 32 |
|
| 33 |
+
```text
|
| 34 |
+
gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix.gguf
|
| 35 |
+
sha256: d24229facdef8360a7ffa8b37a50e1de636b9139a5eba0efe899828e45ae7989
|
| 36 |
|
| 37 |
+
gemma-4-26b-a4b-it.mmproj.gguf
|
| 38 |
+
sha256: b762c43119ebdc3e3c36d929d958e827fac35b03278dda9203f87131aee1f185
|
| 39 |
+
```
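A downloaded file can be checked against the digests listed above. A minimal sketch using only the Python standard library; it assumes both files sit in one local directory, which is an assumption about your layout rather than something the repository prescribes, and the helper names are illustrative:

```python
import hashlib
from pathlib import Path

# Digests copied from the 2026-05-22 update list.
EXPECTED_SHA256 = {
    "gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix.gguf":
        "d24229facdef8360a7ffa8b37a50e1de636b9139a5eba0efe899828e45ae7989",
    "gemma-4-26b-a4b-it.mmproj.gguf":
        "b762c43119ebdc3e3c36d929d958e827fac35b03278dda9203f87131aee1f185",
}


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a multi-GB GGUF is never fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected filename to True/False (digest match) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = Path(directory) / name
        results[name] = sha256_of(path) == expected if path.exists() else None
    return results
```

Calling `verify("/path/to/models")` returns a dictionary keyed by filename, so a mismatch or a missing file is visible before the server is started.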
|
| 40 |
|
| 41 |
+
The v6.1 file keeps the v6 tensor allocation and updates GGUF/runtime-facing
|
| 42 |
+
metadata for Gemma 4 chat-template use. The update was tested with
|
| 43 |
+
`llama-server --jinja --reasoning auto` and request-level no-thinking controls.
|
| 44 |
|
| 45 |
+
Older files in this repository are retained for reproducibility.
|
| 46 |
|
| 47 |
+
## Tested Runtime
|
| 48 |
|
| 49 |
+
Runtime used for the 2026-05-22 templatefix checks:
|
|
|
|
| 50 |
|
| 51 |
+
```text
|
| 52 |
+
llama.cpp fork: https://github.com/deucebucket/llama.cpp
|
| 53 |
+
branch: cerebellum/gemma4-runtime-fixes
|
| 54 |
+
fork commit: ded491334 fix: harden Gemma 4 server budgets
|
| 55 |
+
base build: b8930-59fa0b455
|
| 56 |
+
```
|
| 57 |
|
| 58 |
+
Server shape used locally:
|
| 59 |
|
| 60 |
```bash
|
| 61 |
llama-server \
|
| 62 |
+
--model gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix.gguf \
|
| 63 |
+
--mmproj gemma-4-26b-a4b-it.mmproj.gguf \
|
| 64 |
+
--n-gpu-layers 99 \
|
| 65 |
+
--ctx-size 65536 \
|
| 66 |
+
--parallel 1 \
|
| 67 |
+
--flash-attn on \
|
| 68 |
+
--cache-type-k q8_0 \
|
| 69 |
+
--cache-type-v q8_0 \
|
| 70 |
--jinja \
|
| 71 |
+
--reasoning auto \
|
| 72 |
+
--media-path /tmp/
|
| 73 |
```
|
| 74 |
|
| 75 |
+
Normal no-thinking requests used:
|
| 76 |
|
| 77 |
+
```json
|
| 78 |
+
{
|
| 79 |
+
"chat_template_kwargs": {"enable_thinking": false},
|
| 80 |
+
"thinking_budget_tokens": 0
|
| 81 |
+
}
|
| 82 |
```
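These fields travel inside an otherwise ordinary chat-completions request body. A hedged sketch of how a client might assemble and post one with the Python standard library. The base URL (llama-server's default port 8080), the `model` placeholder, and the helper names are assumptions; `thinking_budget_tokens` is taken from the runtime tested for this card and may not be accepted by other llama.cpp builds.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumption: local llama-server, default port


def build_request_body(prompt, enable_thinking=False, thinking_budget_tokens=0):
    """Build a chat-completions body carrying the no-thinking controls."""
    return {
        "model": "gemma4-cerebellum",  # placeholder name, not prescribed by the card
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
        "thinking_budget_tokens": thinking_budget_tokens,
    }


def chat(prompt, **controls):
    """POST the request to the local server and return the decoded JSON reply."""
    body = build_request_body(prompt, **controls)
    request = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the defaults, `build_request_body("Write a haiku")` reproduces the no-thinking shape shown above; passing `thinking_budget_tokens=128` gives the budget used for the bounded-thinking smoke requests, though whether `enable_thinking` was also switched on for those runs is not stated here.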
|
| 83 |
|
| 84 |
+
Bounded-thinking smoke requests used `thinking_budget_tokens: 128`.
|
| 85 |
|
| 86 |
+
## 2026-05-22 Templatefix Test Artifacts
|
| 87 |
|
| 88 |
+
Creative-writing smoke files:
|
| 89 |
+
|
| 90 |
+
```text
|
| 91 |
+
creative_eval_20260522/regular_v6_1_templatefix_creative_summary.json
|
| 92 |
+
creative_eval_20260522/regular_v6_1_templatefix_creative_rerun_longcaps_summary.json
|
| 93 |
```
|
| 94 |
|
| 95 |
+
Non-coding tool-use files:
|
| 96 |
+
|
| 97 |
+
```text
|
| 98 |
+
agentic_eval_20260522/README.md
|
| 99 |
+
agentic_eval_20260522/regular_v6_1_noncoding_agentic_tools_strict_summary.json
|
| 100 |
```
|
| 101 |
|
| 102 |
+
Observed 2026-05-22 results from those artifacts:
|
| 103 |
|
| 104 |
+
| Area | Harness | Observed result |
|
| 105 |
+
|---|---|---|
|
| 106 |
+
| No-thinking output channel | six creative prompts | `reasoning_len=0` in recorded outputs |
|
| 107 |
+
| Template leakage markers | six creative prompts | no `<think>` marker or template marker recorded by checker |
|
| 108 |
+
| Creative long-cap rerun | four prompts rerun after initial length caps | four stop finishes in rerun summary |
|
| 109 |
+
| Non-coding tool workflow | three strict OpenAI-style tool tasks | `schedule_strict`, `release_notes_strict`, `creative_brief_strict` listed in `pass_cases` |
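The mechanical checks in the first two rows can be approximated in a few lines. This sketch is not the harness that produced the artifacts. It assumes an OpenAI-style response whose message may carry a `reasoning_content` field, and the default marker tuple is an illustrative guess that should be extended with whatever template markers matter for your setup.

```python
DEFAULT_MARKERS = ("<think>", "</think>")  # assumption: extend with template markers


def check_response(response, markers=DEFAULT_MARKERS):
    """Summarize reasoning length, leaked markers, and finish reason for one reply."""
    choice = response["choices"][0]
    message = choice["message"]
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    leaked = [m for m in markers if m in content]
    return {
        "reasoning_len": len(reasoning),
        "leaked_markers": leaked,
        "finish_reason": choice.get("finish_reason"),
        # Passes only if no reasoning text was emitted and no marker leaked.
        "ok": len(reasoning) == 0 and not leaked,
    }
```

A reply that ends with `finish_reason` of `length` rather than `stop` is the case the long-cap rerun row addresses, so it is reported separately instead of being folded into `ok`.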
|
| 110 |
|
| 111 |
+
The non-coding tool harness used mock tools named `list_calendar`,
|
| 112 |
+
`create_calendar_hold`, `search_notes`, `save_note`, and `add_task`. It did not
|
| 113 |
+
test code editing.
|
| 114 |
|
| 115 |
+
## Historical Same-Repo Benchmark Artifacts
|
| 116 |
|
| 117 |
+
The following benchmark artifacts are from the earlier v6 line and the local
|
| 118 |
+
Q3_K_M baseline. They are included as historical same-project measurements, not
|
| 119 |
+
as new v6.1 measurements.
|
| 120 |
+
|
| 121 |
+
| Artifact set | ARC-Challenge | HellaSwag | MMLU-Redux | HumanEval note |
|
| 122 |
+
|---|---:|---:|---:|---|
|
| 123 |
+
| `q3km_baseline_*` | 95.2218 | 86.5664 | 73.6667 | `q3km_baseline_humaneval_results.json`: 62.2 pass@1 |
|
| 124 |
+
| `cerebellum_v6_*` | 95.5631 | 84.55 | 71.3333 | v6 HumanEval artifacts are retained but marked for audit in local notes |
|
| 125 |
|
| 126 |
+
For Gemma 4 HumanEval/EvalPlus, the local protocol now uses chat completions,
|
| 127 |
+
not raw completions:
|
| 128 |
+
|
| 129 |
+
```text
|
| 130 |
+
llama-server --jinja --reasoning auto
|
| 131 |
+
chat_template_kwargs: {"enable_thinking": false}
|
| 132 |
+
thinking_budget_tokens: 0
|
| 133 |
+
BENCH_WORKERS=1
|
| 134 |
```
|
| 135 |
|
| 136 |
+
## Files and Provenance
|
| 137 |
|
| 138 |
+
Main v6.1 GGUF:
|
| 139 |
+
|
| 140 |
+
```text
|
| 141 |
+
source base: google/gemma-4-26B-A4B-it
|
| 142 |
+
quantization family: mixed-precision GGUF
|
| 143 |
+
recipe lineage: Cerebellum v6 tensor allocation
|
| 144 |
```
|
| 145 |
|
| 146 |
+
Matching mmproj:
|
| 147 |
+
|
| 148 |
+
```text
|
| 149 |
+
gemma-4-26b-a4b-it.mmproj.gguf
|
| 150 |
+
```
|
| 151 |
|
| 152 |
+
## Notes
|
| 153 |
|
| 154 |
+
- The 2026-05-22 tests were run on local `llama-server`.
|
| 155 |
+
- The opencode coding-agent test is not used as a model-card result. In one
|
| 156 |
+
internal White and Black project run, the model connected through the harness
|
| 157 |
+
and ran a Godot test, then produced malformed edit-tool calls.
|
| 158 |
+
- The creative-writing checks are smoke tests plus mechanical checks, not a
|
| 159 |
+
human preference benchmark.
|
| 160 |
+
- The non-coding tool checks use mocked tools and fixed task definitions.
|
| 161 |
|
| 162 |
## Credits
|
| 163 |
|
| 164 |
+
- Base model: Google Gemma Team, `google/gemma-4-26B-A4B-it`
|
| 165 |
+
- GGUF/runtime: llama.cpp
|
| 166 |
+
- Quantization and local test artifacts: deucebucket Cerebellum workflow
|
|
|