Commit 9935c1b (verified) by deucebucket
1 Parent(s): 9a33cd5

docs: update templatefix test notes

  1. README.md +105 -195
README.md CHANGED
@@ -3,11 +3,11 @@ license: gemma
3
  library_name: gguf
4
  base_model: google/gemma-4-26B-A4B-it
5
  base_model_relation: quantized
6
- model_name: Gemma-4-26B-A4B-it-Cerebellum-v6-GGUF
7
  model_creator: google
8
  model_type: gemma4
9
  quantized_by: deucebucket
10
- pipeline_tag: image-text-to-text
11
  tags:
12
  - GGUF
13
  - gemma4
@@ -18,239 +18,149 @@ tags:
18
  - imatrix
19
  - moe
20
  - 3-bit
21
- - conversational
22
- - multimodal
23
- - vision
24
  ---
25
 
26
- # Gemma 4 26B-A4B-it Cerebellum v6 GGUF
27
 
28
- > **Numbers under audit (2026-05-08)** — an internal review found the v6 benchmark numbers below need to be re-measured against the same protocol used for v1–v4. A clean re-run with audited wrong-answers and per-question JSONLs is underway. The GGUF file itself is unchanged — this is a measurement issue, not a model issue. Treat the table as preliminary until corrected numbers replace it.
 
29
 
30
- Cerebellum v6 is an ablation-guided mixed-precision GGUF quantization of [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it).
31
 
32
- This is a 26B-parameter MoE model with 4B active parameters per token, 128 experts per layer, and 30 layers. This release uses tensor-level precision overrides selected from 140+ ablation experiments across six internal iterations, including per-layer MoE router surgery.
33
 
34
- **This model supports vision** when used with the included mmproj file. See [Vision Support](#vision-support) below.
 
 
35
 
36
- ## At a Glance
37
-
38
- | | |
39
- |---|---|
40
- | **File** | `gemma-4-26B-A4B-it-cerebellum-v6.gguf` |
41
- | **mmproj** | `mmproj-google_gemma-4-26B-A4B-it-f16.gguf` |
42
- | **Size** | 11.7 GB (backbone) + 1.14 GB (mmproj) |
43
- | **Base model** | `google/gemma-4-26B-A4B-it` |
44
- | **Base quant** | Q3_K_M with [bartowski's imatrix](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF) |
45
- | **Format** | GGUF, mixed precision |
46
- | **Test hardware** | RTX 3090, llama.cpp |
47
-
48
- ## Benchmarks
49
-
50
- | Benchmark | Result |
51
- |-----------|:------:|
52
- | WikiText PPL | 12,054 |
53
- | HumanEval pass@1 | 72.0% |
54
- | ARC-Challenge | 95.6% |
55
- | HellaSwag | 84.7% |
56
- | MMLU-Redux | 71.2% |
57
-
58
- All results measured locally on an RTX 3090 with llama.cpp. PPL was measured on the WikiText-2 test set with 2048 context and 128 chunks.
59
-
60
- PPL is high in absolute terms for this model. This appears consistent across Gemma 4 26B quant levels tested locally and may reflect the model's MoE routing behavior on WikiText specifically.
61
-
62
- ## What Changed: v1 Through v6
63
-
64
- Each version added a new layer of ablation data. The method is always the same: change one thing, measure PPL, keep it only if it helps.
65
-
66
- | Version | PPL | HumanEval | What Changed |
67
- |---------|-----|-----------|-------------|
68
- | v1 | 20,614 | 65.2% | Group-level ablation: 5 tensor groups tested at Q2_K |
69
- | v2 | 19,826 | 65.9% | + attn_q per-layer ablation (30 layers tested, 9 promoted to Q5_K) |
70
- | v3 | 19,826 | 67.1% | + PLE protection (norms/scales forced to F32) |
71
- | v4 | 12,614 | 69.5% | + ffn_up per-layer ablation + precision rebalance |
72
- | v5 | 12,356 | 71.3% | + attn_k reverse ablation (30 layers tested, 7 promoted to Q3_K) |
73
- | **v6** | **12,054** | **72.0%** | + MoE router surgery: layer 8 ffn_gate_inp F32→Q8_0 |
74
-
75
- ## How Cerebellum Works
76
-
77
- Cerebellum assigns quantization precision per tensor based on measured impact. Each tensor group and individual layer is tested by changing its precision and measuring perplexity. Only changes that improve or maintain quality are kept.
78
-
79
- ### Group Ablation
80
-
81
- Each tensor category was tested at Q2_K and measured by PPL impact:
82
-
83
- | Group | Tensors | PPL Delta | Action |
84
- |-------|---------|-----------|--------|
85
- | attn_q | 30 | +13.4% | Per-layer testing (9 layers need Q5_K) |
86
- | ffn_gate | 30 | -1.2% | Left at Q3_K |
87
- | expert_gate_up | 30 | -5.5% | Set to Q2_K |
88
- | attn_k | 30 | -12.1% | Per-layer testing (7 layers benefit from Q3_K) |
89
- | ffn_up | 30 | -18.2% | Set to Q2_K |
90
-
91
- Three of five tested groups had lower PPL at Q2_K — meaning Q3_K_M was using bits on tensors that don't need them.
92
-
93
- ### Layer Ablation
94
-
95
- Groups with mixed results were tested per layer:
96
-
97
- - **attn_q**: All 30 layers tested individually at Q2_K. 9 layers exceeded the sensitivity threshold and stay at Q5_K. The other 21 tolerate Q2_K.
98
- - **attn_k**: All 30 layers tested individually. 7 layers showed PPL improvement when promoted from Q2_K to Q3_K (layer 23: -3.8%, layer 18: -2.8%). 4 layers (5, 11, 16, 29) were confirmed better at Q2_K.
99
-
100
- ### MoE Router Surgery (New in v6)
101
-
102
- llama-quantize ignores `--tensor-type-file` overrides for `ffn_gate_inp.weight` (MoE router) tensors. We built [gguf_tensor_surgery.py](https://github.com/deucebucket/osmosis/blob/master/scripts/gguf_tensor_surgery.py) to recast individual tensors directly in the GGUF file.
103
-
104
- All 30 router layers were tested individually at Q8_0 (F32→Q8_0):
105
-
106
- | Layer | PPL | Delta | Category |
107
- |-------|------|-------|----------|
108
- | 8 | 12,054 | -2.4% | Best universal candidate |
109
- | 10 | 11,872 | -3.9% | Best PPL but regresses HumanEval (-9.7%) |
110
- | 6 | 11,988 | -3.0% | Win (not stacked — routing compensation) |
111
- | 9 | 12,044 | -2.5% | Win (not stacked) |
112
- | 12 | 12,041 | -2.5% | Win (not stacked) |
113
- | 23 | 12,052 | -2.5% | Win (not stacked) |
114
- | 0 | 12,974 | +5.0% | Sensitive |
115
- | 1 | 13,525 | +9.5% | Very sensitive |
116
- | 2 | 13,239 | +7.1% | Sensitive |
117
- | 4 | 13,047 | +5.6% | Sensitive |
118
-
119
- **Why layer 8 and not layer 10:** Layer 10 had the best PPL improvement (-3.9%), but full HumanEval testing showed it regresses code generation from 71.3% to 61.6%. Layer 10's router controls routing to code-relevant experts — degrading it hurts coding while helping general perplexity. Layer 8 improves PPL (-2.4%) AND HumanEval (+0.7%) with no regressions on any benchmark.
120
-
121
- **Router stacking doesn't work:** Combined demotion of even the top 3 layers worsens PPL vs baseline. The model compensates for one degraded router but not multiple simultaneously. This is a routing compensation effect specific to MoE architectures.
122
-
123
- **Precision curve for layer 8's router:**
124
-
125
- | Precision | PPL | Delta |
126
- |-----------|------|-------|
127
- | F32 (default) | 12,356 | — |
128
- | Q8_0 | 12,054 | -2.4% |
129
- | Q4_0 | 12,355 | ~0% |
130
- | Q6_K | 14,317 | +15.9% |
131
- | Q2_K | 14,482 | +17.2% |
132
-
133
- Q8_0 is the only precision that improves PPL. K-quant formats (Q6_K, Q2_K) use 256-element super-blocks with sub-block scales — this structure disrupts the router's fine-grained expert selection. Q8_0's simpler per-block rounding acts as beneficial regularization.
134
-
135
- ### Final Precision Map (v6)
136
-
137
- | Tensor Type | Precision | Count | Rationale |
138
- |-------------|-----------|-------|-----------|
139
- | attn_q (9 sensitive layers) | Q5_K | 9 | Layer-validated critical |
140
- | attn_q (remaining) | Q2_K | 21 | Group-level demotable |
141
- | attn_k (7 promoted layers) | Q3_K | 7 | Reverse ablation: improve when promoted |
142
- | attn_k (remaining) | Q2_K | 23 | Group-level demotable |
143
- | ffn_up | Q2_K | 30 | Group PPL delta: -18.2% |
144
- | expert_gate_up | Q2_K | 30 | Group PPL delta: -5.5% |
145
- | ffn_gate | Q3_K | 30 | Tolerant (-1.2%) |
146
- | ffn_gate_inp layer 8 (router) | Q8_0 | 1 | Per-layer surgery: -2.4% PPL, +0.7% HumanEval |
147
- | ffn_gate_inp (router, other) | F32 | 29 | Group PPL delta: +30.7% when crushed |
148
- | Norms, scales | F32 | 392 | Structural — always full precision |
149
 
150
- 91 tensor-level overrides + 1 surgical router recast on top of Q3_K_M base.
 
 
151
 
152
- ## Vision Support
153
 
154
- This model is multimodal — it can process images alongside text. Vision requires two files used together:
155
 
156
- - `gemma-4-26B-A4B-it-cerebellum-v6.gguf`: the text backbone (this file)
157
- - `mmproj-google_gemma-4-26B-A4B-it-f16.gguf` — the vision encoder + projector (1.14 GB)
158
 
159
- The image token `<|image|>` (token ID 258880) is already in the vocabulary. No metadata changes needed.
 
 
 
 
 
160
 
161
- ### Usage with llama-server
162
 
163
  ```bash
164
  llama-server \
165
- -m gemma-4-26B-A4B-it-cerebellum-v6.gguf \
166
- --mmproj mmproj-google_gemma-4-26B-A4B-it-f16.gguf \
 
 
 
 
 
 
167
  --jinja \
168
- --reasoning off \
169
- --reasoning-budget 0 \
170
- -ngl 99 \
171
- -c 4096
172
  ```
173
 
174
- **Required flags:**
175
- - `--mmproj` — loads the vision encoder. The mmproj filename starts with `mmproj-` so it also works with `--mmproj-auto` auto-download if placed in the same directory.
176
- - `--jinja` — enables the Gemma 4 chat template (embedded in the GGUF; required for correct formatting)
177
- - `--reasoning off --reasoning-budget 0` — disables thinking mode which can cause infinite loops without dedicated reasoning tokens
178
 
179
- ### Usage with curl
180
-
181
- ```bash
182
- curl http://localhost:8080/v1/chat/completions \
183
- -H "Content-Type: application/json" \
184
- -d '{
185
- "model": "gemma4-cerebellum",
186
- "messages": [
187
- {
188
- "role": "user",
189
- "content": [
190
- {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
191
- {"type": "text", "text": "What is shown in this image?"}
192
- ]
193
- }
194
- ]
195
- }'
196
  ```
197
 
198
- ### Usage with ollama
199
 
200
- Create a `Modelfile`:
201
 
202
- ```dockerfile
203
- FROM ./gemma-4-26B-A4B-it-cerebellum-v6.gguf
204
- FROM ./mmproj-google_gemma-4-26B-A4B-it-f16.gguf
205
- TEMPLATE {{ .Prompt }}
 
206
  ```
207
 
208
- Then:
209
- ```bash
210
- ollama create gemma4-cerebellum -f Modelfile
211
- ollama run gemma4-cerebellum
 
212
  ```
213
 
214
- ### Technical notes
215
 
216
- - The vision encoder has 27 layers with 1152 hidden dimension, projecting to 2816 (matching the text model's embedding dimension).
217
- - Image resolution: native input size 224×224, patches of 16×16. Dynamic resolution supported.
218
- - The mmproj was converted using llama.cpp's `convert_hf_to_gguf.py` from the original Google model and is redistributed under the Apache 2.0 license. Conversion credit: [bartowski](https://huggingface.co/bartowski).
219
- - Vision works out of the box: no special tokens, metadata edits, or re-quantization needed.
 
 
220
 
221
- ## Usage
 
 
222
 
223
- ### llama.cpp
224
 
225
- ```bash
226
- # Text only
227
- ./llama-server -m gemma-4-26B-A4B-it-cerebellum-v6.gguf -ngl 99 -c 4096
 
 
 
 
 
228
 
229
- # With vision
230
- ./llama-server -m gemma-4-26B-A4B-it-cerebellum-v6.gguf --mmproj mmproj-google_gemma-4-26B-A4B-it-f16.gguf --jinja --reasoning off --reasoning-budget 0 -ngl 99 -c 4096
 
 
 
 
 
 
231
  ```
232
 
233
- ### ollama
234
 
235
- ```bash
236
- ollama create gemma4-cerebellum -f Modelfile
237
- ollama run gemma4-cerebellum
 
 
 
238
  ```
239
 
240
- Fits in 24 GB VRAM at full GPU offload with room for 4K context.
 
 
 
 
241
 
242
- ## Technical Details
243
 
244
- - **Architecture**: Gemma 4 26B (26B total params, 4B active per token, 128 experts/layer, 30 layers)
245
- - **Base quant**: Q3_K_M with bartowski imatrix
246
- - **Ablation experiments**: 140+ across 6 iterations (including 30-layer router surgery)
247
- - **Quantizer**: llama.cpp `llama-quantize` with `--tensor-type-file` overrides + `gguf_tensor_surgery.py` for router recast
248
- - **Hardware**: RTX 3090 (24 GB VRAM)
249
- - **Vision**: See [Vision Support](#vision-support) for details
 
250
 
251
  ## Credits
252
 
253
- - **Base model**: [Google Gemma Team](https://huggingface.co/google/gemma-4-26B-A4B-it)
254
- - **Imatrix**: [bartowski](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF)
255
- - **mmproj conversion**: [bartowski](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF)
256
- - **Method & quantization**: [deucebucket/osmosis](https://github.com/deucebucket/osmosis) — Cerebellum pipeline
 
3
  library_name: gguf
4
  base_model: google/gemma-4-26B-A4B-it
5
  base_model_relation: quantized
6
+ model_name: Gemma-4-26B-A4B-it-Cerebellum-v6.1-templatefix-GGUF
7
  model_creator: google
8
  model_type: gemma4
9
  quantized_by: deucebucket
10
+ pipeline_tag: text-generation
11
  tags:
12
  - GGUF
13
  - gemma4
 
18
  - imatrix
19
  - moe
20
  - 3-bit
21
+ - templatefix
 
 
22
  ---
23
 
24
+ # Gemma 4 26B-A4B-it Cerebellum GGUF
25
 
26
+ This repository contains GGUF builds derived from
27
+ `google/gemma-4-26B-A4B-it`.
28
 
29
+ ## 2026-05-22 Update
30
 
31
+ Added:
32
 
33
+ ```text
34
+ gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix.gguf
35
+ sha256: d24229facdef8360a7ffa8b37a50e1de636b9139a5eba0efe899828e45ae7989
36
 
37
+ gemma-4-26b-a4b-it.mmproj.gguf
38
+ sha256: b762c43119ebdc3e3c36d929d958e827fac35b03278dda9203f87131aee1f185
39
+ ```
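One way to confirm that a download matches the checksums listed above is to hash the file locally. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of` and `verify`, and the assumption that both files sit in one directory, are illustrative and not part of the repository.

```python
import hashlib
from pathlib import Path

# Checksums copied from the 2026-05-22 update above.
EXPECTED = {
    "gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix.gguf":
        "d24229facdef8360a7ffa8b37a50e1de636b9139a5eba0efe899828e45ae7989",
    "gemma-4-26b-a4b-it.mmproj.gguf":
        "b762c43119ebdc3e3c36d929d958e827fac35b03278dda9203f87131aee1f185",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a multi-GB GGUF is never fully loaded in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Return {filename: True/False/None}; None means the file is not present."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = sha256_of(path) == expected if path.exists() else None
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        status = {True: "OK", False: "MISMATCH", None: "missing"}[ok]
        print(f"{status:8}  {name}")
```

Hashing in fixed-size chunks keeps memory use flat regardless of file size, which matters for a backbone file in the 10 GB range.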
40
 
41
+ The v6.1 file keeps the v6 tensor allocation and updates GGUF/runtime-facing
42
+ metadata for Gemma 4 chat-template use. The update was tested with
43
+ `llama-server --jinja --reasoning auto` and request-level no-thinking controls.
44
 
45
+ Older files in this repository are retained for reproducibility.
46
 
47
+ ## Tested Runtime
48
 
49
+ Runtime used for the 2026-05-22 templatefix checks:
 
50
 
51
+ ```text
52
+ llama.cpp fork: https://github.com/deucebucket/llama.cpp
53
+ branch: cerebellum/gemma4-runtime-fixes
54
+ fork commit: ded491334 fix: harden Gemma 4 server budgets
55
+ base build: b8930-59fa0b455
56
+ ```
57
 
58
+ Server invocation used locally:
59
 
60
  ```bash
61
  llama-server \
62
+ --model gemma-4-26B-A4B-it-cerebellum-v6.1-templatefix.gguf \
63
+ --mmproj gemma-4-26b-a4b-it.mmproj.gguf \
64
+ --n-gpu-layers 99 \
65
+ --ctx-size 65536 \
66
+ --parallel 1 \
67
+ --flash-attn on \
68
+ --cache-type-k q8_0 \
69
+ --cache-type-v q8_0 \
70
  --jinja \
71
+ --reasoning auto \
72
+ --media-path /tmp/
 
 
73
  ```
74
 
75
+ Standard no-thinking requests included these fields:
 
 
 
76
 
77
+ ```json
78
+ {
79
+ "chat_template_kwargs": {"enable_thinking": false},
80
+ "thinking_budget_tokens": 0
81
+ }
82
  ```
83
 
84
+ Bounded-thinking smoke requests used `thinking_budget_tokens: 128`.
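The request fields above can also be assembled programmatically. The sketch below builds both request shapes described in this section. It assumes the server listens on `http://localhost:8080` and exposes the OpenAI-compatible `/v1/chat/completions` route; the helper names `build_request` and `send` are hypothetical. For the bounded case only `thinking_budget_tokens` is set, because the notes do not record which `chat_template_kwargs` accompanied it. The `thinking_budget_tokens` field was exercised against the fork listed under Tested Runtime and may not be recognized by other builds.

```python
import json
import urllib.request

# Assumption: default llama-server address; adjust to your --host/--port.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_request(prompt, thinking_budget=0):
    """Build a chat-completions payload.

    thinking_budget=0 reproduces the no-thinking fields shown above.
    A positive value (the smoke tests used 128) sets only the budget.
    """
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "thinking_budget_tokens": thinking_budget,
    }
    if thinking_budget == 0:
        payload["chat_template_kwargs"] = {"enable_thinking": False}
    return payload


def send(payload, url=SERVER_URL, timeout=300):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(json.dumps(build_request("Write a haiku about rain."), indent=2))
    print(json.dumps(build_request("Plan a short story.", thinking_budget=128), indent=2))
```

Running the file directly only prints the two payloads; `send` is left uncalled so the sketch works without a running server.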
85
 
86
+ ## 2026-05-22 Templatefix Test Artifacts
87
 
88
+ Creative-writing smoke files:
89
+
90
+ ```text
91
+ creative_eval_20260522/regular_v6_1_templatefix_creative_summary.json
92
+ creative_eval_20260522/regular_v6_1_templatefix_creative_rerun_longcaps_summary.json
93
  ```
94
 
95
+ Non-coding tool-use files:
96
+
97
+ ```text
98
+ agentic_eval_20260522/README.md
99
+ agentic_eval_20260522/regular_v6_1_noncoding_agentic_tools_strict_summary.json
100
  ```
101
 
102
+ Observed 2026-05-22 results from those artifacts:
103
 
104
+ | Area | Harness | Observed result |
105
+ |---|---|---|
106
+ | No-thinking output channel | six creative prompts | `reasoning_len=0` in recorded outputs |
107
+ | Template leakage markers | six creative prompts | no `<think>` marker or template marker recorded by checker |
108
+ | Creative long-cap rerun | four prompts rerun after initial length caps | four stop finishes in rerun summary |
109
+ | Non-coding tool workflow | three strict OpenAI-style tool tasks | `schedule_strict`, `release_notes_strict`, `creative_brief_strict` listed in `pass_cases` |
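The mechanical checks in the table can be approximated in a few lines. This sketch is not the checker that produced the recorded artifacts. It assumes an OpenAI-style response in which the server places any reasoning text in `message.reasoning_content`, and every marker other than `<think>` is a guess at typical template tokens rather than the original checker's list.

```python
# Assumption: apart from "<think>", these markers are illustrative guesses,
# not the list used by the original checker.
LEAK_MARKERS = ("<think>", "</think>", "<start_of_turn>", "<end_of_turn>")


def check_choice(choice):
    """Summarize one entry of an OpenAI-style `choices` list."""
    message = choice.get("message", {})
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    return {
        "reasoning_len": len(reasoning),
        "leaked_markers": [m for m in LEAK_MARKERS if m in content],
        # "stop" means the model ended on its own; "length" means it hit the cap.
        "stopped_cleanly": choice.get("finish_reason") == "stop",
    }


def passes(choice):
    """True when there is no reasoning text, no leaked marker, and a clean stop."""
    report = check_choice(choice)
    return (
        report["reasoning_len"] == 0
        and not report["leaked_markers"]
        and report["stopped_cleanly"]
    )


if __name__ == "__main__":
    sample = {
        "finish_reason": "stop",
        "message": {"role": "assistant", "content": "Rain taps the window."},
    }
    print(check_choice(sample))
```

Separating `finish_reason` from the leak checks mirrors the table: a response can be free of markers yet still be cut off by a length cap, which is what the long-cap rerun addressed.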
110
 
111
+ The non-coding tool harness used mock tools named `list_calendar`,
112
+ `create_calendar_hold`, `search_notes`, `save_note`, and `add_task`. It did not
113
+ test code editing.
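For readers unfamiliar with OpenAI-style tool calling, the sketch below shows how mock tools with the names listed above could be declared and how a returned tool call could be checked strictly. Only the five tool names come from the harness. Every parameter schema, description, and helper name here is hypothetical.

```python
import json


def make_tool(name, description, properties, required):
    """Wrap a JSON-schema parameter spec in the OpenAI `tools` envelope."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Tool names are from the harness; the schemas are invented for illustration.
TOOLS = [
    make_tool("list_calendar", "List events on a date.",
              {"date": {"type": "string"}}, ["date"]),
    make_tool("create_calendar_hold", "Create a tentative hold.",
              {"start": {"type": "string"}, "title": {"type": "string"}},
              ["start", "title"]),
    make_tool("search_notes", "Search saved notes.",
              {"query": {"type": "string"}}, ["query"]),
    make_tool("save_note", "Save a note.",
              {"title": {"type": "string"}, "body": {"type": "string"}},
              ["title", "body"]),
    make_tool("add_task", "Add a to-do item.",
              {"text": {"type": "string"}}, ["text"]),
]

REQUIRED = {t["function"]["name"]: t["function"]["parameters"]["required"]
            for t in TOOLS}


def validate_tool_call(tool_call):
    """Strict check: known tool, arguments parse as a JSON object, required keys present."""
    function = tool_call.get("function", {})
    name = function.get("name")
    if name not in REQUIRED:
        return False
    try:
        arguments = json.loads(function.get("arguments", ""))
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(arguments, dict) and all(k in arguments for k in REQUIRED[name])
```

The `TOOLS` list is what would be passed as the `tools` field of a chat-completions request, and `validate_tool_call` would be applied to each entry of `message.tool_calls` in the response.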
114
 
115
+ ## Historical Same-Repo Benchmark Artifacts
116
 
117
+ The following benchmark artifacts are from the earlier v6 line and the local
118
+ Q3_K_M baseline. They are included as historical same-project measurements, not
119
+ as new v6.1 measurements.
120
+
121
+ | Artifact set | ARC-Challenge | HellaSwag | MMLU-Redux | HumanEval note |
122
+ |---|---:|---:|---:|---|
123
+ | `q3km_baseline_*` | 95.2218 | 86.5664 | 73.6667 | `q3km_baseline_humaneval_results.json`: 62.2 pass@1 |
124
+ | `cerebellum_v6_*` | 95.5631 | 84.55 | 71.3333 | v6 HumanEval artifacts are retained but marked for audit in local notes |
125
 
126
+ For Gemma 4 HumanEval/EvalPlus, the local protocol now uses chat completions,
127
+ not raw completions:
128
+
129
+ ```text
130
+ llama-server --jinja --reasoning auto
131
+ chat_template_kwargs: {"enable_thinking": false}
132
+ thinking_budget_tokens: 0
133
+ BENCH_WORKERS=1
134
  ```
135
 
136
+ ## Files and Provenance
137
 
138
+ Main v6.1 GGUF:
139
+
140
+ ```text
141
+ source base: google/gemma-4-26B-A4B-it
142
+ quantization family: mixed-precision GGUF
143
+ recipe lineage: Cerebellum v6 tensor allocation
144
  ```
145
 
146
+ Matching mmproj:
147
+
148
+ ```text
149
+ gemma-4-26b-a4b-it.mmproj.gguf
150
+ ```
151
 
152
+ ## Notes
153
 
154
+ - The 2026-05-22 tests were run on local `llama-server`.
155
+ - The opencode coding-agent test is not used as a model-card result. In one
156
+ internal White and Black project run, the model connected through the harness
157
+ and ran a Godot test, then produced malformed edit-tool calls.
158
+ - The creative-writing checks are smoke tests plus mechanical checks, not a
159
+ human preference benchmark.
160
+ - The non-coding tool checks use mocked tools and fixed task definitions.
161
 
162
  ## Credits
163
 
164
+ - Base model: Google Gemma Team, `google/gemma-4-26B-A4B-it`
165
+ - GGUF/runtime: llama.cpp
166
+ - Quantization and local test artifacts: deucebucket Cerebellum workflow