deucebucket commited on
Commit
7a71962
Β·
verified Β·
1 Parent(s): 551a4f4

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +170 -0
README.md ADDED
@@ -0,0 +1,170 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: gemma
3
+ library_name: gguf
4
+ base_model: google/gemma-4-26B-A4B-it
5
+ base_model_relation: quantized
6
+ model_name: Gemma-4-26B-A4B-it-Cerebellum-v6-GGUF
7
+ model_creator: google
8
+ model_type: gemma4
9
+ quantized_by: deucebucket
10
+ pipeline_tag: text-generation
11
+ tags:
12
+ - GGUF
13
+ - gemma4
14
+ - gemma
15
+ - google
16
+ - quantized
17
+ - cerebellum
18
+ - imatrix
19
+ - moe
20
+ - 3-bit
21
+ - conversational
22
+ ---
23
+
24
+ # Gemma 4 26B-A4B-it β€” Cerebellum v6 GGUF
25
+
26
+ Cerebellum v6 is an ablation-guided mixed-precision GGUF quantization of [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it).
27
+
28
+ This is a 26B-parameter MoE model with 4B active parameters per token, 128 experts per layer, and 30 layers. This release uses tensor-level precision overrides selected from 140+ ablation experiments across six internal iterations, including per-layer MoE router surgery.
29
+
30
+ ## At a Glance
31
+
32
+ | | |
33
+ |---|---|
34
+ | **File** | `gemma-4-26B-A4B-it-cerebellum-v6.gguf` |
35
+ | **Size** | 11.7 GB |
36
+ | **Base model** | `google/gemma-4-26B-A4B-it` |
37
+ | **Base quant** | Q3_K_M with [bartowski's imatrix](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF) |
38
+ | **Format** | GGUF, mixed precision |
39
+ | **Test hardware** | RTX 3090, llama.cpp |
40
+
41
+ ## Benchmarks
42
+
43
+ | Benchmark | Result |
44
+ |-----------|:------:|
45
+ | WikiText PPL | 12,054 |
46
+ | HumanEval pass@1 | 72.0% |
47
+ | ARC-Challenge | 95.6% |
48
+ | HellaSwag | 84.7% |
49
+ | MMLU-Redux | 71.2% |
50
+
51
+ All results measured locally on an RTX 3090 with llama.cpp. PPL was measured on the WikiText-2 test set with 2048 context and 128 chunks.
52
+
53
+ PPL is high in absolute terms for this model. This appears consistent across Gemma 4 26B quant levels tested locally and may reflect the model's MoE routing behavior on WikiText specifically.
54
+
55
+ ## What Changed: v1 Through v6
56
+
57
+ Each version added a new layer of ablation data. The method is always the same: change one thing, measure PPL, keep it only if it helps.
58
+
59
+ | Version | PPL | HumanEval | What Changed |
60
+ |---------|-----|-----------|-------------|
61
+ | v1 | 20,614 | 65.2% | Group-level ablation: 5 tensor groups tested at Q2_K |
62
+ | v2 | 19,826 | 65.9% | + attn_q per-layer ablation (30 layers tested, 9 promoted to Q5_K) |
63
+ | v3 | 19,826 | 67.1% | + PLE protection (norms/scales forced to F32) |
64
+ | v4 | 12,614 | 69.5% | + ffn_up per-layer ablation + precision rebalance |
65
+ | v5 | 12,356 | 71.3% | + attn_k reverse ablation (30 layers tested, 7 promoted to Q3_K) |
66
+ | **v6** | **12,054** | **72.0%** | + MoE router surgery: layer 8 ffn_gate_inp F32β†’Q8_0 |
67
+
68
+ ## How Cerebellum Works
69
+
70
+ Cerebellum assigns quantization precision per tensor based on measured impact. Each tensor group and individual layer is tested by changing its precision and measuring perplexity. Only changes that improve or maintain quality are kept.
71
+
72
+ ### Group Ablation
73
+
74
+ Each tensor category was tested at Q2_K and measured by PPL impact:
75
+
76
+ | Group | Tensors | PPL Delta | Action |
77
+ |-------|---------|-----------|--------|
78
+ | attn_q | 30 | +13.4% | Per-layer testing (9 layers need Q5_K) |
79
+ | ffn_gate | 30 | -1.2% | Left at Q3_K |
80
+ | expert_gate_up | 30 | -5.5% | Set to Q2_K |
81
+ | attn_k | 30 | -12.1% | Per-layer testing (7 layers benefit from Q3_K) |
82
+ | ffn_up | 30 | -18.2% | Set to Q2_K |
83
+
84
+ Three of five tested groups had lower PPL at Q2_K β€” meaning Q3_K_M was using bits on tensors that don't need them.
85
+
86
+ ### Layer Ablation
87
+
88
+ Groups with mixed results were tested per layer:
89
+
90
+ - **attn_q**: All 30 layers tested individually at Q2_K. 9 layers exceeded the sensitivity threshold and stay at Q5_K. The other 21 tolerate Q2_K.
91
+ - **attn_k**: All 30 layers tested individually. 7 layers showed PPL improvement when promoted from Q2_K to Q3_K (layer 23: -3.8%, layer 18: -2.8%). 4 layers (5, 11, 16, 29) were confirmed better at Q2_K.
92
+
93
+ ### MoE Router Surgery (New in v6)
94
+
95
+ llama-quantize ignores `--tensor-type-file` overrides for `ffn_gate_inp.weight` (MoE router) tensors. We built [gguf_tensor_surgery.py](https://github.com/deucebucket/osmosis/blob/master/scripts/gguf_tensor_surgery.py) to recast individual tensors directly in the GGUF file.
96
+
97
+ All 30 router layers were tested individually at Q8_0 (F32β†’Q8_0):
98
+
99
+ | Layer | PPL | Delta | Category |
100
+ |-------|------|-------|----------|
101
+ | 8 | 12,054 | -2.4% | Best universal candidate |
102
+ | 10 | 11,872 | -3.9% | Best PPL but regresses HumanEval (-9.7%) |
103
+ | 6 | 11,988 | -3.0% | Win (not stacked β€” routing compensation) |
104
+ | 9 | 12,044 | -2.5% | Win (not stacked) |
105
+ | 12 | 12,041 | -2.5% | Win (not stacked) |
106
+ | 23 | 12,052 | -2.5% | Win (not stacked) |
107
+ | 0 | 12,974 | +5.0% | Sensitive |
108
+ | 1 | 13,525 | +9.5% | Very sensitive |
109
+ | 2 | 13,239 | +7.1% | Sensitive |
110
+ | 4 | 13,047 | +5.6% | Sensitive |
111
+
112
+ **Why layer 8 and not layer 10:** Layer 10 had the best PPL improvement (-3.9%), but full HumanEval testing showed it regresses code generation from 71.3% to 61.6%. Layer 10's router controls routing to code-relevant experts β€” degrading it hurts coding while helping general perplexity. Layer 8 improves PPL (-2.4%) AND HumanEval (+0.7%) with no regressions on any benchmark.
113
+
114
+ **Router stacking doesn't work:** Combined demotion of even the top 3 layers worsens PPL vs baseline. The model compensates for one degraded router but not multiple simultaneously. This is a routing compensation effect specific to MoE architectures.
115
+
116
+ **Precision curve for layer 8's router:**
117
+
118
+ | Precision | PPL | Delta |
119
+ |-----------|------|-------|
120
+ | F32 (default) | 12,356 | β€” |
121
+ | Q8_0 | 12,054 | -2.4% |
122
+ | Q4_0 | 12,355 | ~0% |
123
+ | Q6_K | 14,317 | +15.9% |
124
+ | Q2_K | 14,482 | +17.2% |
125
+
126
+ Q8_0 is the only precision that improves PPL. K-quant formats (Q6_K, Q2_K) use 256-element super-blocks with sub-block scales β€” this structure disrupts the router's fine-grained expert selection. Q8_0's simpler per-block rounding acts as beneficial regularization.
127
+
128
+ ### Final Precision Map (v6)
129
+
130
+ | Tensor Type | Precision | Count | Rationale |
131
+ |-------------|-----------|-------|-----------|
132
+ | attn_q (9 sensitive layers) | Q5_K | 9 | Layer-validated critical |
133
+ | attn_q (remaining) | Q2_K | 21 | Group-level demotable |
134
+ | attn_k (7 promoted layers) | Q3_K | 7 | Reverse ablation: improve when promoted |
135
+ | attn_k (remaining) | Q2_K | 23 | Group-level demotable |
136
+ | ffn_up | Q2_K | 30 | Group PPL delta: -18.2% |
137
+ | expert_gate_up | Q2_K | 30 | Group PPL delta: -5.5% |
138
+ | ffn_gate | Q3_K | 30 | Tolerant (-1.2%) |
139
+ | ffn_gate_inp layer 8 (router) | Q8_0 | 1 | Per-layer surgery: -2.4% PPL, +0.7% HumanEval |
140
+ | ffn_gate_inp (router, other) | F32 | 29 | Group PPL delta: +30.7% when crushed |
141
+ | Norms, scales | F32 | 392 | Structural β€” always full precision |
142
+
143
+ 91 tensor-level overrides + 1 surgical router recast on top of Q3_K_M base.
144
+
145
+ ## Usage
146
+
147
+ ```bash
148
+ # llama.cpp
149
+ ./llama-server -m gemma-4-26B-A4B-it-cerebellum-v6.gguf -ngl 99 -c 4096
150
+
151
+ # ollama
152
+ ollama create gemma4-cerebellum -f Modelfile
153
+ ollama run gemma4-cerebellum
154
+ ```
155
+
156
+ Fits in 24 GB VRAM at full GPU offload with room for 4K context.
157
+
158
+ ## Technical Details
159
+
160
+ - **Architecture**: Gemma 4 26B β€” 26B total params, 4B active per token, 128 experts/layer, 30 layers
161
+ - **Base quant**: Q3_K_M with bartowski imatrix
162
+ - **Ablation experiments**: 140+ across 6 iterations (including 30-layer router surgery)
163
+ - **Quantizer**: llama.cpp `llama-quantize` with `--tensor-type-file` overrides + `gguf_tensor_surgery.py` for router recast
164
+ - **Hardware**: RTX 3090 (24 GB VRAM)
165
+
166
+ ## Credits
167
+
168
+ - **Base model**: [Google Gemma Team](https://huggingface.co/google/gemma-4-26B-A4B-it)
169
+ - **Imatrix**: [bartowski](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF)
170
+ - **Method & quantization**: [deucebucket/osmosis](https://github.com/deucebucket/osmosis) β€” Cerebellum pipeline