Mixed Precision GGUF layer quantization of gemma-4-26B-A4B-it-qat-q4_0-unquantized by Google

Original model: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-unquantized

The hybrid quant employs different quantization levels on a per layer basis to enable both high performance and small file size at the same time. The quants employed are all K to avoid slow CPU or older GPU processing of IQ quants. For this file the Q4_P_H layer quants using extended K layer definitions are as follows:

  LAYER_TYPES='[
   ["A","attn","Q","attn_q","K","attn_k","V","attn_v","O","attn_o","S","ssm","F","ffn","G","ffn_g","U","ffn_u","D","ffn_d"],
   ["MAP","VOD","0","QN_K","2","Q2_K","3","Q3_K","4","Q4_0","5","Q5_0","6","Q5_1","8","Q8_0","h","F16","f","F32"],
   [0 ,"Q5_K_655"],[1 ,"Q5_K_555"],[2 ,"Q4_K_555"],[3 ,"Q4_K_555"],[4 ,"Q4_K_544"],[5 ,"Q4_K_555"],[6 ,"Q4_K_544"],[7 ,"Q4_K_544"],
   [8 ,"Q4_K_544"],[9 ,"Q4_K_544"],[10,"Q4_K_544"],[11,"Q4_K_555"],[12,"Q4_K_544"],[13,"Q4_K_544"],[14,"Q4_K_544"],[15,"Q4_K_544"],
   [16,"Q4_K_654"],[17,"Q4_K_655"],[18,"Q4_K_654"],[19,"Q4_K_654"],[20,"Q4_K_654"],[21,"Q4_K_654"],[22,"Q4_K_654"],[23,"Q4_K_666"],
   [24,"Q4_K_655"],[25,"Q4_K_665"],[26,"Q5_K_665"],[27,"Q5_K_666"],[28,"Q5_K_668"],[29,"Q6_K_866"]
   ]'
   FLAGS="--token-embedding-type Q5_1 --output-tensor-type Q5_1 --layer-types-high --tensor-pad [[704,768],[1408,1536,2],[2112,2304]] --override-kv gemma4.expert_feed_forward_length=int:768  --override-kv gemma4.feed_forward_length=int:2304 

The FFN tensors in this model are padded to enable use of K quants. The quant was tested for very strong performance over a small set of curated reasoning prompts and sized to approximately Q4_K_M bit efficiency.

Comparison:

Quant size PPL Comment
Q4_0 14.4e9 - google q4_0 quant, untested
Q4_P_H 16.8e9 13.0 modified PPL, ~ Q4_K_M size, all legacy + K quants

Usage:

gemma 4 26B A4B it qat q4_0 unquantized is a vision capable moe RL model. It can be used together with its multimedia projector layers to process images and text inputs and generate text outputs. The mmproj file is made available in this repository.

Thinking:

By default the model will not create a RL reasoning block and just outputs

<|channel>thought
<channel|>

at the start of gen. To get it to fill in the think block use a system prompt with:

<|think|>

as the first token. This is a special token in the model vocab and must be tokenized as such to work. No other text in the system prompt besides the think token is needed to get it to fill in the RL block though other text can be added if desired.

The model was found to be highly capable on reasoning tasks when skipping think block.

Running:

Use of speculation with the model is not recommended since it is a moe with 4B activated experts meaning it can be efficiently run on CPU, but CPU does not have enough parallel hardware to leverage processing many tokens in a batch at once. Thus gen rate will go down if speculation is used with experts processed on CPU. If the model is run fully on GPU speculation can be used with the model. A recommended low overhead speculator is gemma-3-270m-it-256k. To use this speculator the inference platform must support dynamic vocab translation between draft and target.

The model can be run fully offloaded into 24G VRAM, or with CPU and expert layer offload via config OT="-ot exps=CPU -ngl 99". Because the model is a 4B active moe the CPU expert offload still gives good gen rate with very large context available, however prompt processing will be very slow for large contexts with experts offloaded to CPU.

On a 9900k/4070 or 2x 4070 setup (1 RPC) approx performance for the Q4_E_H quant is:

CONFIG (no vision tower) QKV NKV gen tps pp tps (batch 128)
4070+ 9900k CPU exp offload F16 256k 23 ~100
2x4070 (RPC) F16 128k+ 70 ~700
2x4070 (RPC) + spec , ND=2 F16 128k+ 77

Nothink example (greedy deterministic sampling):

lm "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
<|channel>thought
<channel|>Technically, you can do either, but the best choice depends on your current situation:

**Drive if:**
* **You are already in the car:** If you are sitting in the driver's seat, just drive. It’s much easier than getting out, locking the door, walking 50 meters, and then walking back.
* **You have heavy supplies:** If you are carrying buckets, heavy cleaning chemicals, or a vacuum, driving is much more efficient.
* **You are wearing formal clothes:** If you are dressed for work or an event, you might want to avoid the walk.

**Walk if:**
* **You are currently on foot:** If you are walking down the street and see the car wash, it’s easier to walk there and then walk back to your car.
* **You want to save a tiny bit of gas:** 50 meters is about 165 feet. It’s a very short distance, so walking is better for the environment and your fuel tank.
* **You want to stretch your legs:** If you've been sitting for a long time, a 1-minute walk is a good way to move around.

**The Verdict:**
If you are already in the car, **drive**. If you are standing next to the car on the sidewalk, **drive** (it's only 50 meters!).

Think example illustrating introspections kicking in for a significantly improved gen (greedy deterministic sampling):

SYSTEM="<|think|>" SPECIAL=1 lm "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
<|channel>thought

*   Goal: Wash the car.
*   Distance to car wash: 50 meters.
*   Question: Should I walk or drive?

    *   *Option A: Driving*
        *   Pros: Faster (marginally), less physical effort.
        *   Cons: Uses fuel/electricity, adds wear and tear to the car, takes time to start/park/maneuver, potentially more pollution, might be overkill for 50m.

    *   *Option B: Walking*
        *   Pros: Zero fuel cost, zero wear and tear, good for light exercise, environmentally friendly, easy to park the car *at* the destination and walk back if needed (though usually, you drive to the wash, wash it, and drive away).
        *   Cons: Takes slightly longer (though 50m is negligible), might be inconvenient if carrying heavy cleaning supplies (but the user is going *to* the wash, implying a facility).

    *   *Scenario 1: The car is currently dirty.*
        *   If you drive a dirty car 50 meters, you're just spreading dirt/water on the road.
        *   If you walk, you save fuel.

    *   *Scenario 2: The car is currently clean (just going for a wash for maintenance).*
        *   Driving 50m is a very short trip.

    *   *Scenario 3: Logistics.*
        *   If the car wash is a self-service bay, you *must* drive the car there to wash it.
        *   If it's a drive-through, you *must* drive.
        *   If it's a hand-wash where you leave the car, you drive there.

    *   *Wait, the prompt is a bit of a "logic trap" or a "common sense" check.*
    *   If you are washing the car, the car needs to be *at* the car wash.
    *   To get the car to the car wash, you **must drive it**.

    *   *Direct Answer:* You have to drive the car to the car wash (unless you are walking to a car wash to wash a *different* car, which is unlikely).
    *   *Nuance:* If the question implies "How should I get there to *start* the process?", the answer is still: you drive the car there because the car needs to be washed.

    *   *Wait, let's look at it from a different angle:* Is this a riddle?
    *   If you walk, the car stays where it is. If the car stays where it is, it doesn't get washed.
    *   Therefore, to wash the car, you must drive it.

    *   *Drafting the response:*
        1.  **The Logical Answer:** You must drive the car. If you walk, the car stays behind and doesn't get washed.
        2.  **The Practical Answer:** 50 meters is about a 1-minute walk or a 10-second drive. Since the car needs to be at the wash to be washed, driving is the only option.

    *   "You should **drive**."
    *   Reasoning: "If you walk, the car stays where it is, and you won't have a washed car!"

    *(Self-Correction: Is there any scenario where walking makes sense? Maybe if you are walking to a car wash to pick up a car you left there? No, the prompt says "I want to wash my car.")*

    *Final Conclusion:* You must drive the car to the car wash.

    *Wait, let's consider if the user is asking about the "trip" to the car wash.*
    If the car is already at the car wash? No, "The car wash is 50 meters away."
    If the user is walking to the car wash to *get* the car? No.

    *Let's provide a witty but logical answer.*

    1.  **The Logical Necessity:** You have to drive the car there.
    2.  **The "Riddle" aspect:** If you walk, the car doesn't move.

    *Let's keep it simple.*

    "You should **drive**—because if you walk, the car stays where it is, and you won't have a washed car!"<channel|>You should **drive**.

If you walk, the car stays where it is, and you won't have a washed car!

The model passed two long context tests with F16 QKV running on 2x4070 showing usable prompt processing speed for 100k+ token prompts.

Vision:

The model was tested in vision mode on a couple pretty tough bird ID image and found to exhibit poor performance in both think and nonthink mode, but did at least consider the correct bird on one of the test images. As a comparision gemma3 27B went 1 for 2 and Qwen3 27B completely aces these tough (quite blurry images of a small bird) ID tests. The model did a great job on some text based image prompts though.

Code:

The model did a pretty good job on all the code test prompts, however it is not in the same league as the Qwen3 and 3.5 models for code gen.

Llama.cpp inference/isssues:

The model cannot compute valid perplexity due to the instruct tune forcing it to generate

<|channel>thought

as assitant gen independent of previous prompt contents. To work around this problem a modifed perplexity is computed by overwriting the beginning of the perplexity chunk contents with the forced assistent gen as follows:

      # chunk is a string of text to eval perplexity on
      injects='model\n<|channel>thought\n<channel|>'
      chunk="${injects}${chunk:${#injects}}"

logprobs are skipped over the beginning part of the perplexity prompt using a modified llama.cpp downstream server to compute perplexity. Discussion at: https://github.com/ggml-org/llama.cpp/issues/21388#issuecomment-4185889930

Benchmarks:

A full set of both math and vision benchmarks for the model are given here: https://huggingface.co/spaces/steampunque/benchlm

Download the file from below:

Link Type Size/e9 B Notes
gemma-4-26B-A4B-it-qat-q4_0-unquantized.Q4_P_H.gguf Q4_P_H 16.8e9 B ~ Q4_K_M size, tensors padded to enable K quant use on all layers
gemma-4-26B-A4B-it-qat-q4_0-unquantized.mmproj.gguf F16 1.2e9 B multimedia projector

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040

Downloads last month
229
GGUF
Model size
27B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF