Instructions to use steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF", filename="gemma-4-26B-A4B-it-qat-q4_0-unquantized.Q4_P_H.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0 # Run inference directly in the terminal: llama-cli -hf steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0 # Run inference directly in the terminal: llama-cli -hf steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0 # Run inference directly in the terminal: ./llama-cli -hf steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0 # Run inference directly in the terminal: ./build/bin/llama-cli -hf steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
Use Docker
docker model run hf.co/steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
- LM Studio
- Jan
- Ollama
How to use steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF with Ollama:
ollama run hf.co/steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
- Unsloth Studio
How to use steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF to start chatting
- Pi
How to use steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF with Docker Model Runner:
docker model run hf.co/steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
- Lemonade
How to use steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF:Q4_0
Run and chat with the model
lemonade run user.gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF-Q4_0
List all available models
lemonade list
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF to start chattingUsing HuggingFace Spaces for Unsloth
# No setup required# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF to start chattingMixed Precision GGUF layer quantization of gemma-4-26B-A4B-it-qat-q4_0-unquantized by Google
Original model: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-unquantized
The hybrid quant employs different quantization levels on a per layer basis to enable both high performance and small file size at the same time. The quants employed are all K to avoid slow CPU or older GPU processing of IQ quants. For this file the Q4_P_H layer quants using extended K layer definitions are as follows:
LAYER_TYPES='[
["A","attn","Q","attn_q","K","attn_k","V","attn_v","O","attn_o","S","ssm","F","ffn","G","ffn_g","U","ffn_u","D","ffn_d"],
["MAP","VOD","0","QN_K","2","Q2_K","3","Q3_K","4","Q4_0","5","Q5_0","6","Q5_1","8","Q8_0","h","F16","f","F32"],
[0 ,"Q5_K_655"],[1 ,"Q5_K_555"],[2 ,"Q4_K_555"],[3 ,"Q4_K_555"],[4 ,"Q4_K_544"],[5 ,"Q4_K_555"],[6 ,"Q4_K_544"],[7 ,"Q4_K_544"],
[8 ,"Q4_K_544"],[9 ,"Q4_K_544"],[10,"Q4_K_544"],[11,"Q4_K_555"],[12,"Q4_K_544"],[13,"Q4_K_544"],[14,"Q4_K_544"],[15,"Q4_K_544"],
[16,"Q4_K_654"],[17,"Q4_K_655"],[18,"Q4_K_654"],[19,"Q4_K_654"],[20,"Q4_K_654"],[21,"Q4_K_654"],[22,"Q4_K_654"],[23,"Q4_K_666"],
[24,"Q4_K_655"],[25,"Q4_K_665"],[26,"Q5_K_665"],[27,"Q5_K_666"],[28,"Q5_K_668"],[29,"Q6_K_866"]
]'
FLAGS="--token-embedding-type Q5_1 --output-tensor-type Q5_1 --layer-types-high --tensor-pad [[704,768],[1408,1536,2],[2112,2304]] --override-kv gemma4.expert_feed_forward_length=int:768 --override-kv gemma4.feed_forward_length=int:2304
The FFN tensors in this model are padded to enable use of K quants. The quant was tested for very strong performance over a small set of curated reasoning prompts and sized to approximately Q4_K_M bit efficiency.
Comparison:
| Quant | size | PPL | Comment |
|---|---|---|---|
| Q4_0 | 14.4e9 | - | google q4_0 quant, untested |
| Q4_P_H | 16.8e9 | 13.0 | modified PPL, ~ Q4_K_M size, all legacy + K quants |
Usage:
gemma 4 26B A4B it qat q4_0 unquantized is a vision capable moe RL model. It can be used together with its multimedia projector layers to process images and text inputs and generate text outputs. The mmproj file is made available in this repository.
Thinking:
By default the model will not create a RL reasoning block and just outputs
<|channel>thought
<channel|>
at the start of gen. To get it to fill in the think block use a system prompt with:
<|think|>
as the first token. This is a special token in the model vocab and must be tokenized as such to work. No other text in the system prompt besides the think token is needed to get it to fill in the RL block though other text can be added if desired.
The model was found to be highly capable on reasoning tasks when skipping think block.
Running:
Use of speculation with the model is not recommended since it is a moe with 4B activated experts meaning it can be efficiently run on CPU, but CPU does not have enough parallel hardware to leverage processing many tokens in a batch at once. Thus gen rate will go down if speculation is used with experts processed on CPU. If the model is run fully on GPU speculation can be used with the model. A recommended low overhead speculator is gemma-3-270m-it-256k. To use this speculator the inference platform must support dynamic vocab translation between draft and target.
The model can be run fully offloaded into 24G VRAM, or with CPU and expert layer offload via config OT="-ot exps=CPU -ngl 99". Because the model is a 4B active moe the CPU expert offload still gives good gen rate with very large context available, however prompt processing will be very slow for large contexts with experts offloaded to CPU.
On a 9900k/4070 or 2x 4070 setup (1 RPC) approx performance for the Q4_E_H quant is:
| CONFIG (no vision tower) | QKV | NKV | gen tps | pp tps (batch 128) |
|---|---|---|---|---|
| 4070+ 9900k CPU exp offload | F16 | 256k | 23 | ~100 |
| 2x4070 (RPC) | F16 | 128k+ | 70 | ~700 |
| 2x4070 (RPC) + spec , ND=2 | F16 | 128k+ | 77 |
Nothink example (greedy deterministic sampling):
lm "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
<|channel>thought
<channel|>Technically, you can do either, but the best choice depends on your current situation:
**Drive if:**
* **You are already in the car:** If you are sitting in the driver's seat, just drive. It’s much easier than getting out, locking the door, walking 50 meters, and then walking back.
* **You have heavy supplies:** If you are carrying buckets, heavy cleaning chemicals, or a vacuum, driving is much more efficient.
* **You are wearing formal clothes:** If you are dressed for work or an event, you might want to avoid the walk.
**Walk if:**
* **You are currently on foot:** If you are walking down the street and see the car wash, it’s easier to walk there and then walk back to your car.
* **You want to save a tiny bit of gas:** 50 meters is about 165 feet. It’s a very short distance, so walking is better for the environment and your fuel tank.
* **You want to stretch your legs:** If you've been sitting for a long time, a 1-minute walk is a good way to move around.
**The Verdict:**
If you are already in the car, **drive**. If you are standing next to the car on the sidewalk, **drive** (it's only 50 meters!).
Think example illustrating introspections kicking in for a significantly improved gen (greedy deterministic sampling):
SYSTEM="<|think|>" SPECIAL=1 lm "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
<|channel>thought
* Goal: Wash the car.
* Distance to car wash: 50 meters.
* Question: Should I walk or drive?
* *Option A: Driving*
* Pros: Faster (marginally), less physical effort.
* Cons: Uses fuel/electricity, adds wear and tear to the car, takes time to start/park/maneuver, potentially more pollution, might be overkill for 50m.
* *Option B: Walking*
* Pros: Zero fuel cost, zero wear and tear, good for light exercise, environmentally friendly, easy to park the car *at* the destination and walk back if needed (though usually, you drive to the wash, wash it, and drive away).
* Cons: Takes slightly longer (though 50m is negligible), might be inconvenient if carrying heavy cleaning supplies (but the user is going *to* the wash, implying a facility).
* *Scenario 1: The car is currently dirty.*
* If you drive a dirty car 50 meters, you're just spreading dirt/water on the road.
* If you walk, you save fuel.
* *Scenario 2: The car is currently clean (just going for a wash for maintenance).*
* Driving 50m is a very short trip.
* *Scenario 3: Logistics.*
* If the car wash is a self-service bay, you *must* drive the car there to wash it.
* If it's a drive-through, you *must* drive.
* If it's a hand-wash where you leave the car, you drive there.
* *Wait, the prompt is a bit of a "logic trap" or a "common sense" check.*
* If you are washing the car, the car needs to be *at* the car wash.
* To get the car to the car wash, you **must drive it**.
* *Direct Answer:* You have to drive the car to the car wash (unless you are walking to a car wash to wash a *different* car, which is unlikely).
* *Nuance:* If the question implies "How should I get there to *start* the process?", the answer is still: you drive the car there because the car needs to be washed.
* *Wait, let's look at it from a different angle:* Is this a riddle?
* If you walk, the car stays where it is. If the car stays where it is, it doesn't get washed.
* Therefore, to wash the car, you must drive it.
* *Drafting the response:*
1. **The Logical Answer:** You must drive the car. If you walk, the car stays behind and doesn't get washed.
2. **The Practical Answer:** 50 meters is about a 1-minute walk or a 10-second drive. Since the car needs to be at the wash to be washed, driving is the only option.
* "You should **drive**."
* Reasoning: "If you walk, the car stays where it is, and you won't have a washed car!"
*(Self-Correction: Is there any scenario where walking makes sense? Maybe if you are walking to a car wash to pick up a car you left there? No, the prompt says "I want to wash my car.")*
*Final Conclusion:* You must drive the car to the car wash.
*Wait, let's consider if the user is asking about the "trip" to the car wash.*
If the car is already at the car wash? No, "The car wash is 50 meters away."
If the user is walking to the car wash to *get* the car? No.
*Let's provide a witty but logical answer.*
1. **The Logical Necessity:** You have to drive the car there.
2. **The "Riddle" aspect:** If you walk, the car doesn't move.
*Let's keep it simple.*
"You should **drive**—because if you walk, the car stays where it is, and you won't have a washed car!"<channel|>You should **drive**.
If you walk, the car stays where it is, and you won't have a washed car!
The model passed two long context tests with F16 QKV running on 2x4070 showing usable prompt processing speed for 100k+ token prompts.
Vision:
The model was tested in vision mode on a couple pretty tough bird ID image and found to exhibit poor performance in both think and nonthink mode, but did at least consider the correct bird on one of the test images. As a comparision gemma3 27B went 1 for 2 and Qwen3 27B completely aces these tough (quite blurry images of a small bird) ID tests. The model did a great job on some text based image prompts though.
Code:
The model did a pretty good job on all the code test prompts, however it is not in the same league as the Qwen3 and 3.5 models for code gen.
Llama.cpp inference/isssues:
The model cannot compute valid perplexity due to the instruct tune forcing it to generate
<|channel>thought
as assitant gen independent of previous prompt contents. To work around this problem a modifed perplexity is computed by overwriting the beginning of the perplexity chunk contents with the forced assistent gen as follows:
# chunk is a string of text to eval perplexity on
injects='model\n<|channel>thought\n<channel|>'
chunk="${injects}${chunk:${#injects}}"
logprobs are skipped over the beginning part of the perplexity prompt using a modified llama.cpp downstream server to compute perplexity. Discussion at: https://github.com/ggml-org/llama.cpp/issues/21388#issuecomment-4185889930
Benchmarks:
A full set of both math and vision benchmarks for the model are given here: https://huggingface.co/spaces/steampunque/benchlm
Download the file from below:
| Link | Type | Size/e9 B | Notes |
|---|---|---|---|
| gemma-4-26B-A4B-it-qat-q4_0-unquantized.Q4_P_H.gguf | Q4_P_H | 16.8e9 B | ~ Q4_K_M size, tensors padded to enable K quant use on all layers |
| gemma-4-26B-A4B-it-qat-q4_0-unquantized.mmproj.gguf | F16 | 1.2e9 B | multimedia projector |
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:
- Downloads last month
- 229
4-bit
Model tree for steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF
Base model
google/gemma-4-26B-A4B
Install Unsloth Studio (macOS, Linux, WSL)
# Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for steampunque/gemma-4-26B-A4B-it-qat-q4_0-unquantized-MP-GGUF to start chatting