Instructions to use nohurry/gemma-4-26B-A4B-it-heretic-GUFF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nohurry/gemma-4-26B-A4B-it-heretic-GUFF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="nohurry/gemma-4-26B-A4B-it-heretic-GUFF",
	filename="gemma-4-26B-A4B-it-heretic-mmproj.bf16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use nohurry/gemma-4-26B-A4B-it-heretic-GUFF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M

Use Docker

docker model run hf.co/nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M


vLLM

How to use nohurry/gemma-4-26B-A4B-it-heretic-GUFF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nohurry/gemma-4-26B-A4B-it-heretic-GUFF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nohurry/gemma-4-26B-A4B-it-heretic-GUFF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M

Ollama
How to use nohurry/gemma-4-26B-A4B-it-heretic-GUFF with Ollama:
```
ollama run hf.co/nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M
```

Unsloth Studio

How to use nohurry/gemma-4-26B-A4B-it-heretic-GUFF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for nohurry/gemma-4-26B-A4B-it-heretic-GUFF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for nohurry/gemma-4-26B-A4B-it-heretic-GUFF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for nohurry/gemma-4-26B-A4B-it-heretic-GUFF to start chatting

Pi

How to use nohurry/gemma-4-26B-A4B-it-heretic-GUFF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use nohurry/gemma-4-26B-A4B-it-heretic-GUFF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use nohurry/gemma-4-26B-A4B-it-heretic-GUFF with Docker Model Runner:
```
docker model run hf.co/nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M
```

Lemonade

How to use nohurry/gemma-4-26B-A4B-it-heretic-GUFF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M

Run and chat with the model

lemonade run user.gemma-4-26B-A4B-it-heretic-GUFF-Q4_K_M

List all available models

lemonade list

Quants of https://huggingface.co/coder3101/gemma-4-26B-A4B-it-heretic, using Unsloth's imatrix.

Not sure how many different quants I can upload as I am severely constrained by my PC's storage space.

Usage

Google recommends the following sampler settings:

temperature = 1.0
top_k = 64
top_p = 0.95
min_p = 0.0
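
As a rough illustration, these values can be sent per request to an OpenAI-compatible server such as llama-server. The sketch below only builds the JSON body. The helper name `build_chat_payload` is hypothetical, and it assumes the server accepts `top_k` and `min_p` as extra top-level fields next to the standard OpenAI ones.

```python
import json

# Google's recommended sampler settings, as listed above.
GOOGLE_SAMPLERS = {"temperature": 1.0, "top_k": 64, "top_p": 0.95, "min_p": 0.0}


def build_chat_payload(prompt: str, samplers: dict = GOOGLE_SAMPLERS) -> dict:
    """Build a chat-completions request body (hypothetical helper)."""
    payload = {
        "model": "nohurry/gemma-4-26B-A4B-it-heretic-GUFF:Q4_K_M",
        "messages": [{"role": "user", "content": prompt}],
    }
    # Assumption: sampler fields sit at the top level of the request body.
    payload.update(samplers)
    return payload


payload = build_chat_payload("Write a haiku about autumn.")
print(json.dumps(payload, indent=2))
```

The resulting body can be posted to `/v1/chat/completions` with any HTTP client. Swapping in another preset only means passing a different `samplers` dict.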

For creative writing, I use:

temperature = 1.0
top_k = 0
top_p = 1.0
min_p = 0.05
top-n-sigma = 1.0
adaptive-target = 0.7
adaptive-decay = 0.9

For the image encoder:

image-min-tokens: 70
image-max-tokens: 1120

While its reasoning is not excessive, sometimes I do want to limit it:

predict: 16384
reasoning-budget: 8192
reasoning-budget-message: "... I think I've explored this enough, time to respond."

Within llama.cpp and koboldcpp, ensure that --swa-full is enabled as this model uses Sliding Window Attention (SWA).
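
One way to avoid forgetting the flag is to assemble the llama-server command in a small script. This is only a sketch: the helper name `build_server_args` is made up for illustration, and it uses just the `-hf` and `--swa-full` options that appear on this page. Extend the list to match your own llama.cpp build.

```python
import shlex


def build_server_args(repo: str, quant: str, swa_full: bool = True) -> list[str]:
    """Return the argv list for llama-server (hypothetical helper)."""
    args = ["llama-server", "-hf", f"{repo}:{quant}"]
    if swa_full:
        # Keep the flag on for models that use Sliding Window Attention.
        args.append("--swa-full")
    return args


args = build_server_args("nohurry/gemma-4-26B-A4B-it-heretic-GUFF", "Q4_K_M")
# Print a copy-pasteable command line; pass `args` to subprocess.run to launch it.
print(shlex.join(args))
```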

Thinking

In order to enable thinking, add <|think|> at the top of your system prompt:

<|think|>
You are a helpful assistant.

Conversely, to disable thinking simply omit <|think|> from your system prompt:

You are a helpful assistant.
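
If you build the message list in code, the toggle can be a single flag. A minimal sketch, assuming an OpenAI-style `messages` list; the helper name `build_messages` is hypothetical.

```python
THINK_TAG = "<|think|>"


def build_messages(
    user_prompt: str,
    system_prompt: str = "You are a helpful assistant.",
    thinking: bool = True,
) -> list[dict]:
    """Prepend the think tag to the system prompt only when thinking is wanted."""
    system = f"{THINK_TAG}\n{system_prompt}" if thinking else system_prompt
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]


with_thinking = build_messages("Why is the sky blue?")
without_thinking = build_messages("Why is the sky blue?", thinking=False)
print(with_thinking[0]["content"])
```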

To parse the thinking, use the following in SillyTavern or your platform of choice:

prefix: <|channel>thought
postfix: <channel|>
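
If your platform has no built-in reasoning parser, the same prefix and postfix can be applied by hand. The function below is a sketch with a hypothetical name, `split_thinking`. It assumes the response contains at most one thinking block.

```python
THINK_PREFIX = "<|channel>thought"
THINK_POSTFIX = "<channel|>"


def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is "" when no block is found."""
    start = text.find(THINK_PREFIX)
    if start == -1:
        return "", text.strip()
    body_start = start + len(THINK_PREFIX)
    end = text.find(THINK_POSTFIX, body_start)
    if end == -1:
        # Unterminated block (e.g. cut off by a token limit): all of it is thinking.
        return text[body_start:].strip(), text[:start].strip()
    thinking = text[body_start:end].strip()
    answer = (text[:start] + text[end + len(THINK_POSTFIX):]).strip()
    return thinking, answer


raw = "<|channel>thought\nTwo plus two is four.<channel|>The answer is 4."
thinking, answer = split_thinking(raw)
print(repr(thinking), repr(answer))
```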

Reproduction

You can read REPRODUCE.md in the repo's "files and versions" to see how I made the quants. mmproj files are also located there.

FAQ

What quant should I use?

The largest model that fits inside your VRAM.

Context VRAM: context length / 8192 ≈ GB of VRAM. So a 16384-token context needs about 16384 / 8192 = 2 GB.

As an example to estimate VRAM cost:

+ Q3_K_M model (12.9 GB)
+ 16K context (2 GB)
+ Q8_0 mmproj (800 MB)
= ~15.7 GB usage.
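
The same estimate as a small calculator, in case you want to try other combinations. This is a sketch of the rule of thumb above (about 1 GB per 8192 tokens of context), not a measurement. The function name `estimate_vram_gb` is hypothetical.

```python
def estimate_vram_gb(model_gb: float, context_tokens: int, mmproj_gb: float = 0.0) -> float:
    """Rough VRAM estimate: model file + context + optional mmproj, all in GB."""
    context_gb = context_tokens / 8192  # rule of thumb: ~1 GB per 8192 tokens
    return model_gb + context_gb + mmproj_gb


# The worked example from above: Q3_K_M model, 16K context, Q8_0 mmproj.
total = estimate_vram_gb(model_gb=12.9, context_tokens=16384, mmproj_gb=0.8)
print(f"~{total:.1f} GB")
```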

If you mainly want to do OCR tasks, prefer a lower text model quant and a higher mmproj quant (bf16/f16/f32), as encoders are far more sensitive to quantization.

F16 vs BF16

BF16 keeps the same exponent range as F32, so it can store the original weights more faithfully than F16, which has a narrower range.

Do you have support for BF16 acceleration? If yes, use BF16.

Any of the following indicates BF16 support:

AVX512-BF16
SPV_KHR_bfloat16
NVIDIA RTX 30 series or newer
AMD Radeon RX 7000 series or newer
Intel Xe A series or newer
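
For the CPU case, the first indicator can be checked from a script on Linux. A minimal sketch, assuming an x86 machine whose `/proc/cpuinfo` exposes a `flags` line and reports the feature as `avx512_bf16`. The function name `has_avx512_bf16` is hypothetical, and GPU support has to be checked separately.

```python
from pathlib import Path


def has_avx512_bf16(cpuinfo_text: str) -> bool:
    """Return True if the first 'flags' line lists avx512_bf16."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return "avx512_bf16" in value.split()
    return False  # no flags line found (e.g. a non-x86 CPU)


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    print("AVX512-BF16:", has_avx512_bf16(cpuinfo.read_text()))
```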

If you like to tinker (like me) by running LLMs on an Intel N5000 CPU with Intel UHD Graphics 605 over Vulkan 1.3, F16 is going to run better.

Why don't you use the imatrix for the q8_0 quant?

As explained by the wonderful team mradermacher:

Q8_0 imatrix quants do not exist - some quanters claim otherwise, but Q8_0 ggufs do not contain any tensor type that uses the imatrix data, although technically it might be possible to do so.
--- https://huggingface.co/mradermacher/model_requests

Downloads last month: 10,621

GGUF

Model size: 25B params
Architecture: gemma4
Hardware compatibility: 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit, 32-bit

Model tree for nohurry/gemma-4-26B-A4B-it-heretic-GUFF

Base model: google/gemma-4-26B-A4B
Finetuned: google/gemma-4-26B-A4B-it
Finetuned: coder3101/gemma-4-26B-A4B-it-heretic
Quantized (19): this model