Gemma-4-26B-A4B-it-DFlash

GGUF quantizations of z-lab/gemma-4-26B-A4B-it-DFlash.

Converted to BF16 using convert_hf_to_gguf.py, then quantized using llama-quantize from llama.cpp.
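
The two steps above can be sketched as a small script that builds the command lines. This is a minimal sketch, not the exact invocation used for these files: the directory and file names are hypothetical, and it assumes a llama.cpp checkout whose convert_hf_to_gguf.py supports this architecture. The commands are only printed; pass each list to subprocess.run to execute it.

```python
import shlex

# Quant types listed in this card (BF16 is the conversion output itself).
QUANT_TYPES = ["Q4_K_M", "Q5_K", "Q6_K", "Q8_0"]


def build_commands(hf_dir, out_dir, name):
    """Return the convert command followed by one llama-quantize command per type."""
    bf16_path = f"{out_dir}/{name}-BF16.gguf"
    commands = [
        # Step 1: convert the Hugging Face checkpoint to a BF16 GGUF file.
        ["python", "convert_hf_to_gguf.py", hf_dir,
         "--outtype", "bf16", "--outfile", bf16_path],
    ]
    for quant in QUANT_TYPES:
        # Step 2: llama-quantize takes <input> <output> <type>.
        commands.append(
            ["llama-quantize", bf16_path, f"{out_dir}/{name}-{quant}.gguf", quant]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in build_commands("./gemma-4-26B-A4B-it-DFlash", "./gguf",
                              "Gemma-4-26B-A4B-it-DFlash"):
        print(shlex.join(cmd))
```

Quantizing every type from the same BF16 file keeps the quants comparable, since each one starts from the same reference weights.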

Available quants

| Quant | Bits | Size | Notes |
|---|---|---|---|
| Q4_K_M | 4 | 226 MB | Average quality |
| Q5_K | 5 | 315 MB | High quality |
| Q6_K | 6 | 367 MB | Very high quality |
| Q8_0 | 8 | 471 MB | Highest quality, near lossless, recommended |
| BF16 | 16 | 874 MB | Full precision, reference file |
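
As a rough sanity check on the sizes listed above, they can be turned into effective bits per weight. This sketch assumes the BF16 file stores every weight at exactly 16 bits and ignores GGUF metadata, so the results are estimates, not measured values.

```python
# File sizes in MB, copied from the list of available quants.
SIZES_MB = {"Q4_K_M": 226, "Q5_K": 315, "Q6_K": 367, "Q8_0": 471, "BF16": 874}


def effective_bits_per_weight(sizes_mb, reference="BF16", reference_bits=16):
    """Estimate bits per weight by scaling each size against the reference file."""
    ref_size = sizes_mb[reference]
    return {name: reference_bits * size / ref_size
            for name, size in sizes_mb.items()}


if __name__ == "__main__":
    for name, bits in effective_bits_per_weight(SIZES_MB).items():
        print(f"{name}: ~{bits:.2f} bits per weight")
```

Estimates slightly above the nominal bit width are expected, because these formats store per-block scales alongside the quantized weights.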

Usage

Use this draft model together with an existing Gemma 4 quant as the main model. Example config for llama-server:

[Gemma-4-26B-A4B-it-DFlash]
sm = layer
model = /mnt/gguf/Gemma-4-26B-A4B-it/Gemma-4-26B-A4B-it-Q8_0.gguf
model-draft = /mnt/gguf/Gemma-4-26B-A4B-it-DFlash/Gemma-4-26B-A4B-it-DFlash-Q8_0.gguf
spec-type = draft-dflash
spec-draft-n-max = 6 

(Note: I could not get sm = tensor to work; it crashes on launch. I am fairly sure this is an issue in llama.cpp.)
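
Before launching, the preset can be checked with a few lines of Python. This is a minimal sketch: check_preset and REQUIRED_KEYS are hypothetical names, and the list of required keys is an assumption drawn from the example config, not from llama-server's own validation.

```python
import configparser

# The example preset from this card, embedded so the sketch is self-contained.
PRESET = """
[Gemma-4-26B-A4B-it-DFlash]
sm = layer
model = /mnt/gguf/Gemma-4-26B-A4B-it/Gemma-4-26B-A4B-it-Q8_0.gguf
model-draft = /mnt/gguf/Gemma-4-26B-A4B-it-DFlash/Gemma-4-26B-A4B-it-DFlash-Q8_0.gguf
spec-type = draft-dflash
spec-draft-n-max = 6
"""

# Assumption: the keys the example relies on for speculative decoding.
REQUIRED_KEYS = ("model", "model-draft", "spec-type", "spec-draft-n-max")


def check_preset(text, section):
    """Parse one preset section and run basic consistency checks on it."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    options = dict(parser[section])

    missing = [key for key in REQUIRED_KEYS if key not in options]
    if missing:
        raise ValueError(f"missing keys: {', '.join(missing)}")
    if options["model"] == options["model-draft"]:
        raise ValueError("model and model-draft must point to different files")

    # int() raises ValueError if the draft length is not a whole number.
    options["spec-draft-n-max"] = int(options["spec-draft-n-max"])
    return options


if __name__ == "__main__":
    for key, value in check_preset(PRESET, "Gemma-4-26B-A4B-it-DFlash").items():
        print(f"{key} = {value}")
```

The check only inspects the text of the preset. It does not confirm that the files exist or that the installed llama-server build accepts these keys.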

Original model

See the original model card (z-lab/gemma-4-26B-A4B-it-DFlash) for details on capabilities, benchmarks, and license.

GGUF model size: 0.4B params. Architecture: dflash.