Gemma 4 E4B IT QAT Assistant GGUF

GGUF conversions of Google's official unquantized QAT assistant/drafter checkpoint for Gemma 4 E4B IT.

Google publishes GGUFs for the main QAT models, but the assistant/drafter checkpoint is published as unquantized QAT safetensors. This repo packages the E4B assistant as GGUF for llama.cpp speculative decoding.

Base model: google/gemma-4-E4B-it-qat-q4_0-unquantized-assistant

These files were converted with llama.cpp commit 961e9a3e46ca4cf7e6e86cfceb5b5e32084bf5f0.

The QAT assistant GGUFs also appear to work fine with regular, non-QAT Gemma 4 E4B IT target models in llama.cpp.

Usage

llama-server \
  -m gemma-4-E4B-it.gguf \
  --model-draft gemma-4-E4B-it-qat-assistant-q4_k_m.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3

Files

  • gemma-4-E4B-it-qat-assistant-bf16.gguf
  • gemma-4-E4B-it-qat-assistant-q8_0.gguf
  • gemma-4-E4B-it-qat-assistant-q4_k_m.gguf
  • gemma-4-E4B-it-qat-assistant-q4_0.gguf

Conversion

BF16 source GGUF:

python convert_hf_to_gguf.py \
  google/gemma-4-E4B-it-qat-q4_0-unquantized-assistant \
  --outfile gemma-4-E4B-it-qat-assistant-bf16.gguf \
  --outtype bf16

Q4_K_M and Q4_0 were quantized directly from the BF16 GGUF:

llama-quantize gemma-4-E4B-it-qat-assistant-bf16.gguf gemma-4-E4B-it-qat-assistant-q4_k_m.gguf q4_k_m
llama-quantize gemma-4-E4B-it-qat-assistant-bf16.gguf gemma-4-E4B-it-qat-assistant-q4_0.gguf q4_0

Q8_0 was exported directly from the Hugging Face checkpoint:

python convert_hf_to_gguf.py \
  google/gemma-4-E4B-it-qat-q4_0-unquantized-assistant \
  --outfile gemma-4-E4B-it-qat-assistant-q8_0.gguf \
  --outtype q8_0
Downloads last month
834
GGUF
Model size
78M params
Architecture
gemma4-assistant
Hardware compatibility
Log In to add your hardware

4-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for cascade-tech/gemma-4-E4B-it-qat-q4_0-unquantized-assistant-gguf