Gemma 4 MTP Draft Models — GGUF Q8_0

This repository contains GGUF-converted MTP / assistant draft models for Gemma 4 speculative decoding in llama.cpp.

These files are not standalone chat models. They are draft / MTP models intended to be loaded alongside the matching Gemma 4 target model using llama.cpp speculative decoding.

Files

File Source checkpoint Source revision GGUF type SHA256
mtp-gemma-4-31B-it.gguf google/gemma-4-31B-it-qat-q4_0-unquantized-assistant 5db7ebef2cfabfc4b5b0dc898171d854b1521b14 Q8_0 c1a08236603ff83678494d491a80b941b5ed5dfed035ce778d662dfc435d832d
mtp-gemma-4-12B-it.gguf google/gemma-4-12B-it-qat-q4_0-unquantized-assistant 9b957049a807d0ce8d7682d0f308a2df835c3f9a Q8_0 13331068b6af643c3dc75e619373b674c1f75a1958e7c82e2020d96a17c63809

What these are

Google's source checkpoints are unquantized QAT assistant checkpoints. They contain half-precision weights extracted from the QAT pipeline for Gemma 4 MTP / assistant drafting.

This repo provides those assistant checkpoints converted to GGUF and quantized as Q8_0 for use with llama.cpp's draft-mtp speculative decoding path.

The target model itself is not included in this repository. Pair these files with the matching Gemma 4 IT GGUF target model.

Conversion approach

Converted with llama.cpp's Hugging Face to GGUF converter:

python convert_hf_to_gguf.py \
  --outtype q8_0 \
  --outfile mtp-gemma-4-31B-it.gguf \
  /path/to/google/gemma-4-31B-it-qat-q4_0-unquantized-assistant

python convert_hf_to_gguf.py \
  --outtype q8_0 \
  --outfile mtp-gemma-4-12B-it.gguf \
  /path/to/google/gemma-4-12B-it-qat-q4_0-unquantized-assistant

Conversion environment:

  • llama.cpp commit: 7d2b45b4f
  • GGUF v3
  • Architecture: gemma4-assistant
  • Output type: Q8_0
  • 49 tensors per model

A tokenizer config compatibility fix was applied before conversion because the local Transformers version errored on:

"extra_special_tokens": []

The value was changed to an empty object for conversion compatibility:

"extra_special_tokens": {}

No model weights were otherwise modified beyond GGUF conversion and Q8_0 quantization.

Usage with llama.cpp

Example for Gemma 4 31B:

llama-server \
  -m /path/to/gemma-4-31B_q4_0-it.gguf \
  -md /path/to/mtp-gemma-4-31B-it.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --kv-unified \
  -fa on

For very large context sizes, the target model and KV cache may consume nearly all GPU memory before the draft model is loaded. If draft loading fails, place the draft model on CPU:

llama-server \
  -m /path/to/gemma-4-31B_q4_0-it.gguf \
  -md /path/to/mtp-gemma-4-31B-it.gguf \
  --spec-type draft-mtp \
  --spec-draft-ngl 0 \
  --spec-draft-n-max 4 \
  --kv-unified \
  -fa on

For example, with a 32 GB GPU and full 262k context, --spec-draft-ngl 0 may be needed. Reducing context length can allow the draft model to remain on GPU.

Notes

  • Requires a llama.cpp build with Gemma 4 MTP / draft-mtp support.
  • The draft model should be paired with the matching target model family and size.
  • These files are intended for speculative decoding acceleration, not direct prompting.
  • The 31B assistant checkpoint has n_ctx_train = 131072; using it with a 262k target may emit a context warning for the draft context.

License

The source Gemma 4 models are provided by Google under the Apache 2.0 license. See the original model cards and license terms:

Downloads last month
562
GGUF
Model size
0.4B params
Architecture
gemma4-assistant
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Pulsate1680/gemma-4-mtp-q8_0-gguf