---
license: mit
base_model:
- google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant
tags:
- gguf
- QAT
- mtp
- gemma
---

# Gemma 4 26B A4B Assistant GGUF

GGUF quantizations converted from `google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant`.

Tested with llama.cpp b9549 (Gemma 4 MTP support).

# Update

Added experimental IQ quantizations with Q4 embeddings (`token_embd.weight` = `Q4_0`).

### Recommendations

- `Q4_0-q4emb`: recommended for most users
- `Q8_0`: for users with spare VRAM

## Files

* `gemma-4-26B-A4B-it-assistant-f16.gguf`
* `gemma-4-26b-A4B-it-assistant-Q4_0.gguf`
* `gemma-4-26b-A4B-it-assistant-Q4_0-q4emb.gguf` (closest to pure Q4 QAT layout)
* `gemma-4-26b-A4B-it-assistant-IQ4_NL-q4emb.gguf`
* `gemma-4-26b-A4B-it-assistant-IQ3_M-q4emb.gguf` (smallest that still works)
* `gemma-4-26b-A4B-it-assistant-Q8_0.gguf`

### Q4 Embedding Variant

`Q4_0-q4emb` is an experimental quantization in which `token_embd.weight` is kept in `Q4_0` instead of the `Q6_K` quantization typically used by llama.cpp.

This follows a similar approach to recent QAT experiments for Gemma models, where preserving the original Q4-trained embedding format may better match the intended QAT behavior.

Initial testing showed draft acceptance rates similar to the default `Q4_0` quant, with a small speed advantage, though more benchmarking is needed.

## Example

```bash
llama-server \
  -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
  -md gemma-4-26b-A4B-it-assistant-Q4_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 2
```

Recommended values:

* `--spec-draft-n-max 2` for general use
* `--spec-draft-n-max 3` for coding workloads
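
The recommended values above could be wired into a small launcher script. The following is only a sketch: `draft_n_max` and `build_cmd` are hypothetical helper names, the workload labels `general` and `coding` are assumptions chosen for illustration, and the model filenames are copied from the example command and may differ on your system. It prints the command line rather than running it, so you can inspect it first.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a workload label to the --spec-draft-n-max value
# recommended in this README (3 for coding, 2 for anything else).
draft_n_max() {
  case "$1" in
    coding) echo 3 ;;
    *)      echo 2 ;;
  esac
}

# Hypothetical helper: print the llama-server command line for a workload.
# Model paths are taken from the example above; adjust them to your files.
build_cmd() {
  local workload="${1:-general}"
  printf '%s ' llama-server \
    -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
    -md gemma-4-26b-A4B-it-assistant-Q4_0.gguf \
    --spec-type draft-mtp \
    --spec-draft-n-max "$(draft_n_max "$workload")"
  echo
}

# Print the command for a coding workload.
build_cmd coding
```

To launch directly instead of printing, the output of `build_cmd` can be reviewed and then pasted into a shell.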