---
license: mit
base_model:
- google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant
tags:
- gguf
- QAT
- mtp
- gemma
---

# Gemma 4 26B A4B Assistant GGUF

GGUF quantizations converted from `google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant`.

Tested with llama.cpp b9549 (Gemma 4 MTP support).

## Update

Added experimental IQ quantizations with Q4 embeddings (`token_embd.weight` stored as `Q4_0`).

### Recommendations

- `Q4_0-q4emb` — recommended for most users
- `Q8_0` — for users with spare VRAM


## Files
* `gemma-4-26B-A4B-it-assistant-f16.gguf`
* `gemma-4-26b-A4B-it-assistant-Q4_0.gguf`
* `gemma-4-26b-A4B-it-assistant-Q4_0-q4emb.gguf` (closest to pure Q4 QAT layout)
* `gemma-4-26b-A4B-it-assistant-IQ4_NL-q4emb.gguf`
* `gemma-4-26b-A4B-it-assistant-IQ3_M-q4emb.gguf` (smallest that still works)
* `gemma-4-26b-A4B-it-assistant-Q8_0.gguf`

### Q4 Embedding Variant

`Q4_0-q4emb` is an experimental quantization in which `token_embd.weight` is stored as `Q4_0` instead of the `Q6_K` quantization that llama.cpp typically uses for this tensor.

This mirrors recent QAT experiments with Gemma models: keeping the embeddings in the Q4 format they were trained for may better match the intended QAT behavior.

Initial testing showed similar draft acceptance rates to the default Q4_0 quant, with a small speed advantage, though more benchmarking is needed.
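
As a rough illustration of the layout described above, the sketch below shows one way a `Q4_0` file with `Q4_0` embeddings could be produced from the f16 GGUF, using `llama-quantize` and its `--token-embedding-type` option. This is an assumption about tooling, not a record of how the files in this repository were built. The `make_q4emb` helper and the `QUANTIZE_BIN` override are hypothetical names used only for this example.

```bash
# Sketch only: quantize to Q4_0 while forcing token_embd.weight to Q4_0.
# Assumes llama-quantize from a llama.cpp build is on PATH.
# QUANTIZE_BIN is a hypothetical override, handy for dry runs (e.g. QUANTIZE_BIN=echo).
make_q4emb() {
  # $1 = source f16 GGUF, $2 = destination GGUF
  "${QUANTIZE_BIN:-llama-quantize}" --token-embedding-type q4_0 "$1" "$2" Q4_0
}

# Usage (not run here):
# make_q4emb gemma-4-26B-A4B-it-assistant-f16.gguf \
#            gemma-4-26b-A4B-it-assistant-Q4_0-q4emb.gguf
```

Setting `QUANTIZE_BIN=echo` prints the arguments instead of running the quantizer, which is a cheap way to check the command line first.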


## Example

```bash
llama-server \
  -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
  -md gemma-4-26b-A4B-it-assistant-Q4_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 2
```

Recommended values:

* `--spec-draft-n-max 2` for general use
* `--spec-draft-n-max 3` for coding workloads
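
The values above could be wired into a small launcher, sketched below. The `draft_n_max` and `print_server_cmd` helpers and the workload labels are hypothetical. The flags and file names are copied from the example in this section. The script only prints the command rather than starting the server.

```bash
# Map a workload label to the draft length suggested above.
# "coding" gets 3; anything else falls back to the general-use value of 2.
draft_n_max() {
  case "$1" in
    coding) echo 3 ;;
    *)      echo 2 ;;
  esac
}

# Print (do not run) the llama-server command line for a given workload.
print_server_cmd() {
  printf 'llama-server -m %s -md %s --spec-type draft-mtp --spec-draft-n-max %s\n' \
    "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" \
    "gemma-4-26b-A4B-it-assistant-Q4_0.gguf" \
    "$(draft_n_max "$1")"
}

print_server_cmd coding
```

Treat these as starting points: the best draft length depends on the acceptance rate for a given workload, so it is worth measuring both settings.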