---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: transformers
pipeline_tag: text-generation
base_model: google/gemma-4-12B-it-assistant
tags:
  - gemma4
  - gemma4-assistant
  - gemma4-unified
  - mtp
  - speculative-decoding
  - fp8
  - modelopt
  - quantized
  - vllm
---

# Gemma 4 12B-it Assistant — FP8 (ModelOpt)

FP8-quantized version of [`google/gemma-4-12B-it-assistant`](https://huggingface.co/google/gemma-4-12B-it-assistant),
the Multi-Token Prediction (MTP) drafter that pairs with the Gemma 4 12B-it
target for speculative decoding. The drafter is a small 4-layer model; its
linear layers are quantized to FP8 (E4M3) with per-tensor static scales via
NVIDIA [ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer). The
drafter↔target handshake projections (`pre_projection`, `post_projection`) and
`lm_head` stay in BF16.

This drafter is **not a standalone text model** — it requires a target model
to provide `shared_kv_states` at inference time. Use it as a spec model in
vLLM, paired with the 12B-it target.

## Compatible target

Pairs with
[`bahadirakdemir/gemma-4-12B-it-text-fp8`](https://huggingface.co/bahadirakdemir/gemma-4-12B-it-text-fp8) —
the FP8-quantized 12B-it text tower produced by the same pipeline. The FP8
scales were calibrated by running real speculative decoding against the 12B-it
target over 32 instruct-style prompts.

## Requirements

This drafter and its target use the **unified** Gemma 4 architecture
(`gemma4_unified`), which is newer than the classic `gemma4` (e.g. 31B). You need:

- **transformers ≥ 5.10.0**
- **vLLM with `gemma4_unified` support** — at the time of writing this is on the
  `main` branch / nightly (`uv pip install -U vllm --pre`), not yet in a tagged
  stable release (≤ 0.22.0). It will be in the next stable release.

## Usage with vLLM

```bash
vllm serve bahadirakdemir/gemma-4-12B-it-text-fp8 \
  --quantization modelopt \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.5 \
  --limit-mm-per-prompt '{"image": 0, "audio": 0}' \
  --speculative-config '{"model": "bahadirakdemir/gemma-4-12B-it-assistant-fp8", "num_speculative_tokens": 4}'
```

Tested with `vllm/vllm-openai:gemma4-0505-arm64-cu130` on NVIDIA GB10.

This is the 12B counterpart of
[`bahadirakdemir/gemma-4-31B-it-assistant-fp8`](https://huggingface.co/bahadirakdemir/gemma-4-31B-it-assistant-fp8),
produced by the same pipeline.

License: Apache 2.0, inherited from upstream Gemma 4 — see
[the Gemma 4 license](https://ai.google.dev/gemma/docs/gemma_4_license).