--- license: apache-2.0 license_link: https://ai.google.dev/gemma/docs/gemma_4_license library_name: transformers pipeline_tag: text-generation base_model: google/gemma-4-12B-it-assistant tags: - gemma4 - gemma4-assistant - gemma4-unified - mtp - speculative-decoding - fp8 - modelopt - quantized - vllm --- # Gemma 4 12B-it Assistant — FP8 (ModelOpt) FP8-quantized version of [`google/gemma-4-12B-it-assistant`](https://huggingface.co/google/gemma-4-12B-it-assistant), the Multi-Token Prediction (MTP) drafter that pairs with the Gemma 4 12B-it target for speculative decoding. The drafter is a small 4-layer model; its linear layers are quantized to FP8 (E4M3) with per-tensor static scales via NVIDIA [ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer). The drafter↔target handshake projections (`pre_projection`, `post_projection`) and `lm_head` stay in BF16. This drafter is **not a standalone text model** — it requires a target model to provide `shared_kv_states` at inference time. Use it as a spec model in vLLM, paired with the 12B-it target. ## Compatible target Pairs with [`bahadirakdemir/gemma-4-12B-it-text-fp8`](https://huggingface.co/bahadirakdemir/gemma-4-12B-it-text-fp8) — the FP8-quantized 12B-it text tower produced by the same pipeline. The FP8 scales were calibrated by running real speculative decoding against the 12B-it target over 32 instruct-style prompts. ## Requirements This drafter and its target use the **unified** Gemma 4 architecture (`gemma4_unified`), which is newer than the classic `gemma4` (e.g. 31B). You need: - **transformers ≥ 5.10.0** - **vLLM with `gemma4_unified` support** — at the time of writing this is on the `main` branch / nightly (`uv pip install -U vllm --pre`), not yet in a tagged stable release (≤ 0.22.0). It will be in the next stable release. ## Usage with vLLM ```bash vllm serve bahadirakdemir/gemma-4-12B-it-text-fp8 \ --quantization modelopt \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --gpu-memory-utilization 0.5 \ --limit-mm-per-prompt '{"image": 0, "audio": 0}' \ --speculative-config '{"model": "bahadirakdemir/gemma-4-12B-it-assistant-fp8", "num_speculative_tokens": 4}' ``` Tested with `vllm/vllm-openai:gemma4-0505-arm64-cu130` on NVIDIA GB10. This is the 12B counterpart of [`bahadirakdemir/gemma-4-31B-it-assistant-fp8`](https://huggingface.co/bahadirakdemir/gemma-4-31B-it-assistant-fp8), produced by the same pipeline. License: Apache 2.0, inherited from upstream Gemma 4 — see [the Gemma 4 license](https://ai.google.dev/gemma/docs/gemma_4_license).