Instructions to use bahadirakdemir/gemma-4-12B-it-assistant-fp8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bahadirakdemir/gemma-4-12B-it-assistant-fp8 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="bahadirakdemir/gemma-4-12B-it-assistant-fp8")# Load model directly from transformers import AutoTokenizer, AutoModelForMultimodalLM tokenizer = AutoTokenizer.from_pretrained("bahadirakdemir/gemma-4-12B-it-assistant-fp8") model = AutoModelForMultimodalLM.from_pretrained("bahadirakdemir/gemma-4-12B-it-assistant-fp8") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use bahadirakdemir/gemma-4-12B-it-assistant-fp8 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "bahadirakdemir/gemma-4-12B-it-assistant-fp8" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bahadirakdemir/gemma-4-12B-it-assistant-fp8", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/bahadirakdemir/gemma-4-12B-it-assistant-fp8
- SGLang
How to use bahadirakdemir/gemma-4-12B-it-assistant-fp8 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "bahadirakdemir/gemma-4-12B-it-assistant-fp8" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bahadirakdemir/gemma-4-12B-it-assistant-fp8", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "bahadirakdemir/gemma-4-12B-it-assistant-fp8" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bahadirakdemir/gemma-4-12B-it-assistant-fp8", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use bahadirakdemir/gemma-4-12B-it-assistant-fp8 with Docker Model Runner:
docker model run hf.co/bahadirakdemir/gemma-4-12B-it-assistant-fp8
Gemma 4 12B-it Assistant — FP8 (ModelOpt)
FP8-quantized version of google/gemma-4-12B-it-assistant,
the Multi-Token Prediction (MTP) drafter that pairs with the Gemma 4 12B-it
target for speculative decoding. The drafter is a small 4-layer model; its
linear layers are quantized to FP8 (E4M3) with per-tensor static scales via
NVIDIA ModelOpt. The
drafter↔target handshake projections (pre_projection, post_projection) and
lm_head stay in BF16.
This drafter is not a standalone text model — it requires a target model
to provide shared_kv_states at inference time. Use it as a spec model in
vLLM, paired with the 12B-it target.
Compatible target
Pairs with
bahadirakdemir/gemma-4-12B-it-text-fp8 —
the FP8-quantized 12B-it text tower produced by the same pipeline. The FP8
scales were calibrated by running real speculative decoding against the 12B-it
target over 32 instruct-style prompts.
Requirements
This drafter and its target use the unified Gemma 4 architecture
(gemma4_unified), which is newer than the classic gemma4 (e.g. 31B). You need:
- transformers ≥ 5.10.0
- vLLM with
gemma4_unifiedsupport — at the time of writing this is on themainbranch / nightly (uv pip install -U vllm --pre), not yet in a tagged stable release (≤ 0.22.0). It will be in the next stable release.
Usage with vLLM
vllm serve bahadirakdemir/gemma-4-12B-it-text-fp8 \
--quantization modelopt \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.5 \
--limit-mm-per-prompt '{"image": 0, "audio": 0}' \
--speculative-config '{"model": "bahadirakdemir/gemma-4-12B-it-assistant-fp8", "num_speculative_tokens": 4}'
Tested with vllm/vllm-openai:gemma4-0505-arm64-cu130 on NVIDIA GB10.
This is the 12B counterpart of
bahadirakdemir/gemma-4-31B-it-assistant-fp8,
produced by the same pipeline.
License: Apache 2.0, inherited from upstream Gemma 4 — see the Gemma 4 license.
- Downloads last month
- 157
Model tree for bahadirakdemir/gemma-4-12B-it-assistant-fp8
Base model
google/gemma-4-12B-it-assistant