Observation on Knowledge Degradation in Official QAT Quants vs. AutoRound
I've been benchmarking the official Gemma-4-31B quants (both GGUF and W4A16) and noticed a significant discrepancy in encyclopedic knowledge retention compared to the FP16 original and the AutoRound (Intel) quantization.
While QAT (Quantization-Aware Training) seems to preserve general reasoning and "intelligence," there is a noticeable loss in the "long tail" of factual data. For example, the model now gets the names of historical figures wrong (e.g., referring to Christiaan Huygens as "Christopher"), errors that were absent in the original model.
It seems that the QAT objective may have over-optimized for general perplexity at the expense of factual precision. In contrast, AutoRound (PTQ) provides a much better balance and stays closer to the original model's knowledge base. I highly suggest reviewing the QAT pipeline to prevent this kind of "knowledge erosion" in future releases.
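One way to make this kind of comparison repeatable is to score each quant's answers against a short list of expected strings. The sketch below is only an offline illustration: the `FactCheck` type, the `passes` and `score_answers` helpers, and the probe prompt are hypothetical and are not the benchmark described above; only the Huygens name mix-up comes from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactCheck:
    """One factual probe: a prompt plus strings the answer must / must not contain."""
    prompt: str
    must_contain: tuple[str, ...]
    must_not_contain: tuple[str, ...] = ()


def passes(check: FactCheck, answer: str) -> bool:
    """Case-insensitive substring check; crude, but enough to catch a wrong name."""
    text = answer.lower()
    has_required = all(s.lower() in text for s in check.must_contain)
    has_forbidden = any(s.lower() in text for s in check.must_not_contain)
    return has_required and not has_forbidden


def score_answers(checks: list[FactCheck], answers: list[str]) -> float:
    """Fraction of probes answered correctly (0.0 to 1.0)."""
    if len(checks) != len(answers):
        raise ValueError("need exactly one answer per check")
    if not checks:
        return 0.0
    return sum(passes(c, a) for c, a in zip(checks, answers)) / len(checks)


# Hypothetical probe built around the Huygens example from this post.
checks = [
    FactCheck(
        prompt="Who discovered Titan, the largest moon of Saturn?",
        must_contain=("Christiaan Huygens",),
        must_not_contain=("Christopher Huygens",),
    )
]
```

Running the same `checks` against the FP16 model, the QAT quant, and the AutoRound quant would give one score per variant, which is easier to compare than individual anecdotes.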
Please note this thread as well:
Don't get me wrong. Gemma4-31B is my favorite model, and I appreciate all your efforts to make it better and more affordable.
The Gemma 4 model is more sensitive to quantization than the Qwen 3.5 and 3.6 models. This may be due to its architecture or its higher information density, but I'm not sure. After seeing the accuracy degradation in the QAT versions, I've avoided using them.
I've been testing the same prompts with Gemma-4-26B-A4B. Even quantized in GGUF, the smaller model doesn't make such crucial mistakes. I also tested the AutoRound quant again. This quant doesn't make factual mistakes. AWQ doesn't make them either. For sure, the model lost its accuracy because of the QAT. Today, I've been trying a special GGUF made by the Unsloth team (https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/). This quant shows the same inaccuracy in factual data.
Hi all,
Thanks for sharing the feedback. Could you please provide us with some examples or the exact prompts that produced the issue? Having the full prompts will help us investigate further.
I suspect it might be caused by the quantization script, as there's no such obvious decline in output quality with the unsloth version unsloth/gemma-4-31B-it-qat-GGUF.
To rule out factors possibly caused by GGUF, I tested this version (google/gemma-4-31B-it-qat-w4a16-ct) using vLLM:
vllm serve --enable-auto-tool-choice --tool-call-parser gemma4 --model $path_to_model_dir --max-model-len 131072 --reasoning-parser gemma4 --language-model-only --chat-template $path_to_model_dir/chat_template.jinja
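For anyone who wants to replay prompts against a server started this way, here is a minimal client sketch using only the Python standard library. It assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route at a base URL you supply, and that `model` matches whatever name the server registered (with a local directory that is typically the path itself). The `build_chat_request` and `ask` helpers and the `temperature` setting are my own illustration, not part of the test described above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # assumption: greedy decoding for repeatable comparisons
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(request: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's text; needs a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask(build_chat_request("http://localhost:8000", model_name, "Introduce EVE Online."))` would then return the reply text, where the port and `model_name` depend on how the server was launched.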
The model answered simple knowledge-based questions correctly, such as: 'Introduce EVE Online.'
It also handled summarizing articles with the webfetch tool and then being asked whether it remembered the article's content, for example: 'When to use --kv-cache-dtype-skip-layers sliding_window?'
It passed all of these easy tests.
However, the model completely failed at complex writing tasks.
Test content: I gave the model about 50K tokens of story background along with the current scenario, and asked it to analyze the situation, predict the likely behavior of the characters present, and then write an 800-1200 word passage. It also had to summarize what it wrote for easy human reading.
Result: Although the output does draw on the story background, it's complete nonsense. Moreover, toward the end of the output, various languages start mixing together, and it ultimately falls into a loop of generating garbage.
A similar phenomenon often occurs with lower-bit quantized models.
Example:
所有视觉要素最终通过对话推动剧情节点由静态觉醒进入交互阶段结论确认人物已经完全激活处于待命状态完毕总结结束报告提交完成完毕输出终止完结结案总结完了综上所述所有关键动作已执行达成既定目标流程闭环逻辑自恰任务达成输出停止结束所有后续内容不再产生截断点到达在这里截止由此为止最后一句发言终结此处结算全部过程完結完畢จบสิ้น終結END ENDING FINAL CONCLUSION DONE OVER FINISHED TERMINATE STOP STOPPING HALT HALTING BREAK BREAKING CUT OFF CEASE QUIT EXIT LEAVE OUT OFF LINE DISCONNECT DEACTIVATE SHUTDOWN POWEROFF SLEEP HIBERNATE MODE OFF NULL VOID ZERO EMPTY BLANK WHITE SPACE GONE AWAY FAR DISTANCE LONG AGO NEVER MORE NO MORE NONE NOTHING ZIP ZILCH NADA NIL ZERO POINT ZERO PERCENT PERCENTAGE COMPLETE DONE AND DUSTED ALL SET GOOD TO GO READY FOR NEXT STEP WAIT NOT NOW LATER SOME OTHER TIME...
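A rough way to flag this kind of degeneration automatically is to measure how repetitive the tail of an output is and how many writing systems it mixes. The sketch below is a heuristic of my own, not something used in the test above: `repeated_ngram_ratio` and `scripts_used` are hypothetical helper names, and the word-based n-grams rely on whitespace, so unsegmented Chinese text counts as a single word.

```python
import unicodedata
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Share of word n-grams that repeat an earlier n-gram (0.0 to 1.0)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def scripts_used(text: str) -> set[str]:
    """Rough set of writing systems: the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts
```

A tail that reports three or more scripts, or a high repeat ratio, is a reasonable signal to stop generation or mark the sample as failed; the thresholds would need tuning on real outputs.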