How to use from vLLM
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prithivMLmods/gemma-4-31B-it-F32-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/gemma-4-31B-it-F32-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
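The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on http://localhost:8000. The helper names `build_chat_payload` and `post_chat` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL = "prithivMLmods/gemma-4-31B-it-F32-GGUF"


def build_chat_payload(prompt, image_url=None, model=MODEL):
    """Build an OpenAI-style chat payload with a single user message."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        # Images are passed as a second content part, as in the curl example.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible responses carry the text in the first choice.
    return body["choices"][0]["message"]["content"]
```

With the server running, `post_chat(build_chat_payload("Describe this image in one sentence.", image_url))` should return the model's reply as a string, where `image_url` is the image address used in the curl example.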
Use Docker
docker model run hf.co/prithivMLmods/gemma-4-31B-it-F32-GGUF:

gemma-4-31B-it-F32-GGUF

Gemma-4-31B-it from Google is the flagship dense model in the Gemma 4 family. It has 31 billion parameters, targets workstation and server deployment, and offers a 256K context window. It accepts text and images (variable aspect ratios and resolutions) and supports agentic use cases, including step-by-step thinking modes, multilingual OCR and handwriting recognition, document and PDF parsing, UI and screen analysis, chart comprehension, and object detection with pointing.

The instruction-tuned variant is designed to bridge edge and cloud performance. It is positioned as delivering reasoning that rivals proprietary models 5 to 10 times larger across coding, math, multilingual tasks (140+ languages), and multimodal workflows, while keeping Google's production-grade safety alignment for enterprise use.

The model is released under the Apache 2.0 license and can run on NVIDIA and AMD GPUs via vLLM or llama.cpp. This enables local inference without cloud dependency for autonomous agents, function calling, structured data extraction, and complex planning. Within the family, it sits above the more efficient MoE siblings and is aimed at maximum output quality in reasoning-heavy applications.

Quick start with llama.cpp

llama-server -hf prithivMLmods/gemma-4-31B-it-F32-GGUF:F32
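The `:F32` suffix selects the 123 GB file, which may not fit in memory on many machines. The sketch below picks the largest quant whose file fits a given budget, using the sizes listed under Model Files, and builds the matching `llama-server` arguments. It assumes that the other tags (`BF16`, `F16`, `Q8_0`) follow the same `repo:TAG` pattern as `:F32`, which the source does not confirm. The helper names are hypothetical, and file size is only a lower bound on memory use because the context cache and the mmproj file add to it.

```python
REPO = "prithivMLmods/gemma-4-31B-it-F32-GGUF"

# (tag, file size in GB) taken from the Model Files table, largest first.
# Tags other than F32 are an assumption based on the Quant Type column.
QUANTS = [("F32", 123.0), ("BF16", 61.4), ("F16", 61.4), ("Q8_0", 32.6)]


def pick_quant(budget_gb):
    """Return the first (largest) tag whose file fits the budget, else None."""
    for tag, size_gb in QUANTS:
        if size_gb <= budget_gb:
            return tag
    return None


def llama_server_args(tag):
    """Argument list for subprocess.run, mirroring the quick-start command."""
    return ["llama-server", "-hf", f"{REPO}:{tag}"]
```

For example, `llama_server_args(pick_quant(40))` selects `Q8_0` and yields an argument list that can be passed to `subprocess.run`.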

Model Files

| File Name | Quant Type | File Size | File Link |
|---|---|---|---|
| gemma-4-31B-it.BF16.gguf | BF16 | 61.4 GB | Download |
| gemma-4-31B-it.F16.gguf | F16 | 61.4 GB | Download |
| gemma-4-31B-it.F32.gguf | F32 | 123 GB | Download |
| gemma-4-31B-it.Q8_0.gguf | Q8_0 | 32.6 GB | Download |
| gemma-4-31B-it.mmproj-bf16.gguf | mmproj-bf16 | 1.2 GB | Download |
| gemma-4-31B-it.mmproj-f16.gguf | mmproj-f16 | 1.2 GB | Download |
| gemma-4-31B-it.mmproj-f32.gguf | mmproj-f32 | 2.3 GB | Download |
| gemma-4-31B-it.mmproj-q8_0.gguf | mmproj-q8_0 | 810 MB | Download |
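The main model file sizes above are roughly what the parameter count predicts. As a back-of-the-envelope check, the sketch below assumes 31 billion parameters, decimal gigabytes, and llama.cpp's Q8_0 layout (blocks of 32 int8 weights plus one 16-bit scale, which works out to 8.5 bits per weight). The function name is hypothetical, and GGUF metadata and any tensors kept at higher precision are ignored.

```python
# Effective bits stored per weight for each quant type.
# Q8_0: 32 int8 values + one 16-bit scale = 34 bytes per 32 weights = 8.5 bits.
BITS_PER_WEIGHT = {"F32": 32.0, "F16": 16.0, "BF16": 16.0, "Q8_0": 8.5}


def estimate_gguf_size_gb(n_params, quant):
    """Approximate file size in decimal GB for a given parameter count."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    for quant in ("F32", "F16", "BF16", "Q8_0"):
        print(f"{quant}: {estimate_gguf_size_gb(31e9, quant):.1f} GB")
```

For 31e9 parameters this gives about 124 GB for F32, 62 GB for F16 and BF16, and 32.9 GB for Q8_0, close to the listed 123 GB, 61.4 GB, and 32.6 GB. The small gap is consistent with a true parameter count slightly below 31 billion, or with rounding in the listed sizes.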

Quants Usage

(Sorted by size, not necessarily by quality. IQ-quants are often preferable to similarly sized non-IQ quants.)

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):

[Figure: quant-type comparison graph by ikawrakow]

Format: GGUF
Model size: 31B params
Architecture: gemma4