metadata
base_model: google/gemma-4-26B-A4B-it
tags:
- gemma4
- multimodal
- image-text-to-text
- llm-compressor
- compressed-tensors
- gptq
- int4
- vllm
gemma-4-26B-A4B-it-W4A16
This repository contains a llm-compressor GPTQ export of google/gemma-4-26B-A4B-it using 4-bit weight-only quantization with bf16 activations.
What This Is
- Base model:
google/gemma-4-26B-A4B-it - Export format:
compressed-tensors/pack-quantized - Quantization style: W4A16 GPTQ
- Weight quantization: 4-bit signed integer, symmetric, grouped
- Weight group size:
64 - Modalities: text and image
This checkpoint uses group_size=64 intentionally. Gemma 4 26B A4B contains MoE down_proj widths such as 704 and 2112, which are not divisible by 128, so a default W4A16 G128 export is not valid for these layers.
Calibration Setup
- Text calibration baseline:
HuggingFaceH4/ultrachat_200k - Image calibration baseline:
lmms-lab/flickr30k - Calibration mode: mixed text/image
vLLM Serving
Serve Command (Works after "[Gemma4] Support quantized MoE" commit (3aecdf08b4a896a92e2cbd11c3d5a83d3c09abc1))
vllm serve dhruvil237/gemma-4-26B-A4B-it-W4A16 \
--gpu-memory-utilization 0.8 \
--reasoning-parser gemma4 \
--dtype float16