DiffusionGemma 26B A4B IT - W4A16 Compressed Tensors
This is a 4-bit W4A16 compressed-tensors quantization of
google/diffusiongemma-26B-A4B-it
for vLLM.
Quantization
- Format:
compressed-tensors - Weight bits: 4-bit integer
- Activation dtype: BF16
- Strategy: grouped weight-only quantization
- Group size: 64
- Calibration samples: 32
- Calibration sequence length: 1024
- AWQ smoothing: MLP mappings only
- Attention and MoE expert weights are packed W4A16, but attention/expert AWQ smoothing was not used.
Attention AWQ smoothing was disabled because llm-compressor's smoothing replay hit DiffusionGemma's block-diffusion attention-mask path. Full per-expert AWQ smoothing was also disabled because it required too much host RAM for the quantization machine. The final checkpoint still contains packed W4A16 weights for attention and MoE expert linear layers.
Verification
Validated with vllm/vllm-openai:gemma on an NVIDIA RTX A6000.
vLLM loaded the checkpoint as quantization=compressed-tensors, selected WNA16
Marlin/Humming kernels and the Marlin MoE backend, and generated successfully
from a chat-template prompt:
Hello, how can I help you today?
The checkpoint loaded at about 15.14 GiB model memory on the A6000 in the local verification run.
Example vLLM Usage
vllm serve pixelkaiser/diffusiongemma-26B-A4B-it-AWQ-MLP-W4A16-G64-S32-L1024 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.75 \
--generation-config vllm \
--reasoning-parser gemma4 \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--hf-overrides '{"diffusion_sampler":"entropy_bound","diffusion_entropy_bound":0.1}' \
--diffusion-config '{"canvas_length":256}'
Tune --max-num-seqs, --max-model-len, and --gpu-memory-utilization for
your GPU and workload.
- Downloads last month
- 2,711
Model tree for pixelkaiser/diffusiongemma-26B-A4B-it-AWQ-MLP-W4A16-G64-S32-L1024
Base model
google/diffusiongemma-26B-A4B-it