DiffusionGemma 26B A4B IT - W4A16 Compressed Tensors

This is a 4-bit W4A16 compressed-tensors quantization of google/diffusiongemma-26B-A4B-it for vLLM.

Quantization

Format: compressed-tensors
Weight bits: 4-bit integer
Activation dtype: BF16
Strategy: grouped weight-only quantization
Group size: 64
Calibration samples: 32
Calibration sequence length: 1024
AWQ smoothing: MLP mappings only
Attention and MoE expert weights are packed W4A16, but attention/expert AWQ smoothing was not used.

Attention AWQ smoothing was disabled because llm-compressor's smoothing replay hit DiffusionGemma's block-diffusion attention-mask path. Full per-expert AWQ smoothing was also disabled because it required too much host RAM for the quantization machine. The final checkpoint still contains packed W4A16 weights for attention and MoE expert linear layers.

Verification

Validated with vllm/vllm-openai:gemma on an NVIDIA RTX A6000.

vLLM loaded the checkpoint as quantization=compressed-tensors, selected WNA16 Marlin/Humming kernels and the Marlin MoE backend, and generated successfully from a chat-template prompt:

Hello, how can I help you today?

The checkpoint loaded at about 15.14 GiB model memory on the A6000 in the local verification run.

Example vLLM Usage

vllm serve pixelkaiser/diffusiongemma-26B-A4B-it-AWQ-MLP-W4A16-G64-S32-L1024 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.75 \
  --generation-config vllm \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --hf-overrides '{"diffusion_sampler":"entropy_bound","diffusion_entropy_bound":0.1}' \
  --diffusion-config '{"canvas_length":256}'

Tune --max-num-seqs, --max-model-len, and --gpu-memory-utilization for your GPU and workload.

Downloads last month: 2,711

Safetensors

Model size

29B params

Tensor type

I64

I32

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for pixelkaiser/diffusiongemma-26B-A4B-it-AWQ-MLP-W4A16-G64-S32-L1024

Base model

google/diffusiongemma-26B-A4B-it

Quantized

(26)

this model