Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A16 (AWQ-clip)

ModelOpt NVFP4 W4A16 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct, calibrated with NVIDIA TensorRT-Model-Optimizer's W4A16_NVFP4_CFG + awq_clip algorithm.

This checkpoint keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; multimodal encoders, talker, code2wav remain BF16.

Accuracy vs BF16 baseline

Benchmarked on RTX PRO 6000 Blackwell WS (sm_120), 0-shot greedy, seed=42. Daily-Omni accuracy uses the canonical vllm-omni in-tree harness.

Headline

Benchmark BF16 NVFP4 W4A16 awq Δ
MMLU (n=100) 0.72 0.72 0.00
GSM8K (n=100) 0.95 0.94 −0.01
Daily-Omni overall (n=50) 0.72 0.72 0.00

Within statistical noise on all three. Passes Daily-Omni's default 0.67 accept gate.

Daily-Omni by question type

Task type BF16 NVFP4 W4A16 awq n
Reasoning 1.000 1.000 11
Inference 1.000 0.750 4
AV Event Alignment 1.000 1.000 3
Comparative 0.667 0.667 9
Context understanding 0.500 0.600 10
Event Sequence 0.538 0.538 13

Daily-Omni by video duration

Duration BF16 NVFP4 W4A16 awq n
30s clips 0.727 0.697 33
60s clips 0.706 0.765 17

Quantization scope

Layer State
thinker.model.* attention QKV/O NVFP4 W4A16
thinker.model.*.mlp.experts.* (gate_proj, up_proj, down_proj) NVFP4 W4A16
thinker.model.*.mlp.gate (MoE router) BF16 (kept full-precision)
thinker.audio_tower.*, thinker.visual.*, thinker.lm_head BF16
talker, code2wav BF16

On-disk format

Tensor dtype Notes
*.weight uint8 Packed FP4 E2M1, 2 values per byte
*.weight_scale float8_e4m3fn Per-block scale (16-element groups along input dim)
*.weight_scale_2 float32 Per-tensor scalar scale

Disk size: ~26 GB (vs ~62 GB BF16, ~57% reduction).

Inference via vllm-omni

from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq")

Or via OpenAI-compatible server:

vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq \
    --omni --port 8000

Compute requirement: sm_80+ (Ampere or newer) for the FP4 Marlin GEMM path. On sm_120 (RTX 5090, Pro 6000) and sm_90 (H100/H200) the dequant-then-BF16 matmul kernel handles W4A16 correctly.

Note for vLLM ≤ v0.21.x users: The MoE FP4 Marlin selection that makes this checkpoint work needs vLLM PR #42566. vllm-omni's upcoming PR adds a patch.py backport that self-extinguishes once vLLM ≥0.22 ships.

Calibration recipe

  • Base model: Qwen/Qwen3-Omni-30B-A3B-Instruct in bfloat16
  • ModelOpt: nvidia-modelopt[torch] >= 0.42 (tested with 0.45.0.dev)
  • Config: mtq.W4A16_NVFP4_CFG + algorithm awq_clip
  • Samples: 1024 (wikitext-103-raw-v1 + openai_humaneval mix; truncated to 512 tokens)
  • Excluded patterns: *audio_tower*, *visual*, *talker*, *code2wav*, *lm_head*, *mlp.gate*

The exported hf_quant_config.json enumerates *mlp.gate* and *lm_head* in exclude_modules (added post-export because ModelOpt's exporter silently drops these from the wildcard exclusions when calibration was done correctly — see the vllm-omni PR for the upstream-tracking patch).

Variant: -max

A sibling checkpoint calibrated with ModelOpt's simpler max algorithm: YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max.

On the same benchmarks: MMLU 0.74, GSM8K 0.92, Daily-Omni 0.74 — slightly different accuracy/cost tradeoff (faster to calibrate, no AWQ pass).

License

Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the base model page for additional usage terms.

W4A4 NVFP4 siblings

Same model, broader quantization surface (weights and activations to FP4) for Blackwell-only deployments:

Downloads last month
264
Safetensors
Model size
20B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq

Quantized
(24)
this model