Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A16 (max)

ModelOpt NVFP4 W4A16 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct, calibrated with NVIDIA TensorRT-Model-Optimizer's W4A16_NVFP4_CFG + max algorithm (no AWQ pass).

This checkpoint keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; multimodal encoders, talker, code2wav remain BF16.

Accuracy vs BF16 baseline

Benchmarked on RTX PRO 6000 Blackwell WS (sm_120), 0-shot greedy, seed=42. Daily-Omni accuracy uses the canonical vllm-omni in-tree harness.

Headline

Benchmark BF16 NVFP4 W4A16 max Δ
MMLU (n=100) 0.72 0.74 +0.02
GSM8K (n=100) 0.95 0.92 −0.03
Daily-Omni overall (n=50) 0.72 0.74 +0.02

Within statistical noise on all three. Passes Daily-Omni's default 0.67 accept gate. The max variant edges BF16 slightly on MMLU and Daily-Omni at this sample size, trading ~3pt on GSM8K. The AWQ-clip variant has the opposite tradeoff (see sibling repo below).

Daily-Omni by question type

Task type BF16 NVFP4 W4A16 max n
Reasoning 1.000 1.000 11
Inference 1.000 1.000 4
AV Event Alignment 1.000 1.000 3
Comparative 0.667 0.778 9
Context understanding 0.500 0.600 10
Event Sequence 0.538 0.462 13

Daily-Omni by video duration

Duration BF16 NVFP4 W4A16 max n
30s clips 0.727 0.727 33
60s clips 0.706 0.765 17

Quantization scope

Layer State
thinker.model.* attention QKV/O NVFP4 W4A16
thinker.model.*.mlp.experts.* (gate_proj, up_proj, down_proj) NVFP4 W4A16
thinker.model.*.mlp.gate (MoE router) BF16 (kept full-precision)
thinker.audio_tower.*, thinker.visual.*, thinker.lm_head BF16
talker, code2wav BF16

On-disk format

Tensor dtype Notes
*.weight uint8 Packed FP4 E2M1, 2 values per byte
*.weight_scale float8_e4m3fn Per-block scale (16-element groups along input dim)
*.weight_scale_2 float32 Per-tensor scalar scale

Disk size: ~26 GB (vs ~62 GB BF16, ~57% reduction).

Inference via vllm-omni

from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max")

Or via OpenAI-compatible server:

vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max \
    --omni --port 8000

Compute requirement: sm_80+ (Ampere or newer) for the FP4 Marlin GEMM path.

Note for vLLM ≤ v0.21.x users: The MoE FP4 Marlin selection that makes this checkpoint work needs vLLM PR #42566. vllm-omni's upcoming PR adds a patch.py backport that self-extinguishes once vLLM ≥0.22 ships.

Calibration recipe

  • Base model: Qwen/Qwen3-Omni-30B-A3B-Instruct in bfloat16
  • ModelOpt: nvidia-modelopt[torch] >= 0.42
  • Config: mtq.W4A16_NVFP4_CFG + algorithm max
  • Samples: 1024 (wikitext-103-raw-v1 + openai_humaneval mix; truncated to 512 tokens)
  • Excluded patterns: *audio_tower*, *visual*, *talker*, *code2wav*, *lm_head*, *mlp.gate*

The exported hf_quant_config.json enumerates *mlp.gate* and *lm_head* in exclude_modules (added post-export because ModelOpt's exporter silently drops these wildcard patterns even when calibration honors them).

Variant: -awq

A sibling checkpoint calibrated with ModelOpt's awq_clip algorithm: YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq.

On the same benchmarks: MMLU 0.72, GSM8K 0.94, Daily-Omni 0.72 — flatter accuracy profile, slightly slower to calibrate.

License

Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the base model page for additional usage terms.

W4A4 NVFP4 siblings

Same model, broader quantization surface (weights and activations to FP4) for Blackwell-only deployments:

Downloads last month
209
Safetensors
Model size
20B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max

Quantized
(24)
this model