Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A16 (max)

ModelOpt NVFP4 W4A16 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct, calibrated with NVIDIA TensorRT-Model-Optimizer's W4A16_NVFP4_CFG + max algorithm (no AWQ pass).

This checkpoint keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; multimodal encoders, talker, code2wav remain BF16.

Accuracy vs BF16 baseline

Benchmarked on RTX PRO 6000 Blackwell WS (sm_120), 0-shot greedy, seed=42. Daily-Omni accuracy uses the canonical vllm-omni in-tree harness.

Headline

Benchmark	BF16	NVFP4 W4A16 max	Δ
MMLU (n=100)	0.72	0.74	+0.02
GSM8K (n=100)	0.95	0.92	−0.03
Daily-Omni overall (n=50)	0.72	0.74	+0.02

Within statistical noise on all three. Passes Daily-Omni's default 0.67 accept gate. The max variant edges BF16 slightly on MMLU and Daily-Omni at this sample size, trading ~3pt on GSM8K. The AWQ-clip variant has the opposite tradeoff (see sibling repo below).

Daily-Omni by question type

Task type	BF16	NVFP4 W4A16 max	n
Reasoning	1.000	1.000	11
Inference	1.000	1.000	4
AV Event Alignment	1.000	1.000	3
Comparative	0.667	0.778	9
Context understanding	0.500	0.600	10
Event Sequence	0.538	0.462	13

Daily-Omni by video duration

Duration	BF16	NVFP4 W4A16 max	n
30s clips	0.727	0.727	33
60s clips	0.706	0.765	17

Quantization scope

Layer	State
`thinker.model.*` attention QKV/O	NVFP4 W4A16
`thinker.model..mlp.experts.` (gate_proj, up_proj, down_proj)	NVFP4 W4A16
`thinker.model.*.mlp.gate` (MoE router)	BF16 (kept full-precision)
`thinker.audio_tower.`, `thinker.visual.`, `thinker.lm_head`	BF16
`talker`, `code2wav`	BF16

On-disk format

Tensor	dtype	Notes
`*.weight`	uint8	Packed FP4 E2M1, 2 values per byte
`*.weight_scale`	float8_e4m3fn	Per-block scale (16-element groups along input dim)
`*.weight_scale_2`	float32	Per-tensor scalar scale

Disk size: ~26 GB (vs ~62 GB BF16, ~57% reduction).

Inference via vllm-omni

from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max")

Or via OpenAI-compatible server:

vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max \
    --omni --port 8000

Compute requirement: sm_80+ (Ampere or newer) for the FP4 Marlin GEMM path.

Note for vLLM ≤ v0.21.x users: The MoE FP4 Marlin selection that makes this checkpoint work needs vLLM PR #42566. vllm-omni's upcoming PR adds a patch.py backport that self-extinguishes once vLLM ≥0.22 ships.

Calibration recipe

Base model: Qwen/Qwen3-Omni-30B-A3B-Instruct in bfloat16
ModelOpt: nvidia-modelopt[torch] >= 0.42
Config: mtq.W4A16_NVFP4_CFG + algorithm max
Samples: 1024 (wikitext-103-raw-v1 + openai_humaneval mix; truncated to 512 tokens)
Excluded patterns: *audio_tower*, *visual*, *talker*, *code2wav*, *lm_head*, *mlp.gate*

The exported hf_quant_config.json enumerates *mlp.gate* and *lm_head* in exclude_modules (added post-export because ModelOpt's exporter silently drops these wildcard patterns even when calibration honors them).

Variant: `-awq`

A sibling checkpoint calibrated with ModelOpt's awq_clip algorithm: YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq.

On the same benchmarks: MMLU 0.72, GSM8K 0.94, Daily-Omni 0.72 — flatter accuracy profile, slightly slower to calibrate.

License

Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the base model page for additional usage terms.

W4A4 NVFP4 siblings

Same model, broader quantization surface (weights and activations to FP4) for Blackwell-only deployments:

YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip — production W4A4, AWQ-clip text calibration, B200 throughput +33% vs BF16 at conc=32. Requires vllm-omni with the load-time NaN clamp from vllm-project/vllm-omni#4025 (defensive override, no-op for clean checkpoints).

Downloads last month: 209

Safetensors

Model size

20B params

Tensor type

BF16

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max

Base model

Qwen/Qwen3-Omni-30B-A3B-Instruct

Quantized

(24)

this model