Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A16 (AWQ-clip)

ModelOpt NVFP4 W4A16 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct, calibrated with NVIDIA TensorRT-Model-Optimizer's W4A16_NVFP4_CFG + awq_clip algorithm.

This checkpoint keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; multimodal encoders, talker, code2wav remain BF16.

Accuracy vs BF16 baseline

Benchmarked on RTX PRO 6000 Blackwell WS (sm_120), 0-shot greedy, seed=42. Daily-Omni accuracy uses the canonical vllm-omni in-tree harness.

Headline

Benchmark	BF16	NVFP4 W4A16 awq	Δ
MMLU (n=100)	0.72	0.72	0.00
GSM8K (n=100)	0.95	0.94	−0.01
Daily-Omni overall (n=50)	0.72	0.72	0.00

Within statistical noise on all three. Passes Daily-Omni's default 0.67 accept gate.

Daily-Omni by question type

Task type	BF16	NVFP4 W4A16 awq	n
Reasoning	1.000	1.000	11
Inference	1.000	0.750	4
AV Event Alignment	1.000	1.000	3
Comparative	0.667	0.667	9
Context understanding	0.500	0.600	10
Event Sequence	0.538	0.538	13

Daily-Omni by video duration

Duration	BF16	NVFP4 W4A16 awq	n
30s clips	0.727	0.697	33
60s clips	0.706	0.765	17

Quantization scope

Layer	State
`thinker.model.*` attention QKV/O	NVFP4 W4A16
`thinker.model..mlp.experts.` (gate_proj, up_proj, down_proj)	NVFP4 W4A16
`thinker.model.*.mlp.gate` (MoE router)	BF16 (kept full-precision)
`thinker.audio_tower.`, `thinker.visual.`, `thinker.lm_head`	BF16
`talker`, `code2wav`	BF16

On-disk format

Tensor	dtype	Notes
`*.weight`	uint8	Packed FP4 E2M1, 2 values per byte
`*.weight_scale`	float8_e4m3fn	Per-block scale (16-element groups along input dim)
`*.weight_scale_2`	float32	Per-tensor scalar scale

Disk size: ~26 GB (vs ~62 GB BF16, ~57% reduction).

Inference via vllm-omni

from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq")

Or via OpenAI-compatible server:

vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq \
    --omni --port 8000

Compute requirement: sm_80+ (Ampere or newer) for the FP4 Marlin GEMM path. On sm_120 (RTX 5090, Pro 6000) and sm_90 (H100/H200) the dequant-then-BF16 matmul kernel handles W4A16 correctly.

Note for vLLM ≤ v0.21.x users: The MoE FP4 Marlin selection that makes this checkpoint work needs vLLM PR #42566. vllm-omni's upcoming PR adds a patch.py backport that self-extinguishes once vLLM ≥0.22 ships.

Calibration recipe

Base model: Qwen/Qwen3-Omni-30B-A3B-Instruct in bfloat16
ModelOpt: nvidia-modelopt[torch] >= 0.42 (tested with 0.45.0.dev)
Config: mtq.W4A16_NVFP4_CFG + algorithm awq_clip
Samples: 1024 (wikitext-103-raw-v1 + openai_humaneval mix; truncated to 512 tokens)
Excluded patterns: *audio_tower*, *visual*, *talker*, *code2wav*, *lm_head*, *mlp.gate*

The exported hf_quant_config.json enumerates *mlp.gate* and *lm_head* in exclude_modules (added post-export because ModelOpt's exporter silently drops these from the wildcard exclusions when calibration was done correctly — see the vllm-omni PR for the upstream-tracking patch).

Variant: `-max`

A sibling checkpoint calibrated with ModelOpt's simpler max algorithm: YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max.

On the same benchmarks: MMLU 0.74, GSM8K 0.92, Daily-Omni 0.74 — slightly different accuracy/cost tradeoff (faster to calibrate, no AWQ pass).

License

Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the base model page for additional usage terms.

W4A4 NVFP4 siblings

Same model, broader quantization surface (weights and activations to FP4) for Blackwell-only deployments:

YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip — production W4A4, AWQ-clip text calibration, B200 throughput +33% vs BF16 at conc=32. Requires vllm-omni with the load-time NaN clamp from vllm-project/vllm-omni#4025 (defensive override, no-op for clean checkpoints).

Downloads last month: 264

Safetensors

Model size

20B params

Tensor type

BF16

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq

Base model

Qwen/Qwen3-Omni-30B-A3B-Instruct

Quantized

(24)

this model