Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A4 (experts-only, MSE)

ModelOpt NVFP4 W4A4 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct, calibrated with NVIDIA TensorRT-Model-Optimizer's NVFP4_EXPERTS_ONLY_CFG + MSE weight calibration with FP8 scale sweep.

Both weights and activations are quantized to FP4 for the MoE expert MLP path — the most parameter-heavy part of the model — while attention, embeddings, norms, the MoE router, and all multimodal encoders stay in BF16. The result keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; talker and code2wav remain BF16.

Recommendation: if you want the broadest W4A4 quantization scope and best B200 throughput, prefer YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip. This experts-only variant is the narrowest W4A4 scope we publish and is useful as a comparison point / minimal W4A4 baseline.

Accuracy vs BF16 baseline

Benchmarked on RTX PRO 6000 Blackwell WS, 0-shot greedy, seed=42. Daily-Omni accuracy uses the canonical vllm-omni in-tree harness.

Headline

Benchmark	BF16	NVFP4 W4A4 experts-mse	Δ
MMLU (n=100)	0.72	0.72	0.00
GSM8K (n=100)	0.95	0.93	-0.02
Daily-Omni overall (n=50)	0.72	0.72	0.00

Within statistical noise on MMLU and Daily-Omni; 2pt under BF16 on GSM8K. Passes Daily-Omni's default 0.67 accept gate.

Daily-Omni by question type

Task type	BF16	NVFP4 W4A4 experts-mse	n
Reasoning	1.000	1.000	11
Inference	1.000	1.000	4
AV Event Alignment	1.000	1.000	3
Comparative	0.667	0.667	9
Context understanding	0.500	0.500	10
Event Sequence	0.538	0.538	13

Multimodal smoke test

All five input modality paths produce coherent text on the canonical vLLM demo assets (verified via examples/offline_inference/qwen3_omni/end2end.py):

Modality	Result
audio	✓ Identified "Mary Had a Little Lamb" + Edison phonograph context
image	✓ Described cherry blossoms + Tokyo Skytree
video	✓ Described `baby_reading` clip and its humor
mixed (a+i+v)	✓ Summarized all three streams separately
`use_audio_in_video`	✓ Described video + attempted to transcribe baby's audio

Quantization scope

Layer	State
`thinker.model..mlp.experts.` (gate_proj, up_proj, down_proj)	NVFP4 W4A4
`thinker.model.*` attention QKV/O	BF16 (kept full-precision)
`thinker.model.*` embeddings, norms	BF16
`thinker.model.*.mlp.gate` (MoE router)	BF16
`thinker.audio_tower.`, `thinker.visual.`, `thinker.lm_head`	BF16
`talker`, `code2wav`	BF16

Calibration recipe

Base model: Qwen/Qwen3-Omni-30B-A3B-Instruct in bfloat16
ModelOpt: nvidia-modelopt==0.44.0 (vanilla; see Mitigations below)
Config: mtq.NVFP4_EXPERTS_ONLY_CFG, algorithm overridden to {"method": "mse", "fp8_scale_sweep": true}
Samples: 512 from HuggingFaceH4/ultrachat_200k (instruction-style, exercises diverse MoE expert routing); truncated to 2048 tokens per sample
Excluded patterns: *audio_tower*, *visual*, *talker*, *code2wav*, *lm_head*, *mlp.gate* (all 48 routers explicitly per-layer)
Calibration time: ~140 min on a single RTX PRO 6000 Blackwell WS

Calibration caveat

Roughly 10 of the 128 experts (specifically: 40, 43, 56, 72, 73, 81, 87, 91, 108, 121) were not activated during the 512-sample calibration pass, so their weight quantizers fell back to a weight-derived amax rather than an observed activation-driven one. This did not visibly impact accuracy on our benchmarks, but raising calibration sample count or using router-balanced data could close the remaining gap.

Inference

from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-mse")

OpenAI-compatible server:

vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-mse \
    --omni --port 8000

Do not pass --enforce-eager for benchmarks. CUDA graphs amortize kernel launch overhead and unlock the FP4 throughput wins; with --enforce-eager set, W4A4 TPOT degrades ~10x relative to the CUDA-graph configuration.

Compute requirement: sm_100+ (Blackwell — B100, B200, RTX 5090, RTX Pro 6000) for native FP4 tensor cores. On non-Blackwell GPUs the checkpoint loads via the emulation path (BF16 dequant + matmul) — correct but loses the W4A4 latency benefit.

ModelOpt 0.44 NaN regression — two mitigation paths

ModelOpt 0.44's float32 -> torch.float8_e4m3fn cast of per-block weight_scale occasionally emits literal NaN bytes (E4M3 encoding 0x7F / 0xFF) when the pre-cast scale rounds above the FP8 max of 448 after the global-scale division. A single NaN byte in any weight_scale propagates through the FlashInfer FP4 GEMM into the residual stream and collapses the served model output to !!!!. Two complementary fixes:

Calibration-time (ModelOpt-side): clamp the pre-cast values to torch.finfo(torch.float8_e4m3fn).max before every .to(torch.float8_e4m3fn) at the two cast sites in modelopt/torch/quantization/qtensor/nvfp4_tensor.py and modelopt/torch/export/quant_utils.py. This checkpoint was calibrated with vanilla ModelOpt 0.44 (before that fix was added), but the experts-only quantization scope did not trigger the NaN-byte cast path in any block we observed — we have not seen !!!! failures when serving this checkpoint on stock vllm-omni.
Load-time (vllm-omni-side): vllm-project/vllm-omni#4025 installs a defensive override of ModelOptNvFp4LinearMethod.process_weights_after_loading that scans weight_scale for NaN bytes and clamps them to FP8 E4M3 max at worker init. For this checkpoint the override is a no-op safety net; it primarily protects wider-scope W4A4 NVFP4 checkpoints that were exported with vanilla ModelOpt 0.44 and currently serve as !!!!. Self-extinguishes once vllm-omni's vllm pin includes the corresponding upstream vLLM fix.

Variant: W4A16 siblings

If your platform doesn't have FP4 tensor cores (pre-Blackwell), prefer the W4A16 variants — same scope, same size, only weights are FP4 while activations stay BF16:

License

Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the base model page for additional usage terms.

Downloads last month: 76

Safetensors

Model size

21B params

Tensor type

BF16

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-mse

Base model

Qwen/Qwen3-Omni-30B-A3B-Instruct

Quantized

(24)

this model