Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A4 (experts-only, MSE)
ModelOpt NVFP4 W4A4 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct, calibrated with NVIDIA TensorRT-Model-Optimizer's NVFP4_EXPERTS_ONLY_CFG + MSE weight calibration with FP8 scale sweep.
Both weights and activations are quantized to FP4 for the MoE expert MLP path — the most parameter-heavy part of the model — while attention, embeddings, norms, the MoE router, and all multimodal encoders stay in BF16. The result keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; talker and code2wav remain BF16.
Recommendation: if you want the broadest W4A4 quantization scope and best B200 throughput, prefer
YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip. This experts-only variant is the narrowest W4A4 scope we publish and is useful as a comparison point / minimal W4A4 baseline.
Accuracy vs BF16 baseline
Benchmarked on RTX PRO 6000 Blackwell WS, 0-shot greedy, seed=42. Daily-Omni accuracy uses the canonical vllm-omni in-tree harness.
Headline
| Benchmark | BF16 | NVFP4 W4A4 experts-mse | Δ |
|---|---|---|---|
| MMLU (n=100) | 0.72 | 0.72 | 0.00 |
| GSM8K (n=100) | 0.95 | 0.93 | -0.02 |
| Daily-Omni overall (n=50) | 0.72 | 0.72 | 0.00 |
Within statistical noise on MMLU and Daily-Omni; 2pt under BF16 on GSM8K. Passes Daily-Omni's default 0.67 accept gate.
Daily-Omni by question type
| Task type | BF16 | NVFP4 W4A4 experts-mse | n |
|---|---|---|---|
| Reasoning | 1.000 | 1.000 | 11 |
| Inference | 1.000 | 1.000 | 4 |
| AV Event Alignment | 1.000 | 1.000 | 3 |
| Comparative | 0.667 | 0.667 | 9 |
| Context understanding | 0.500 | 0.500 | 10 |
| Event Sequence | 0.538 | 0.538 | 13 |
Multimodal smoke test
All five input modality paths produce coherent text on the canonical vLLM demo assets (verified via examples/offline_inference/qwen3_omni/end2end.py):
| Modality | Result |
|---|---|
| audio | ✓ Identified "Mary Had a Little Lamb" + Edison phonograph context |
| image | ✓ Described cherry blossoms + Tokyo Skytree |
| video | ✓ Described baby_reading clip and its humor |
| mixed (a+i+v) | ✓ Summarized all three streams separately |
use_audio_in_video |
✓ Described video + attempted to transcribe baby's audio |
Quantization scope
| Layer | State |
|---|---|
thinker.model.*.mlp.experts.* (gate_proj, up_proj, down_proj) |
NVFP4 W4A4 |
thinker.model.* attention QKV/O |
BF16 (kept full-precision) |
thinker.model.* embeddings, norms |
BF16 |
thinker.model.*.mlp.gate (MoE router) |
BF16 |
thinker.audio_tower.*, thinker.visual.*, thinker.lm_head |
BF16 |
talker, code2wav |
BF16 |
Calibration recipe
- Base model:
Qwen/Qwen3-Omni-30B-A3B-Instructinbfloat16 - ModelOpt:
nvidia-modelopt==0.44.0(vanilla; see Mitigations below) - Config:
mtq.NVFP4_EXPERTS_ONLY_CFG, algorithm overridden to{"method": "mse", "fp8_scale_sweep": true} - Samples: 512 from
HuggingFaceH4/ultrachat_200k(instruction-style, exercises diverse MoE expert routing); truncated to 2048 tokens per sample - Excluded patterns:
*audio_tower*,*visual*,*talker*,*code2wav*,*lm_head*,*mlp.gate*(all 48 routers explicitly per-layer) - Calibration time: ~140 min on a single RTX PRO 6000 Blackwell WS
Calibration caveat
Roughly 10 of the 128 experts (specifically: 40, 43, 56, 72, 73, 81, 87, 91, 108, 121) were not activated during the 512-sample calibration pass, so their weight quantizers fell back to a weight-derived amax rather than an observed activation-driven one. This did not visibly impact accuracy on our benchmarks, but raising calibration sample count or using router-balanced data could close the remaining gap.
Inference
from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-mse")
OpenAI-compatible server:
vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-mse \
--omni --port 8000
Do not pass
--enforce-eagerfor benchmarks. CUDA graphs amortize kernel launch overhead and unlock the FP4 throughput wins; with--enforce-eagerset, W4A4 TPOT degrades ~10x relative to the CUDA-graph configuration.
Compute requirement: sm_100+ (Blackwell — B100, B200, RTX 5090, RTX Pro 6000) for native FP4 tensor cores. On non-Blackwell GPUs the checkpoint loads via the emulation path (BF16 dequant + matmul) — correct but loses the W4A4 latency benefit.
ModelOpt 0.44 NaN regression — two mitigation paths
ModelOpt 0.44's float32 -> torch.float8_e4m3fn cast of per-block weight_scale occasionally emits literal NaN bytes (E4M3 encoding 0x7F / 0xFF) when the pre-cast scale rounds above the FP8 max of 448 after the global-scale division. A single NaN byte in any weight_scale propagates through the FlashInfer FP4 GEMM into the residual stream and collapses the served model output to !!!!. Two complementary fixes:
Calibration-time (ModelOpt-side): clamp the pre-cast values to
torch.finfo(torch.float8_e4m3fn).maxbefore every.to(torch.float8_e4m3fn)at the two cast sites inmodelopt/torch/quantization/qtensor/nvfp4_tensor.pyandmodelopt/torch/export/quant_utils.py. This checkpoint was calibrated with vanilla ModelOpt 0.44 (before that fix was added), but the experts-only quantization scope did not trigger the NaN-byte cast path in any block we observed — we have not seen!!!!failures when serving this checkpoint on stock vllm-omni.Load-time (vllm-omni-side): vllm-project/vllm-omni#4025 installs a defensive override of
ModelOptNvFp4LinearMethod.process_weights_after_loadingthat scansweight_scalefor NaN bytes and clamps them to FP8 E4M3 max at worker init. For this checkpoint the override is a no-op safety net; it primarily protects wider-scope W4A4 NVFP4 checkpoints that were exported with vanilla ModelOpt 0.44 and currently serve as!!!!. Self-extinguishes once vllm-omni's vllm pin includes the corresponding upstream vLLM fix.
Variant: W4A16 siblings
If your platform doesn't have FP4 tensor cores (pre-Blackwell), prefer the W4A16 variants — same scope, same size, only weights are FP4 while activations stay BF16:
YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awqYihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max
License
Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the base model page for additional usage terms.
- Downloads last month
- 76
Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-mse
Base model
Qwen/Qwen3-Omni-30B-A3B-Instruct