Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A4 (experts-only, MSE)

ModelOpt NVFP4 W4A4 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct, calibrated with NVIDIA TensorRT-Model-Optimizer's NVFP4_EXPERTS_ONLY_CFG + MSE weight calibration with FP8 scale sweep.

Both weights and activations are quantized to FP4 for the MoE expert MLP path — the most parameter-heavy part of the model — while attention, embeddings, norms, the MoE router, and all multimodal encoders stay in BF16. The result keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; talker and code2wav remain BF16.

Recommendation: if you want the broadest W4A4 quantization scope and best B200 throughput, prefer YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip. This experts-only variant is the narrowest W4A4 scope we publish and is useful as a comparison point / minimal W4A4 baseline.

Accuracy vs BF16 baseline

Benchmarked on RTX PRO 6000 Blackwell WS, 0-shot greedy, seed=42. Daily-Omni accuracy uses the canonical vllm-omni in-tree harness.

Headline

Benchmark BF16 NVFP4 W4A4 experts-mse Δ
MMLU (n=100) 0.72 0.72 0.00
GSM8K (n=100) 0.95 0.93 -0.02
Daily-Omni overall (n=50) 0.72 0.72 0.00

Within statistical noise on MMLU and Daily-Omni; 2pt under BF16 on GSM8K. Passes Daily-Omni's default 0.67 accept gate.

Daily-Omni by question type

Task type BF16 NVFP4 W4A4 experts-mse n
Reasoning 1.000 1.000 11
Inference 1.000 1.000 4
AV Event Alignment 1.000 1.000 3
Comparative 0.667 0.667 9
Context understanding 0.500 0.500 10
Event Sequence 0.538 0.538 13

Multimodal smoke test

All five input modality paths produce coherent text on the canonical vLLM demo assets (verified via examples/offline_inference/qwen3_omni/end2end.py):

Modality Result
audio ✓ Identified "Mary Had a Little Lamb" + Edison phonograph context
image ✓ Described cherry blossoms + Tokyo Skytree
video ✓ Described baby_reading clip and its humor
mixed (a+i+v) ✓ Summarized all three streams separately
use_audio_in_video ✓ Described video + attempted to transcribe baby's audio

Quantization scope

Layer State
thinker.model.*.mlp.experts.* (gate_proj, up_proj, down_proj) NVFP4 W4A4
thinker.model.* attention QKV/O BF16 (kept full-precision)
thinker.model.* embeddings, norms BF16
thinker.model.*.mlp.gate (MoE router) BF16
thinker.audio_tower.*, thinker.visual.*, thinker.lm_head BF16
talker, code2wav BF16

Calibration recipe

  • Base model: Qwen/Qwen3-Omni-30B-A3B-Instruct in bfloat16
  • ModelOpt: nvidia-modelopt==0.44.0 (vanilla; see Mitigations below)
  • Config: mtq.NVFP4_EXPERTS_ONLY_CFG, algorithm overridden to {"method": "mse", "fp8_scale_sweep": true}
  • Samples: 512 from HuggingFaceH4/ultrachat_200k (instruction-style, exercises diverse MoE expert routing); truncated to 2048 tokens per sample
  • Excluded patterns: *audio_tower*, *visual*, *talker*, *code2wav*, *lm_head*, *mlp.gate* (all 48 routers explicitly per-layer)
  • Calibration time: ~140 min on a single RTX PRO 6000 Blackwell WS

Calibration caveat

Roughly 10 of the 128 experts (specifically: 40, 43, 56, 72, 73, 81, 87, 91, 108, 121) were not activated during the 512-sample calibration pass, so their weight quantizers fell back to a weight-derived amax rather than an observed activation-driven one. This did not visibly impact accuracy on our benchmarks, but raising calibration sample count or using router-balanced data could close the remaining gap.

Inference

from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-mse")

OpenAI-compatible server:

vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-mse \
    --omni --port 8000

Do not pass --enforce-eager for benchmarks. CUDA graphs amortize kernel launch overhead and unlock the FP4 throughput wins; with --enforce-eager set, W4A4 TPOT degrades ~10x relative to the CUDA-graph configuration.

Compute requirement: sm_100+ (Blackwell — B100, B200, RTX 5090, RTX Pro 6000) for native FP4 tensor cores. On non-Blackwell GPUs the checkpoint loads via the emulation path (BF16 dequant + matmul) — correct but loses the W4A4 latency benefit.

ModelOpt 0.44 NaN regression — two mitigation paths

ModelOpt 0.44's float32 -> torch.float8_e4m3fn cast of per-block weight_scale occasionally emits literal NaN bytes (E4M3 encoding 0x7F / 0xFF) when the pre-cast scale rounds above the FP8 max of 448 after the global-scale division. A single NaN byte in any weight_scale propagates through the FlashInfer FP4 GEMM into the residual stream and collapses the served model output to !!!!. Two complementary fixes:

  1. Calibration-time (ModelOpt-side): clamp the pre-cast values to torch.finfo(torch.float8_e4m3fn).max before every .to(torch.float8_e4m3fn) at the two cast sites in modelopt/torch/quantization/qtensor/nvfp4_tensor.py and modelopt/torch/export/quant_utils.py. This checkpoint was calibrated with vanilla ModelOpt 0.44 (before that fix was added), but the experts-only quantization scope did not trigger the NaN-byte cast path in any block we observed — we have not seen !!!! failures when serving this checkpoint on stock vllm-omni.

  2. Load-time (vllm-omni-side): vllm-project/vllm-omni#4025 installs a defensive override of ModelOptNvFp4LinearMethod.process_weights_after_loading that scans weight_scale for NaN bytes and clamps them to FP8 E4M3 max at worker init. For this checkpoint the override is a no-op safety net; it primarily protects wider-scope W4A4 NVFP4 checkpoints that were exported with vanilla ModelOpt 0.44 and currently serve as !!!!. Self-extinguishes once vllm-omni's vllm pin includes the corresponding upstream vLLM fix.

Variant: W4A16 siblings

If your platform doesn't have FP4 tensor cores (pre-Blackwell), prefer the W4A16 variants — same scope, same size, only weights are FP4 while activations stay BF16:

License

Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the base model page for additional usage terms.

Downloads last month
76
Safetensors
Model size
21B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-mse

Quantized
(24)
this model