Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A16 (max)
ModelOpt NVFP4 W4A16 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct, calibrated with NVIDIA TensorRT-Model-Optimizer's W4A16_NVFP4_CFG + max algorithm (no AWQ pass).
This checkpoint keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; multimodal encoders, talker, code2wav remain BF16.
Accuracy vs BF16 baseline
Benchmarked on RTX PRO 6000 Blackwell WS (sm_120), 0-shot greedy, seed=42. Daily-Omni accuracy uses the canonical vllm-omni in-tree harness.
Headline
| Benchmark | BF16 | NVFP4 W4A16 max | Δ |
|---|---|---|---|
| MMLU (n=100) | 0.72 | 0.74 | +0.02 |
| GSM8K (n=100) | 0.95 | 0.92 | −0.03 |
| Daily-Omni overall (n=50) | 0.72 | 0.74 | +0.02 |
Within statistical noise on all three. Passes Daily-Omni's default 0.67 accept gate. The max variant edges BF16 slightly on MMLU and Daily-Omni at this sample size, trading ~3pt on GSM8K. The AWQ-clip variant has the opposite tradeoff (see sibling repo below).
Daily-Omni by question type
| Task type | BF16 | NVFP4 W4A16 max | n |
|---|---|---|---|
| Reasoning | 1.000 | 1.000 | 11 |
| Inference | 1.000 | 1.000 | 4 |
| AV Event Alignment | 1.000 | 1.000 | 3 |
| Comparative | 0.667 | 0.778 | 9 |
| Context understanding | 0.500 | 0.600 | 10 |
| Event Sequence | 0.538 | 0.462 | 13 |
Daily-Omni by video duration
| Duration | BF16 | NVFP4 W4A16 max | n |
|---|---|---|---|
| 30s clips | 0.727 | 0.727 | 33 |
| 60s clips | 0.706 | 0.765 | 17 |
Quantization scope
| Layer | State |
|---|---|
thinker.model.* attention QKV/O |
NVFP4 W4A16 |
thinker.model.*.mlp.experts.* (gate_proj, up_proj, down_proj) |
NVFP4 W4A16 |
thinker.model.*.mlp.gate (MoE router) |
BF16 (kept full-precision) |
thinker.audio_tower.*, thinker.visual.*, thinker.lm_head |
BF16 |
talker, code2wav |
BF16 |
On-disk format
| Tensor | dtype | Notes |
|---|---|---|
*.weight |
uint8 | Packed FP4 E2M1, 2 values per byte |
*.weight_scale |
float8_e4m3fn | Per-block scale (16-element groups along input dim) |
*.weight_scale_2 |
float32 | Per-tensor scalar scale |
Disk size: ~26 GB (vs ~62 GB BF16, ~57% reduction).
Inference via vllm-omni
from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max")
Or via OpenAI-compatible server:
vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max \
--omni --port 8000
Compute requirement: sm_80+ (Ampere or newer) for the FP4 Marlin GEMM path.
Note for vLLM ≤ v0.21.x users: The MoE FP4 Marlin selection that makes this checkpoint work needs vLLM PR #42566. vllm-omni's upcoming PR adds a
patch.pybackport that self-extinguishes once vLLM ≥0.22 ships.
Calibration recipe
- Base model:
Qwen/Qwen3-Omni-30B-A3B-Instructinbfloat16 - ModelOpt:
nvidia-modelopt[torch] >= 0.42 - Config:
mtq.W4A16_NVFP4_CFG+ algorithmmax - Samples: 1024 (wikitext-103-raw-v1 + openai_humaneval mix; truncated to 512 tokens)
- Excluded patterns:
*audio_tower*,*visual*,*talker*,*code2wav*,*lm_head*,*mlp.gate*
The exported hf_quant_config.json enumerates *mlp.gate* and *lm_head* in exclude_modules (added post-export because ModelOpt's exporter silently drops these wildcard patterns even when calibration honors them).
Variant: -awq
A sibling checkpoint calibrated with ModelOpt's awq_clip algorithm:
YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq.
On the same benchmarks: MMLU 0.72, GSM8K 0.94, Daily-Omni 0.72 — flatter accuracy profile, slightly slower to calibrate.
License
Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the base model page for additional usage terms.
W4A4 NVFP4 siblings
Same model, broader quantization surface (weights and activations to FP4) for Blackwell-only deployments:
YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip— production W4A4, AWQ-clip text calibration, B200 throughput +33% vs BF16 at conc=32. Requiresvllm-omniwith the load-time NaN clamp from vllm-project/vllm-omni#4025 (defensive override, no-op for clean checkpoints).
- Downloads last month
- 209
Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max
Base model
Qwen/Qwen3-Omni-30B-A3B-Instruct