Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A16 (AWQ-clip)
ModelOpt NVFP4 W4A16 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct, calibrated with NVIDIA TensorRT-Model-Optimizer's W4A16_NVFP4_CFG + awq_clip algorithm.
This checkpoint keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; multimodal encoders, talker, code2wav remain BF16.
Accuracy vs BF16 baseline
Benchmarked on RTX PRO 6000 Blackwell WS (sm_120), 0-shot greedy, seed=42. Daily-Omni accuracy uses the canonical vllm-omni in-tree harness.
Headline
| Benchmark | BF16 | NVFP4 W4A16 awq | Δ |
|---|---|---|---|
| MMLU (n=100) | 0.72 | 0.72 | 0.00 |
| GSM8K (n=100) | 0.95 | 0.94 | −0.01 |
| Daily-Omni overall (n=50) | 0.72 | 0.72 | 0.00 |
Within statistical noise on all three. Passes Daily-Omni's default 0.67 accept gate.
Daily-Omni by question type
| Task type | BF16 | NVFP4 W4A16 awq | n |
|---|---|---|---|
| Reasoning | 1.000 | 1.000 | 11 |
| Inference | 1.000 | 0.750 | 4 |
| AV Event Alignment | 1.000 | 1.000 | 3 |
| Comparative | 0.667 | 0.667 | 9 |
| Context understanding | 0.500 | 0.600 | 10 |
| Event Sequence | 0.538 | 0.538 | 13 |
Daily-Omni by video duration
| Duration | BF16 | NVFP4 W4A16 awq | n |
|---|---|---|---|
| 30s clips | 0.727 | 0.697 | 33 |
| 60s clips | 0.706 | 0.765 | 17 |
Quantization scope
| Layer | State |
|---|---|
thinker.model.* attention QKV/O |
NVFP4 W4A16 |
thinker.model.*.mlp.experts.* (gate_proj, up_proj, down_proj) |
NVFP4 W4A16 |
thinker.model.*.mlp.gate (MoE router) |
BF16 (kept full-precision) |
thinker.audio_tower.*, thinker.visual.*, thinker.lm_head |
BF16 |
talker, code2wav |
BF16 |
On-disk format
| Tensor | dtype | Notes |
|---|---|---|
*.weight |
uint8 | Packed FP4 E2M1, 2 values per byte |
*.weight_scale |
float8_e4m3fn | Per-block scale (16-element groups along input dim) |
*.weight_scale_2 |
float32 | Per-tensor scalar scale |
Disk size: ~26 GB (vs ~62 GB BF16, ~57% reduction).
Inference via vllm-omni
from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq")
Or via OpenAI-compatible server:
vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq \
--omni --port 8000
Compute requirement: sm_80+ (Ampere or newer) for the FP4 Marlin GEMM path. On sm_120 (RTX 5090, Pro 6000) and sm_90 (H100/H200) the dequant-then-BF16 matmul kernel handles W4A16 correctly.
Note for vLLM ≤ v0.21.x users: The MoE FP4 Marlin selection that makes this checkpoint work needs vLLM PR #42566. vllm-omni's upcoming PR adds a
patch.pybackport that self-extinguishes once vLLM ≥0.22 ships.
Calibration recipe
- Base model:
Qwen/Qwen3-Omni-30B-A3B-Instructinbfloat16 - ModelOpt:
nvidia-modelopt[torch] >= 0.42(tested with 0.45.0.dev) - Config:
mtq.W4A16_NVFP4_CFG+ algorithmawq_clip - Samples: 1024 (wikitext-103-raw-v1 + openai_humaneval mix; truncated to 512 tokens)
- Excluded patterns:
*audio_tower*,*visual*,*talker*,*code2wav*,*lm_head*,*mlp.gate*
The exported hf_quant_config.json enumerates *mlp.gate* and *lm_head* in exclude_modules (added post-export because ModelOpt's exporter silently drops these from the wildcard exclusions when calibration was done correctly — see the vllm-omni PR for the upstream-tracking patch).
Variant: -max
A sibling checkpoint calibrated with ModelOpt's simpler max algorithm:
YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max.
On the same benchmarks: MMLU 0.74, GSM8K 0.92, Daily-Omni 0.74 — slightly different accuracy/cost tradeoff (faster to calibrate, no AWQ pass).
License
Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the base model page for additional usage terms.
W4A4 NVFP4 siblings
Same model, broader quantization surface (weights and activations to FP4) for Blackwell-only deployments:
YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip— production W4A4, AWQ-clip text calibration, B200 throughput +33% vs BF16 at conc=32. Requiresvllm-omniwith the load-time NaN clamp from vllm-project/vllm-omni#4025 (defensive override, no-op for clean checkpoints).
- Downloads last month
- 264
Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq
Base model
Qwen/Qwen3-Omni-30B-A3B-Instruct