Qwen3-Omni-30B-A3B-W4A16

INT4 post-training quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct — the 30B omni model with audio, vision, and speech generation. ~22 GB on disk. Runs on a single RTX 4090 or A6000.

Attention-only quantization via AutoRound W4A16 G32. MoE expert weights stay BF16 — no quality sacrifice on the sparse expert path.


At a Glance

Property Value
Base model Qwen/Qwen3-Omni-30B-A3B-Instruct
Architecture Sparse MoE + Whisper audio + ViT vision + speech decoder
Quant method AutoRound W4A16, group size 32
Quant format compressed-tensors (native vLLM)
Quantized 48 attention layers (q/k/v/o_proj) — 192 tensors total
Kept BF16 MoE experts, audio_tower, visual, talker, code2wav
Disk size ~22 GB
Min GPU 1× RTX 4090 24GB or A6000 48GB

Why attention-only?

Qwen3-Omni's MoE expert weights are sparse by construction — aggressively compressing them trades quality for minimal size gain. The 192 attention projections (48 layers × q/k/v/o_proj) are the quality-critical path: they participate in every token of every forward pass. Quantizing only those at W4 achieves the large memory reduction while leaving the routed and shared experts untouched at BF16.

This is also why this model has no full-W4 compressed-tensors release yet — this is the first.


Memory Requirements

Configuration BF16 W4A16 (this model)
Weights on disk ~60 GB ~22 GB
VRAM at batch=1, 32k ctx ~66 GB ~23 GB
Min GPU 2× A100 40GB 1× RTX 4090 24GB

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM — text output only

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3-Omni-30B-W4A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92

Weights are in compressed-tensors format — no --quantization flag needed. Mainline vLLM returns text only. Audio input works; speech output does not.

vLLM-Omni — full audio output

vLLM-Omni enables real-time speech output. Required if you need the model to speak.

docker run --gpus device=0 -p 8080:8080 \
  vllm-omni-image vllm serve \
  88plug/Qwen3-Omni-30B-W4A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92

Recommended Sampling Parameters

Mode Temperature Top-P Top-K Min-P Use When
Thinking (default) 0.6 0.95 20 0.0 Reasoning, math, code
Non-thinking 0.7 0.8 20 0.0 Chat, creative, fast response

Enable/disable thinking via chat_template_kwargs={"enable_thinking": True/False}. Default is thinking-enabled.


Quantization Recipe

Parameter Value
Method AutoRound
Scheme W4A16
Group size 32
Targets re:.*self_attn\.(q|k|v|o)_proj$ (192 modules)
Ignored lm_head, embed_tokens, norm, audio_tower, visual, talker, code2wav
Calibration data 75% UltraChat-200k + 25% WikiText-103
Calibration samples 1024 × 2048 tokens
Iterations 200

What's Quantized, What's Not

Component Precision Reason
Attention q/k/v/o_proj (all 48 layers) W4A16 INT4 Quantized — quality-critical path
MoE experts (routed + shared) BF16 Sparse weights — kept intact for quality
model.thinker.audio_tower.* BF16 Whisper encoder — excluded
model.thinker.visual.* BF16 ViT — excluded
model.talker.* BF16 Speech decoder — excluded
model.code2wav.* BF16 Waveform codec — excluded
Embeddings, LM head, norms BF16 Standard practice

Comparison to Other W4 Releases

Model Method Coverage Speech output Format
88plug/Qwen3-Omni-30B-W4A16 (this) AutoRound W4A16 G32 Attention-only (MoE BF16) Yes (vLLM-Omni) compressed-tensors
Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound AutoRound W4 Full model — all Linear Yes compressed-tensors
cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit AWQ Full model Yes AWQ
ggml-org/Qwen3-Omni-30B-A3B-Instruct-GGUF GGUF Full model No (mainline llama.cpp) GGUF

Key differentiator vs Intel's AutoRound release: Intel quantizes all Linear layers uniformly to W4, including MoE experts. This release leaves MoE experts at BF16, preserving expert routing quality. The tradeoff is slightly larger size in exchange for higher output fidelity on complex reasoning and long-context tasks.

This is the only W4A16 compressed-tensors release for this model.

SGLang

SGLang compressed-tensors support is under active development. For baseline throughput comparisons, run the BF16 base model:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --tp 2 \
  --port 30000

SGLang compressed-tensors support — verify with your SGLang version before production use. Speech output (talker/code2wav) requires vLLM-Omni regardless.


llama.cpp / GGUF

For CPU inference or Apple Silicon, use the GGUF variant from ggml-org or unsloth:

# llama.cpp (text-only; speech output not supported in mainline llama.cpp)
llama-cli -hf ggml-org/Qwen3-Omni-30B-A3B-Instruct-GGUF \
  -m Qwen3-Omni-30B-A3B-Instruct-Q4_K_M.gguf \
  --chat-format chatml

Note: GGUF runs on CPU and Apple Silicon without vLLM. Speech synthesis (talker + code2wav pipeline) is not available in mainline llama.cpp — use vLLM-Omni for full audio output. This compressed-tensors W4A16 checkpoint is optimized for GPU serving with vLLM.


Limitations

  • Mainline vLLM (text-only): Audio input is supported; speech output requires vLLM-Omni. The talker and code2wav components are BF16 and inactive in standard vLLM serve.
  • MoE expert quality: Expert weights remain at BF16 — this quant targets attention projections only. Full-W4 releases (e.g., Intel's) quantize experts too, trading quality for additional size reduction.
  • Audio/vision tower untouched: audio_tower, visual, talker, and code2wav are fully BF16. Quantization affects the LLM thinker backbone only — audio/vision pipelines are identical to the BF16 base model.
  • Group size 32: Finer-grained than standard G128, improving quality on attention heads at modest overhead.
  • vLLM ≥ v0.21.0 required: Older vLLM versions do not support compressed-tensors natively.
  • Context length: Tested up to 32k. Qwen3-Omni supports up to 128k — longer contexts require proportionally more KV cache VRAM.

Quality Targets

Metric Target
KL divergence vs BF16 < 0.014
MMLU recovery ≥ 99%
RULER@128k recovery ≥ 97%
ASR WER delta ≤ +0.5%

The attention-only approach keeps the audio and voice pipeline entirely at BF16, so ASR and speech generation quality are unaffected by the quantization.


Benchmarks

Results pending.

Engine Format Batch ctx tok/s TTFT p50 TTFT p99 VRAM
vLLM v0.21.0 W4A16 1 32k
vLLM v0.21.0 W4A16 8 32k

Hardware: RTX 4090 24 GB, CUDA 12.9, driver 570.


Citation

@misc{qwen3technicalreport,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: Qwen3-Omni-30B-A3B-W8A16 (INT8, ~33 GB) · Qwen3-Omni-30B-A3B-W4A16 (INT4, ~22 GB)

Browse all releases → huggingface.co/88plug

Downloads last month
193
Safetensors
Model size
35B params
Tensor type
BF16
·
I8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 88plug/Qwen3-Omni-30B-W4A16

Quantized
(24)
this model