Qwen3-Omni-30B-A3B-W4A16
INT4 post-training quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct — the 30B omni model with audio, vision, and speech generation. ~22 GB on disk. Runs on a single RTX 4090 or A6000.
Attention-only quantization via AutoRound W4A16 G32. MoE expert weights stay BF16 — no quality sacrifice on the sparse expert path.
At a Glance
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Omni-30B-A3B-Instruct |
| Architecture | Sparse MoE + Whisper audio + ViT vision + speech decoder |
| Quant method | AutoRound W4A16, group size 32 |
| Quant format | compressed-tensors (native vLLM) |
| Quantized | 48 attention layers (q/k/v/o_proj) — 192 tensors total |
| Kept BF16 | MoE experts, audio_tower, visual, talker, code2wav |
| Disk size | ~22 GB |
| Min GPU | 1× RTX 4090 24GB or A6000 48GB |
Why attention-only?
Qwen3-Omni's MoE experts are sparsely activated by construction, so compressing them aggressively trades quality for minimal size gain. The 192 attention projections (48 layers × q/k/v/o_proj) are the quality-critical path: they participate in every token of every forward pass. Quantizing only those to W4 achieves the memory reduction while leaving the routed and shared experts untouched at BF16.
This is also why this release is attention-only rather than a full-W4 compressed-tensors quantization.
Memory Requirements
| Configuration | BF16 | W4A16 (this model) |
|---|---|---|
| Weights on disk | ~60 GB | ~22 GB |
| VRAM at batch=1, 32k ctx | ~66 GB | ~23 GB |
| Min GPU | 2× A100 40GB | 1× RTX 4090 24GB |
Quick Start
Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.
vLLM — text output only
docker run --gpus device=0 -p 8080:8080 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/Qwen3-Omni-30B-W4A16 \
--port 8080 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92
Mainline vLLM returns text only. Audio input works; speech output does not.
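Once the server is up, it can be queried through the OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes the server is reachable at http://localhost:8080 (the host port published by the docker command above). The helper names build_chat_request and chat are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "88plug/Qwen3-Omni-30B-W4A16"
# Assumption: host port 8080, as published by the docker command above.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat completion payload (text in, text out)."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, with the server running:
#   print(chat("What is the capital of France?"))
```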
vLLM-Omni — full audio output
vLLM-Omni enables real-time speech output. Required if you need the model to speak.
docker run --gpus device=0 -p 8080:8080 \
vllm-omni-image vllm serve \
88plug/Qwen3-Omni-30B-W4A16 \
--port 8080 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92
Recommended Sampling Parameters
| Mode | Temperature | Top-P | Top-K | Min-P | Use When |
|---|---|---|---|---|---|
| Thinking (default) | 0.6 | 0.95 | 20 | 0.0 | Reasoning, math, code |
| Non-thinking | 0.7 | 0.8 | 20 | 0.0 | Chat, creative, fast response |
Enable/disable thinking via chat_template_kwargs={"enable_thinking": True/False}. Default is thinking-enabled.
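The sampling table and the enable_thinking switch can be combined into one request body. This is a minimal sketch, assuming the vLLM OpenAI-compatible chat endpoint, which accepts top_k, min_p, and chat_template_kwargs as extra request fields. The SAMPLING dictionary and the build_request helper are illustrative names, not part of any library.

```python
MODEL_ID = "88plug/Qwen3-Omni-30B-W4A16"

# Values copied from the "Recommended Sampling Parameters" table.
SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def build_request(prompt: str, thinking: bool = True) -> dict:
    """Return a chat completion body with the sampling preset for the chosen mode."""
    mode = "thinking" if thinking else "non_thinking"
    body = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        # Forwarded to the chat template; toggles thinking mode.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
    body.update(SAMPLING[mode])
    return body


fast = build_request("Write a haiku about GPUs.", thinking=False)
print(fast["temperature"], fast["top_p"], fast["chat_template_kwargs"])
```

When using the openai Python client instead of raw JSON, the non-standard fields (top_k, min_p, chat_template_kwargs) would typically be passed through its extra_body argument.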
Quantization Recipe
| Parameter | Value |
|---|---|
| Method | AutoRound |
| Scheme | W4A16 |
| Group size | 32 |
| Targets | re:.*self_attn\.(q\|k\|v\|o)_proj$ (192 modules) |
| Ignored | lm_head, embed_tokens, norm, audio_tower, visual, talker, code2wav |
| Calibration data | 75% UltraChat-200k + 25% WikiText-103 |
| Calibration samples | 1024 × 2048 tokens |
| Iterations | 200 |
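To see what the target pattern selects, it can be applied to a list of module names. The names below are hypothetical and only mimic the component prefixes mentioned in this card; real names would come from model.named_modules(). The leading re: is assumed here to be a marker that flags the target as a regular expression, so it is dropped before matching. The ignore check is a simplification that compares dotted path components against the Ignored row.

```python
import re

# Pattern from the Targets row, without the "re:" marker.
TARGET = re.compile(r".*self_attn\.(q|k|v|o)_proj$")
# Components from the Ignored row.
IGNORED = {"lm_head", "embed_tokens", "norm", "audio_tower", "visual", "talker", "code2wav"}


def is_quantized(module_name: str) -> bool:
    """True if the module matches the target regex and is not under an ignored component."""
    if IGNORED.intersection(module_name.split(".")):
        return False
    return TARGET.match(module_name) is not None


# Hypothetical module names: 48 decoder layers with four attention projections each.
attention = [
    f"thinker.model.layers.{layer}.self_attn.{proj}_proj"
    for layer in range(48)
    for proj in "qkvo"
]
others = [
    "thinker.model.layers.0.mlp.experts.3.down_proj",  # MoE expert, not targeted
    "thinker.audio_tower.layers.0.self_attn.q_proj",   # ignored component
    "talker.model.layers.0.self_attn.q_proj",          # ignored component
    "thinker.lm_head",                                 # ignored component
]

selected = [name for name in attention + others if is_quantized(name)]
print(len(selected))  # 192
```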
What's Quantized, What's Not
| Component | Precision | Reason |
|---|---|---|
| Attention q/k/v/o_proj (all 48 layers) | W4A16 INT4 | Quantized — quality-critical path |
| MoE experts (routed + shared) | BF16 | Sparse weights — kept intact for quality |
| model.thinker.audio_tower.* | BF16 | Whisper encoder (excluded) |
| model.thinker.visual.* | BF16 | ViT (excluded) |
| model.talker.* | BF16 | Speech decoder (excluded) |
| model.code2wav.* | BF16 | Waveform codec (excluded) |
| Embeddings, LM head, norms | BF16 | Standard practice |
Comparison to Other W4 Releases
| Model | Method | Coverage | Speech output | Format |
|---|---|---|---|---|
| 88plug/Qwen3-Omni-30B-W4A16 (this) | AutoRound W4A16 G32 | Attention-only (MoE BF16) | Yes (vLLM-Omni) | compressed-tensors |
| Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound | AutoRound W4 | Full model — all Linear | Yes | compressed-tensors |
| cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit | AWQ | Full model | Yes | AWQ |
| ggml-org/Qwen3-Omni-30B-A3B-Instruct-GGUF | GGUF | Full model | No (mainline llama.cpp) | GGUF |
Key differentiator vs Intel's AutoRound release: Intel quantizes all Linear layers uniformly to W4, including MoE experts. This release leaves MoE experts at BF16, preserving expert routing quality. The tradeoff is slightly larger size in exchange for higher output fidelity on complex reasoning and long-context tasks.
This is the only attention-only W4A16 compressed-tensors release in the comparison above.
SGLang
SGLang compressed-tensors support is under active development. For baseline throughput comparisons, run the BF16 base model:
python -m sglang.launch_server \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--tp 2 \
--port 30000
Verify compressed-tensors support in your SGLang version before production use. Speech output (talker/code2wav) requires vLLM-Omni regardless.
llama.cpp / GGUF
For CPU inference or Apple Silicon, use the GGUF variant from ggml-org or unsloth:
# llama.cpp (text-only; speech output not supported in mainline llama.cpp)
# -hf selects the Hugging Face repo, -hff the GGUF file inside it
llama-cli -hf ggml-org/Qwen3-Omni-30B-A3B-Instruct-GGUF \
-hff Qwen3-Omni-30B-A3B-Instruct-Q4_K_M.gguf \
--chat-template chatml
Note: GGUF runs on CPU and Apple Silicon without vLLM. Speech synthesis (talker + code2wav pipeline) is not available in mainline llama.cpp — use vLLM-Omni for full audio output. This compressed-tensors W4A16 checkpoint is optimized for GPU serving with vLLM.
Limitations
- Mainline vLLM (text-only): Audio input is supported; speech output requires vLLM-Omni. The talker and code2wav components are BF16 and inactive in standard vLLM serve.
- MoE expert quality: Expert weights remain at BF16; this quant targets attention projections only. Full-W4 releases (e.g., Intel's) quantize experts too, trading quality for additional size reduction.
- Audio/vision tower untouched: audio_tower, visual, talker, and code2wav are fully BF16. Quantization affects the LLM thinker backbone only, so the audio/vision pipelines are identical to the BF16 base model.
- Group size 32: Finer-grained than standard G128, improving quality on attention heads at modest overhead.
- vLLM ≥ v0.21.0 required: Older vLLM versions do not support compressed-tensors natively.
- Context length: Tested up to 32k. Qwen3-Omni supports up to 128k — longer contexts require proportionally more KV cache VRAM.
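The linear growth of KV cache with context length can be estimated with a short calculation. This is a rough sketch under stated assumptions: 48 layers comes from this card and 1 byte per value corresponds to --kv-cache-dtype fp8, but num_kv_heads=4 and head_dim=128 are assumed values that should be checked against the model's config.json. Only the thinker backbone is counted, for a single sequence.

```python
def kv_cache_bytes(
    tokens: int,
    num_layers: int = 48,      # from this card
    num_kv_heads: int = 4,     # ASSUMPTION, check config.json
    head_dim: int = 128,       # ASSUMPTION, check config.json
    bytes_per_value: int = 1,  # fp8 KV cache
) -> int:
    """Bytes of KV cache for one sequence of the given length."""
    # Factor 2: one key and one value vector per token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * tokens


for context in (32_768, 131_072):
    gib = kv_cache_bytes(context) / 2**30
    print(f"{context} tokens: {gib:.2f} GiB")
# 32768 tokens: 1.50 GiB
# 131072 tokens: 6.00 GiB
```

Under these assumptions, going from 32k to 128k quadruples the per-sequence cache, and a BF16 cache (bytes_per_value=2) would double it again.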
Quality Targets
| Metric | Target |
|---|---|
| KL divergence vs BF16 | < 0.014 |
| MMLU recovery | ≥ 99% |
| RULER@128k recovery | ≥ 97% |
| ASR WER delta | ≤ +0.5% |
The attention-only approach keeps the audio and voice pipeline entirely at BF16, so ASR and speech generation quality are unaffected by the quantization.
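The first two targets can be made concrete with a small calculation. The card does not specify how KL divergence or recovery are computed, so this sketch assumes the usual definitions: KL(P_ref || P_quant) in nats over next-token distributions, averaged across positions, and recovery as the quantized score divided by the BF16 score. The logits and scores are toy values, not measurements.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for one token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kl(ref_batch, quant_batch):
    """Average per-position KL over a sequence of logit vectors."""
    values = [kl_divergence(r, q) for r, q in zip(ref_batch, quant_batch)]
    return sum(values) / len(values)


def recovery(quant_score, ref_score):
    """Benchmark recovery in percent (quantized score relative to BF16)."""
    return 100.0 * quant_score / ref_score


# Toy logits for two positions over a 4-token vocabulary.
ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
quant = [[1.9, 1.1, 0.1, -1.0], [0.4, 0.6, 2.9, 0.1]]
print(f"mean KL: {mean_kl(ref, quant):.5f} nats")
print(f"recovery: {recovery(79.2, 80.0):.1f}%")
```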
Benchmarks
Results pending.
| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|---|---|---|---|---|---|---|
| vLLM v0.21.0 | W4A16 | 1 | 32k | — | — | — | — |
| vLLM v0.21.0 | W4A16 | 8 | 32k | — | — | — | — |
Hardware: RTX 4090 24 GB, CUDA 12.9, driver 570.
Citation
@misc{qwen3technicalreport,
title = {Qwen3 Technical Report},
author = {Qwen Team},
year = {2025},
url = {https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct}
}
About
88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.
W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.
W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.
All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.
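What automatic detection relies on can be illustrated by reading the quantization_config block directly. This is a minimal sketch, not how vLLM implements it: the EXAMPLE_CONFIG string is a hypothetical, heavily trimmed config.json, and detect_quant_method is an illustrative helper name.

```python
import json

# Hypothetical, trimmed config.json contents for illustration only.
EXAMPLE_CONFIG = """
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "ignore": ["lm_head"]
  }
}
"""


def detect_quant_method(config_text: str):
    """Return the declared quant_method, or None if the checkpoint is unquantized."""
    config = json.loads(config_text)
    quant_config = config.get("quantization_config")
    if quant_config is None:
        return None
    return quant_config.get("quant_method")


print(detect_quant_method(EXAMPLE_CONFIG))  # compressed-tensors
```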
Also available: Qwen3-Omni-30B-A3B-W8A16 (INT8, ~33 GB) · Qwen3-Omni-30B-A3B-W4A16 (INT4, ~22 GB)
Browse all releases → huggingface.co/88plug