Instructions to use 88plug/Qwen3-Omni-30B-W4A16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use 88plug/Qwen3-Omni-30B-W4A16 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="88plug/Qwen3-Omni-30B-W4A16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("88plug/Qwen3-Omni-30B-W4A16")
model = AutoModelForMultimodalLM.from_pretrained("88plug/Qwen3-Omni-30B-W4A16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

vLLM

How to use 88plug/Qwen3-Omni-30B-W4A16 with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "88plug/Qwen3-Omni-30B-W4A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "88plug/Qwen3-Omni-30B-W4A16",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

SGLang

How to use 88plug/Qwen3-Omni-30B-W4A16 with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "88plug/Qwen3-Omni-30B-W4A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "88plug/Qwen3-Omni-30B-W4A16",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "88plug/Qwen3-Omni-30B-W4A16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "88plug/Qwen3-Omni-30B-W4A16",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Docker Model Runner
How to use 88plug/Qwen3-Omni-30B-W4A16 with Docker Model Runner:
```
docker model run hf.co/88plug/Qwen3-Omni-30B-W4A16
```

Qwen3-Omni-30B-A3B-W4A16

INT4 post-training quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct — the 30B omni model with audio, vision, and speech generation. ~22 GB on disk. Runs on a single RTX 4090 or A6000.

Attention-only quantization via AutoRound W4A16 G32. MoE expert weights stay BF16 — no quality sacrifice on the sparse expert path.

At a Glance

| Property | Value |
| --- | --- |
| Base model | `Qwen/Qwen3-Omni-30B-A3B-Instruct` |
| Architecture | Sparse MoE + Whisper audio + ViT vision + speech decoder |
| Quant method | AutoRound W4A16, group size 32 |
| Quant format | compressed-tensors (native vLLM) |
| Quantized | 48 attention layers (q/k/v/o_proj), 192 tensors total |
| Kept BF16 | MoE experts, audio_tower, visual, talker, code2wav |
| Disk size | ~22 GB |
| Min GPU | 1× RTX 4090 24 GB or A6000 48 GB |

Why attention-only?

Qwen3-Omni's MoE expert weights are sparse by construction — aggressively compressing them trades quality for minimal size gain. The 192 attention projections (48 layers × q/k/v/o_proj) are the quality-critical path: they participate in every token of every forward pass. Quantizing only those at W4 achieves the large memory reduction while leaving the routed and shared experts untouched at BF16.

No full-W4 compressed-tensors release of this model exists yet; this attention-only checkpoint is the first W4A16 release in that format.

Memory Requirements

| Configuration | BF16 | W4A16 (this model) |
| --- | --- | --- |
| Weights on disk | ~60 GB | ~22 GB |
| VRAM at batch=1, 32k ctx | ~66 GB | ~23 GB |
| Min GPU | 2× A100 40 GB | 1× RTX 4090 24 GB |

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.
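The auto-detection described above can be checked by hand before serving. The sketch below assumes the checkpoint's config.json carries a `quantization_config` object with a `quant_method` key, which is the usual layout for compressed-tensors checkpoints. The function names and the sample dictionary are illustrative and are not copied from this repository.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quant_method(config: dict) -> Optional[str]:
    """Return the declared quantization method, or None if the config has none."""
    quant_cfg = config.get("quantization_config")
    if not isinstance(quant_cfg, dict):
        return None
    return quant_cfg.get("quant_method")


def detect_from_file(config_path: str) -> Optional[str]:
    """Read a local config.json and report its quantization method."""
    return detect_quant_method(json.loads(Path(config_path).read_text()))


# Illustrative config fragment (assumed shape, not copied from the repo).
sample_config = {"quantization_config": {"quant_method": "compressed-tensors"}}
print(detect_quant_method(sample_config))
```

Calling `detect_from_file` on a downloaded config.json gives a quick sanity check that the file declares a quantization method at all, which is the condition under which no `--quantization` flag is needed.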

vLLM — text output only

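```
# vLLM listens on port 8000 by default, so --port must match the 8080 mapping.
# If the image entrypoint already runs `vllm serve`, omit those two words.
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3-Omni-30B-W4A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
```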

Mainline vLLM returns text only. Audio input works; speech output does not.
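Since audio input is supported, a request can carry an audio clip alongside a text question. The following is a minimal sketch that only builds the request body. It assumes the server accepts OpenAI-style `input_audio` content parts; check this against your vLLM version. The helper name and the placeholder bytes are hypothetical.

```python
import base64
import json


def build_audio_chat_request(model: str, audio_bytes: bytes, question: str,
                             audio_format: str = "wav") -> dict:
    """Build a chat-completions body with one audio clip and a text question."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumed content-part shape: OpenAI-style "input_audio".
                    {"type": "input_audio",
                     "input_audio": {"data": encoded, "format": audio_format}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


# Placeholder bytes stand in for a real WAV file read with open(path, "rb").
body = build_audio_chat_request(
    "88plug/Qwen3-Omni-30B-W4A16", b"RIFF....WAVE", "What is said in this clip?"
)
print(json.dumps(body, indent=2)[:200])
```

The resulting dictionary can be sent as JSON to the server's `/v1/chat/completions` endpoint on whichever port the container exposes. The response is text only under mainline vLLM.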

vLLM-Omni — full audio output

vLLM-Omni enables real-time speech output. Required if you need the model to speak.

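```
# Replace vllm-omni-image with the vLLM-Omni image you built or pulled.
# --port is set so the server listens on the mapped port 8080.
docker run --gpus device=0 -p 8080:8080 \
  vllm-omni-image vllm serve \
  88plug/Qwen3-Omni-30B-W4A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
```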

Recommended Sampling Parameters

| Mode | Temperature | Top-P | Top-K | Min-P | Use When |
| --- | --- | --- | --- | --- | --- |
| Thinking (default) | 0.6 | 0.95 | 20 | 0.0 | Reasoning, math, code |
| Non-thinking | 0.7 | 0.8 | 20 | 0.0 | Chat, creative, fast response |

Enable/disable thinking via chat_template_kwargs={"enable_thinking": True/False}. Default is thinking-enabled.
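One way to keep the sampling values and the thinking switch in sync is to select both from a single flag. The sketch below builds a chat-completions request body from the two presets in the table. It assumes the serving endpoint accepts `top_k`, `min_p`, and `chat_template_kwargs` as top-level request fields, as vLLM's OpenAI-compatible server generally does. The function and constant names are illustrative.

```python
import json

# Presets copied from the table above.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def build_chat_request(model: str, prompt: str, thinking: bool = True) -> dict:
    """Combine a sampling preset with the matching enable_thinking switch."""
    preset = SAMPLING_PRESETS["thinking" if thinking else "non_thinking"]
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **preset,
        # Assumption: the server forwards this field to the chat template.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


request_body = build_chat_request(
    "88plug/Qwen3-Omni-30B-W4A16", "Summarize the plot of Hamlet.", thinking=False
)
print(json.dumps(request_body, indent=2))
```

With the OpenAI Python client, the non-standard fields would typically go through `extra_body` instead of being passed as named arguments.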

Quantization Recipe

| Parameter | Value |
| --- | --- |
| Method | AutoRound |
| Scheme | W4A16 |
| Group size | 32 |
| Targets | `re:.*self_attn\.(q\|k\|v\|o)_proj$` (192 modules) |
| Ignored | `lm_head`, `embed_tokens`, `norm`, `audio_tower`, `visual`, `talker`, `code2wav` |
| Calibration data | 75% UltraChat-200k + 25% WikiText-103 |
| Calibration samples | 1024 × 2048 tokens |
| Iterations | 200 |
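To see what the Targets pattern selects, it can be run against a list of module names. The sketch below uses Python's `re` module with the pattern from the recipe (without the `re:` prefix and without the table's pipe escaping). The module names are hypothetical and only follow a plausible naming scheme; the real checkpoint may name its modules differently.

```python
import re

# Same pattern as the Targets row, minus the "re:" prefix and table escaping.
TARGET_PATTERN = re.compile(r".*self_attn\.(q|k|v|o)_proj$")

NUM_LAYERS = 48  # attention layers, per the tables in this card


def is_quantized(module_name: str) -> bool:
    """True when a module name is selected by the Targets pattern."""
    return TARGET_PATTERN.match(module_name) is not None


# Hypothetical module names, for illustration only.
attention_modules = [
    f"thinker.model.layers.{layer}.self_attn.{proj}_proj"
    for layer in range(NUM_LAYERS)
    for proj in ("q", "k", "v", "o")
]
other_modules = [
    "thinker.model.layers.0.mlp.experts.0.gate_proj",
    "thinker.model.layers.0.mlp.gate",
    "thinker.lm_head",
]

print(sum(is_quantized(name) for name in attention_modules))  # 48 * 4 = 192
print([name for name in other_modules if is_quantized(name)])  # []
```

The pattern by itself would also match any `self_attn` projection inside other sub-models, which is presumably why the Ignored row names `audio_tower`, `visual`, `talker`, and `code2wav` explicitly.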

What's Quantized, What's Not

| Component | Precision | Reason |
| --- | --- | --- |
| Attention q/k/v/o_proj (all 48 layers) | W4A16 INT4 | Quantized: quality-critical path |
| MoE experts (routed + shared) | BF16 | Sparse weights, kept intact for quality |
| `model.thinker.audio_tower.*` | BF16 | Whisper encoder, excluded |
| `model.thinker.visual.*` | BF16 | ViT, excluded |
| `model.talker.*` | BF16 | Speech decoder, excluded |
| `model.code2wav.*` | BF16 | Waveform codec, excluded |
| Embeddings, LM head, norms | BF16 | Standard practice |

Comparison to Other W4 Releases

| Model | Method | Coverage | Speech output | Format |
| --- | --- | --- | --- | --- |
| 88plug/Qwen3-Omni-30B-W4A16 (this) | AutoRound W4A16 G32 | Attention-only (MoE BF16) | Yes (vLLM-Omni) | compressed-tensors |
| Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound | AutoRound W4 | Full model, all Linear | Yes | compressed-tensors |
| cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit | AWQ | Full model | Yes | AWQ |
| ggml-org/Qwen3-Omni-30B-A3B-Instruct-GGUF | GGUF | Full model | No (mainline llama.cpp) | GGUF |

Key differentiator vs Intel's AutoRound release: Intel quantizes all Linear layers uniformly to W4, including MoE experts. This release leaves MoE experts at BF16, preserving expert routing quality. The tradeoff is slightly larger size in exchange for higher output fidelity on complex reasoning and long-context tasks.

This is the only W4A16 compressed-tensors release for this model.

SGLang

SGLang compressed-tensors support is under active development. For baseline throughput comparisons, run the BF16 base model:

```
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --tp 2 \
  --port 30000
```

Verify compressed-tensors support against your SGLang version before production use. Speech output (talker/code2wav) requires vLLM-Omni regardless.

llama.cpp / GGUF

For CPU inference or Apple Silicon, use the GGUF variant from ggml-org or unsloth:

```
# llama.cpp (text-only; speech output not supported in mainline llama.cpp)
# -hf downloads the GGUF from Hugging Face; the :Q4_K_M tag selects the quant file.
llama-cli -hf ggml-org/Qwen3-Omni-30B-A3B-Instruct-GGUF:Q4_K_M \
  --chat-template chatml
```

Note: GGUF runs on CPU and Apple Silicon without vLLM. Speech synthesis (talker + code2wav pipeline) is not available in mainline llama.cpp — use vLLM-Omni for full audio output. This compressed-tensors W4A16 checkpoint is optimized for GPU serving with vLLM.

Limitations

- Mainline vLLM (text-only): Audio input is supported; speech output requires vLLM-Omni. The talker and code2wav components are BF16 and inactive in standard vLLM serve.
- MoE expert quality: Expert weights remain at BF16; this quant targets attention projections only. Full-W4 releases (e.g., Intel's) quantize experts too, trading quality for additional size reduction.
- Audio/vision tower untouched: audio_tower, visual, talker, and code2wav are fully BF16. Quantization affects the LLM thinker backbone only, so the audio/vision pipelines are identical to the BF16 base model.
- Group size 32: Finer-grained than standard G128, improving quality on attention heads at modest overhead.
- vLLM ≥ v0.21.0 required: Older vLLM versions do not support compressed-tensors natively.
- Context length: Tested up to 32k. Qwen3-Omni supports up to 128k; longer contexts require proportionally more KV cache VRAM.
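The proportional growth of KV cache with context length can be made concrete with a rough estimate. In the sketch below, only the layer count (48) comes from this card. The KV head count and head dimension are assumed placeholder values, not confirmed figures for this model, and one byte per element corresponds to the `fp8` KV cache setting used in the Quick Start commands.

```python
def kv_cache_bytes(seq_len: int, num_layers: int = 48, num_kv_heads: int = 4,
                   head_dim: int = 128, bytes_per_element: int = 1,
                   batch_size: int = 1) -> int:
    """Estimate KV cache size: one key and one value vector per token, head, and layer."""
    # num_kv_heads and head_dim defaults are assumptions for illustration.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size


for context in (32_768, 131_072):
    gib = kv_cache_bytes(context) / 2**30
    print(f"{context:>7} tokens -> {gib:.2f} GiB")
```

Whatever the true head configuration is, the estimate is linear in `seq_len` and `batch_size`, so moving from 32k to 128k context multiplies the cache by four.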

Quality Targets

| Metric | Target |
| --- | --- |
| KL divergence vs BF16 | < 0.014 |
| MMLU recovery | ≥ 99% |
| RULER@128k recovery | ≥ 97% |
| ASR WER delta | ≤ +0.5% |

The attention-only approach keeps the audio and voice pipeline entirely at BF16, so ASR and speech generation quality are unaffected by the quantization.
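The card does not spell out how the targets are computed. A common reading, assumed here, is that KL divergence compares the BF16 and quantized next-token distributions at each position, and that recovery is the quantized score as a percentage of the BF16 score. The sketch below implements both under that assumption, with made-up inputs that are not measured results.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over one token position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits: Sequence[float], quant_logits: Sequence[float]) -> float:
    """KL(P_ref || P_quant) in nats for one token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def recovery_percent(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the BF16 baseline score."""
    return 100.0 * quantized_score / baseline_score


# Made-up logits and scores, for illustration only; not measured results.
print(kl_divergence([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))
print(recovery_percent(quantized_score=79.2, baseline_score=80.0))
```

In practice the per-position KL values would be averaged over an evaluation set before being compared against the target.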

Benchmarks

Results pending.

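| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vLLM v0.21.0 | W4A16 | 1 | 32k | pending | pending | pending | pending |
| vLLM v0.21.0 | W4A16 | 8 | 32k | pending | pending | pending | pending |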

Hardware: RTX 4090 24 GB, CUDA 12.9, driver 570.

Citation

```bibtex
@misc{qwen3technicalreport,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct}
}
```

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: Qwen3-Omni-30B-A3B-W8A16 (INT8, ~33 GB) · Qwen3-Omni-30B-A3B-W4A16 (INT4, ~22 GB)

Browse all releases → huggingface.co/88plug

Downloads last month: 193

Safetensors: model size 35B params, tensor type BF16

Model tree for 88plug/Qwen3-Omni-30B-W4A16: base model Qwen/Qwen3-Omni-30B-A3B-Instruct; quantized (24), including this model