Qwen3-TTS 12Hz 0.6B Base — ONNX
ONNX export of Qwen/Qwen3-TTS-12Hz-0.6B-Base for local inference with ONNX Runtime. Includes an ECAPA-TDNN speaker encoder for voice cloning from ~3 seconds of reference audio. This is an unofficial community mirror of the ONNX export; it is not a newly trained model. The Qwen team (Alibaba Cloud) is the original author.
Source
| Field | Value |
|---|---|
| Upstream model | Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| Upstream source revision | 5d83992436eae1d760afd27aff78a71d676296fc |
| Packaging source revision | 17a2fccf89a5391005f9ff163b07e13f7814dddf |
| Export tool/script | ONNX export from upstream Qwen3-TTS PyTorch weights (community packaging) |
| Quantization recipe | See onnx/ filenames for FP32/FP16/quant variants shipped in this repo |
Files
| File | Description | Size |
|---|---|---|
| speaker_encoder.onnx + .data | ECAPA-TDNN speaker encoder | ~34 MB |
| talker_prefill.onnx + .data | Talker LM prefill (28 layers) | ~1.7 GB |
| talker_decode.onnx + .data | Talker LM single-step decode | ~1.7 GB |
| code_predictor.onnx | Code Predictor (5 layers, 15 groups) | ~440 MB |
| vocoder.onnx | Vocoder decoder (24kHz output) | ~2.7 MB |
| embeddings/ | Text/codec embeddings as .npy + config | ~1.4 GB |
| tokenizer/ | BPE tokenizer (vocab.json, merges.txt) | ~4 MB |
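Before loading anything, it can help to confirm that a local download is complete and to budget disk space. The helper below is a minimal sketch, not part of this repo. It assumes all entries from the table sit directly under one directory, and because the exact names of the `.data` companions are not spelled out here, it accepts any file that starts with the graph's stem and ends in `.data`. The function names (`missing_entries`, `approx_total_gb`) are hypothetical.

```python
from pathlib import Path

# Approximate sizes in MB, copied from the Files table (1 GB taken as 1000 MB).
# For graphs with external data, the figure covers the .onnx file plus its .data file.
APPROX_SIZE_MB = {
    "speaker_encoder.onnx": 34,
    "talker_prefill.onnx": 1700,
    "talker_decode.onnx": 1700,
    "code_predictor.onnx": 440,
    "vocoder.onnx": 2.7,
    "embeddings": 1400,
    "tokenizer": 4,
}

# Graphs that the table lists together with a ".data" companion.
NEEDS_EXTERNAL_DATA = ("speaker_encoder", "talker_prefill", "talker_decode")

# Tokenizer files named in the table.
TOKENIZER_FILES = ("tokenizer/vocab.json", "tokenizer/merges.txt")


def missing_entries(root):
    """Return the expected files/directories that are absent under `root`."""
    root = Path(root)
    missing = [name for name in APPROX_SIZE_MB if not (root / name).exists()]
    for stem in NEEDS_EXTERNAL_DATA:
        # Assumption: the companion is named "<stem>...data"; the exact name is not documented here.
        if not any(root.glob(f"{stem}*.data")):
            missing.append(f"{stem}*.data")
    missing.extend(name for name in TOKENIZER_FILES if not (root / name).exists())
    return missing


def approx_total_gb():
    """Sum the approximate sizes from the table, in GB."""
    return sum(APPROX_SIZE_MB.values()) / 1000
```

With the sizes listed above, the full set comes to roughly 5.3 GB, most of it in the two Talker graphs and the embeddings. ONNX Runtime resolves external data relative to the `.onnx` file, so each `.data` file should stay next to its graph.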
Architecture
- Speaker Encoder: ECAPA-TDNN, 128 mel bins input, 1024-dim speaker embedding output
- Talker: 28 transformer layers, 16 attn heads, 8 KV heads, hidden=1024
- Code Predictor: 5 layers, generates codebook groups 1-15 from Talker output
- Vocoder: RVQ dequantize -> transformer -> BigVGAN decoder, 12Hz codec -> 24kHz audio
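The figures above imply some simple arithmetic that is useful when sizing a decode loop. The sketch below is illustrative only. It treats the "12Hz" in the model name as a nominal frame rate (the exact rate should be confirmed against the upstream configuration), and it assumes 16 codebook groups per frame: group 0 from the Talker plus groups 1-15 from the Code Predictor. The head mapping assumes the usual contiguous grouping for grouped-query attention; the function names are hypothetical.

```python
def codec_budget(seconds, frame_rate_hz=12.0, sample_rate_hz=24_000, codebook_groups=16):
    """Estimate frames, codes, and samples for `seconds` of audio.

    frame_rate_hz is the nominal codec rate; codebook_groups assumes
    group 0 (Talker) plus groups 1-15 (Code Predictor).
    """
    frames = round(seconds * frame_rate_hz)
    return {
        "frames": frames,                      # roughly one Talker step per frame
        "codes": frames * codebook_groups,     # total codec tokens fed to the vocoder
        "samples_per_frame": sample_rate_hz / frame_rate_hz,
        "audio_samples": round(seconds * sample_rate_hz),
    }


def kv_head_for_query_head(query_head, num_query_heads=16, num_kv_heads=8):
    """Map a query head to the KV head it shares under contiguous grouping."""
    group_size = num_query_heads // num_kv_heads  # 16 / 8 = 2 query heads per KV head
    return query_head // group_size


budget = codec_budget(10)
print(budget["frames"], budget["codes"], budget["audio_samples"])
```

At the nominal rate, ten seconds of speech corresponds to about 120 frames, 1,920 codes, and 240,000 output samples, with each frame expanding to 2,000 samples in the vocoder.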
Intended Use
Multilingual text-to-speech for local inference via ONNX Runtime. The Base variant
synthesizes speech conditioned on a speaker embedding extracted from a short reference
clip, enabling voice-consistent synthesis. The 1.7B Base variant is in
tonythethompson/Qwen3-TTS-12Hz-1.7B-Base-ONNX;
predefined-speaker variants are in the CustomVoice repos.
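To make the flow between the components listed under Architecture concrete, here is a control-flow sketch of how a frame-by-frame generation loop could be organized. It is an assumption-heavy illustration, not the export's actual interface: every name (`synthesize_codes`, `prefill`, `decode_step`, `predict_rest`, `eos_code`, `max_frames`) is hypothetical, the callables hide all tensors (text and speaker conditioning, hidden states, KV cache), and stopping on an end-of-sequence code is assumed rather than documented here.

```python
from typing import Callable, List, Sequence

Frame = List[int]  # one code per codebook group for a single codec frame


def synthesize_codes(
    prefill: Callable[[], int],
    decode_step: Callable[[Frame], int],
    predict_rest: Callable[[int], Sequence[int]],
    eos_code: int,
    max_frames: int,
) -> List[Frame]:
    """Collect codec frames until an end code appears or max_frames is reached.

    prefill      -- stands in for the Talker prefill pass (text + speaker embedding),
                    returning the group-0 code of the first frame
    decode_step  -- stands in for one Talker decode step, given the previous frame
    predict_rest -- stands in for the Code Predictor, returning groups 1-15
    """
    frames: List[Frame] = []
    first = prefill()
    while first != eos_code and len(frames) < max_frames:
        frame = [first, *predict_rest(first)]  # group 0 followed by groups 1-15
        frames.append(frame)
        first = decode_step(frame)
    return frames
```

The resulting list of frames is what a vocoder stage would turn into audio. In a real integration, each callable would wrap an ONNX Runtime session and carry the state that this sketch leaves out.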
Standalone usage (external project)
The snippet below uses the external ElBruno/QwenTTS C# wrapper and references that project's ONNX repo, not this one. It is included only as a reference for standalone C# use.
dotnet add package ElBruno.QwenTTS.VoiceCloning
using ElBruno.QwenTTS.VoiceCloning.Pipeline;
var cloner = await VoiceClonePipeline.CreateAsync();
await cloner.SynthesizeAsync("Hello world!", "reference.wav", "output.wav", "english");
Runtime Notes
- Designed for ONNX Runtime and compatible runtimes.
- Output sample rate: 24 kHz.
- Voice cloning reference: ~3 seconds of reference audio recommended.
- Validate on the target execution provider before production use.
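One way to start validating on a target execution provider is to open each graph with an explicit provider list and inspect what it expects. The sketch below assumes the `onnxruntime` Python package is installed; the helper names and the default preference order are illustrative. Input and output names are read from the model rather than guessed, since they are not documented here.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred providers that this ONNX Runtime build actually offers."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {list(preferred)} available; got {list(available)}")
    return chosen


def describe_session(model_path, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Open a graph and report its active providers and its input/output signatures."""
    import onnxruntime as ort  # imported lazily so pick_providers works without it

    providers = pick_providers(ort.get_available_providers(), preferred)
    session = ort.InferenceSession(str(model_path), providers=providers)
    return {
        "providers": session.get_providers(),
        "inputs": [(arg.name, arg.type, arg.shape) for arg in session.get_inputs()],
        "outputs": [(arg.name, arg.type, arg.shape) for arg in session.get_outputs()],
    }


# Example (requires onnxruntime and a downloaded graph):
# print(describe_session("vocoder.onnx"))
```

Comparing `session.get_providers()` with the requested list shows whether ONNX Runtime silently fell back to CPU, which is worth checking before measuring latency.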
Precision and Packaging
Export tooling, precision, and quantization are recorded in the Source table above. This packaging mirror does not publish independent parity benchmarks; validate on your target execution provider before production use.
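Since no parity numbers are published, a simple way to run your own check is to feed identical inputs to a reference implementation and to the ONNX graph, then compare the outputs numerically. The helper below is a minimal sketch using plain Python sequences; the function name and the choice of metrics are assumptions, not a method from this repo.

```python
import math


def parity_report(reference, candidate):
    """Compare two equal-length sequences of floats from two implementations."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    if not reference:
        raise ValueError("outputs must not be empty")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    signal = sum(r * r for r in reference)
    noise = sum(d * d for d in diffs)
    if noise == 0:
        snr_db = math.inf          # identical outputs
    elif signal == 0:
        snr_db = -math.inf         # reference is silent but the candidate is not
    else:
        snr_db = 10 * math.log10(signal / noise)
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "snr_db": snr_db,
    }
```

Because generation is autoregressive, end-to-end waveforms are only comparable when decoding is deterministic on both sides. Comparing one component at a time (for example, the speaker encoder or the vocoder) on fixed inputs is usually more informative, especially across FP32 and reduced-precision variants.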
Limitations
- Voice cloning quality depends on reference audio quality; noisy or very short clips may degrade results.
- Multilingual capability: verify the upstream model card for supported languages and quality by language.
- No repository-specific audio quality evaluation is documented here.
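Very short reference clips are easy to catch before running the speaker encoder. The sketch below uses Python's standard `wave` module, so it only handles PCM WAV files. The 3-second threshold comes from the recommendation in this document; the function names are hypothetical, and the sample rate and channel count are reported without judgment because the encoder's expected input format is not stated here.

```python
import wave


def wav_info(path):
    """Read duration, sample rate, and channel count from a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "seconds": wav.getnframes() / rate,
            "sample_rate": rate,
            "channels": wav.getnchannels(),
        }


def reference_warnings(path, min_seconds=3.0):
    """Return human-readable warnings about a voice-cloning reference clip."""
    info = wav_info(path)
    warnings = []
    if info["seconds"] < min_seconds:
        warnings.append(
            f"clip is {info['seconds']:.2f} s; about {min_seconds:.0f} s is recommended"
        )
    return warnings
```

Duration is only a first filter. Background noise, clipping, and music cannot be detected this way and still need a listening check.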
Safety and Responsible Use
Qwen3-TTS is a voice synthesis and voice cloning model capable of producing realistic speech closely matching a target speaker.
- Do not use to impersonate real individuals without their explicit consent.
- Do not generate synthetic speech intended to deceive listeners about a speaker's identity.
- Disclose AI-generated audio where listeners would reasonably expect a human voice.
- Users are responsible for compliance with applicable laws governing synthetic media and voice cloning in their jurisdiction.
License
Apache 2.0, the same license as Qwen/Qwen3-TTS-12Hz-0.6B-Base. This packaging repo adds no new license terms.