Qwen3-TTS-12Hz-0.6B-Base-ONNX

Qwen3-TTS 12Hz 0.6B Base — ONNX

ONNX export of Qwen/Qwen3-TTS-12Hz-0.6B-Base for local inference with ONNX Runtime. Includes an ECAPA-TDNN speaker encoder for voice cloning from ~3 seconds of reference audio. This is an unofficial community mirror of the ONNX export; it is not a newly trained model. The Qwen team (Alibaba Cloud) is the original author.

Source

Field	Value
Upstream model	Qwen/Qwen3-TTS-12Hz-0.6B-Base
Upstream source revision	`5d83992436eae1d760afd27aff78a71d676296fc`
Packaging source revision	`17a2fccf89a5391005f9ff163b07e13f7814dddf`
Export tool/script	ONNX export from upstream Qwen3-TTS PyTorch weights (community packaging)
Quantization recipe	See `onnx/` filenames for FP32/FP16/quant variants shipped in this repo

Files

File	Description	Size
`speaker_encoder.onnx` + `.data`	ECAPA-TDNN speaker encoder	~34 MB
`talker_prefill.onnx` + `.data`	Talker LM prefill (28 layers)	~1.7 GB
`talker_decode.onnx` + `.data`	Talker LM single-step decode	~1.7 GB
`code_predictor.onnx`	Code Predictor (5 layers, 15 groups)	~440 MB
`vocoder.onnx`	Vocoder decoder (24 kHz output)	~2.7 MB
`embeddings/`	Text/codec embeddings as .npy + config	~1.4 GB
`tokenizer/`	BPE tokenizer (vocab.json, merges.txt)	~4 MB
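
As a quick sanity check after downloading, the table above can be turned into a small script. This is only a sketch under stated assumptions: the names and approximate sizes are copied from the table, the helper names (`find_missing`, `approx_total_gb`) are hypothetical, and the `.data` sidecar files are not checked because their exact filenames are not spelled out here. The search is recursive because the files may sit in a subfolder such as `onnx/`.

```python
from pathlib import Path

# Names and approximate sizes in MB, copied from the Files table.
# Sizes for "+ .data" entries cover the graph and its sidecar together.
EXPECTED = {
    "speaker_encoder.onnx": 34,
    "talker_prefill.onnx": 1700,
    "talker_decode.onnx": 1700,
    "code_predictor.onnx": 440,
    "vocoder.onnx": 2.7,
    "embeddings": 1400,
    "tokenizer": 4,
}


def find_missing(root):
    """Return the expected file/folder names not found anywhere under root."""
    present = {path.name for path in Path(root).rglob("*")}
    return [name for name in EXPECTED if name not in present]


def approx_total_gb():
    """Rough total download size, treating 1 GB as 1000 MB."""
    return sum(EXPECTED.values()) / 1000


if __name__ == "__main__":
    print("missing:", find_missing("."))
    print(f"approximate total: {approx_total_gb():.2f} GB")
```

Summing the listed sizes gives roughly 5.3 GB of disk space for the full set of files.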

Architecture

Speaker Encoder: ECAPA-TDNN, 128 mel bins input, 1024-dim speaker embedding output
Talker: 28 transformer layers, 16 attn heads, 8 KV heads, hidden=1024
Code Predictor: 5 layers, generates codebook groups 1-15 from Talker output
Vocoder: RVQ dequantize -> transformer -> BigVGAN decoder, 12 Hz codec -> 24 kHz audio
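
The figures above imply some simple arithmetic that is useful when sizing buffers or estimating generation length. The sketch below rests on two assumptions that are not confirmed here: the frame rate is taken literally from the "12Hz" label (the upstream config is authoritative and the label may be rounded), and the Talker is assumed to emit codebook group 0 so that each frame carries 16 groups in total. The function name `codec_budget` is hypothetical.

```python
import math

SAMPLE_RATE_HZ = 24_000     # vocoder output rate stated above
NOMINAL_FRAME_RATE_HZ = 12  # assumption: read off the "12Hz" label
CODEBOOK_GROUPS = 16        # assumption: group 0 (Talker) + groups 1-15 (Code Predictor)
ATTN_HEADS = 16
KV_HEADS = 8


def codec_budget(duration_s, frame_rate_hz=NOMINAL_FRAME_RATE_HZ):
    """Estimate codec frames and codes needed for duration_s seconds of audio."""
    frames = math.ceil(duration_s * frame_rate_hz)
    return {
        "frames": frames,
        "codes": frames * CODEBOOK_GROUPS,
        "samples_per_frame": SAMPLE_RATE_HZ / frame_rate_hz,
        # Grouped-query attention: query heads that share one KV head.
        "query_heads_per_kv_head": ATTN_HEADS // KV_HEADS,
    }


if __name__ == "__main__":
    print(codec_budget(10))
```

Under these assumptions, ten seconds of speech corresponds to 120 Talker steps, and each codec frame expands to 2000 output samples.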

Intended Use

Multilingual text-to-speech for local inference via ONNX Runtime. The Base variant synthesizes speech conditioned on a speaker embedding extracted from a short reference clip, enabling voice-consistent synthesis. The 1.7B Base variant is in tonythethompson/Qwen3-TTS-12Hz-1.7B-Base-ONNX; predefined-speaker variants are in the CustomVoice repos.

Standalone usage (external project)

The snippet below uses the external ElBruno/QwenTTS C# wrapper and references that project's ONNX repo, not this one. It is included only as a reference for standalone C# use.

dotnet add package ElBruno.QwenTTS.VoiceCloning

using ElBruno.QwenTTS.VoiceCloning.Pipeline;

var cloner = await VoiceClonePipeline.CreateAsync();
await cloner.SynthesizeAsync("Hello world!", "reference.wav", "output.wav", "english");

Runtime Notes

Designed for ONNX Runtime and compatible runtimes.
Output sample rate: 24 kHz.
Voice cloning reference: ~3 seconds of reference audio recommended.
Validate on the target execution provider before production use.
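
One way to approach the execution-provider check is to choose providers explicitly and confirm which ones the session actually uses. The following is a minimal sketch, not part of this repo: the preference order and the helper names (`pick_providers`, `open_session`) are illustrative, and `onnxruntime` is imported lazily so the selection logic can be exercised without it installed.

```python
# Illustrative preference order; adjust for the target hardware.
PREFERRED = (
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that are available, in preference order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {preferred} found in {list(available)}")
    return chosen


def open_session(model_path, preferred=PREFERRED):
    """Create a session and report the providers it ended up with."""
    import onnxruntime as ort  # imported here so pick_providers works without it

    providers = pick_providers(ort.get_available_providers(), preferred)
    session = ort.InferenceSession(str(model_path), providers=providers)
    return session, session.get_providers()
```

Comparing the providers returned by `open_session` with the ones requested shows whether the runtime silently fell back to CPU for a given model file.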

Precision and Packaging

Export tooling, precision, and quantization are recorded in the Source table above. This packaging mirror does not publish independent parity benchmarks; validate on your target execution provider before production use.

Limitations

Voice cloning quality depends on reference audio quality; noisy or very short clips may degrade results.
Multilingual capability: consult the upstream model card for the supported languages and per-language quality.
No repository-specific audio quality evaluation is documented here.
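
Because very short reference clips may degrade cloning quality, it can help to check a clip's length before passing it to the speaker encoder. The sketch below uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The 3-second threshold mirrors the "~3 seconds" guidance and is an illustrative cutoff, not a documented limit, and the function names are hypothetical.

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # illustrative threshold based on the ~3 s guidance


def reference_duration_seconds(wav_path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(wav_path, minimum=MIN_REFERENCE_SECONDS):
    """True if the clip is at least `minimum` seconds long."""
    return reference_duration_seconds(wav_path) >= minimum
```

A duration check says nothing about noise or recording quality, so it is only a first filter.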

Safety and Responsible Use

Qwen3-TTS is a voice synthesis and voice cloning model capable of producing realistic speech closely matching a target speaker.

Do not use to impersonate real individuals without their explicit consent.
Do not generate synthetic speech intended to deceive listeners about a speaker's identity.
Disclose AI-generated audio where listeners would reasonably expect a human voice.
Users are responsible for compliance with applicable laws governing synthetic media and voice cloning in their jurisdiction.

License

Apache 2.0 — same as Qwen/Qwen3-TTS-12Hz-0.6B-Base. This packaging repo adds no new license terms.
