Qwen3-TTS 12Hz 0.6B Base — ONNX

ONNX export of Qwen/Qwen3-TTS-12Hz-0.6B-Base for local inference with ONNX Runtime. Includes an ECAPA-TDNN speaker encoder for voice cloning from ~3 seconds of reference audio. This is an unofficial community mirror of the ONNX export; it is not a newly trained model. The Qwen team (Alibaba Cloud) is the original author.

Source

Field                      Value
Upstream model             Qwen/Qwen3-TTS-12Hz-0.6B-Base
Upstream source revision   5d83992436eae1d760afd27aff78a71d676296fc
Packaging source revision  17a2fccf89a5391005f9ff163b07e13f7814dddf
Export tool/script         ONNX export from upstream Qwen3-TTS PyTorch weights (community packaging)
Quantization recipe        See onnx/ filenames for FP32/FP16/quant variants shipped in this repo

Files

File                          Description                             Size
speaker_encoder.onnx + .data  ECAPA-TDNN speaker encoder              ~34 MB
talker_prefill.onnx + .data   Talker LM prefill (28 layers)           ~1.7 GB
talker_decode.onnx + .data    Talker LM single-step decode            ~1.7 GB
code_predictor.onnx           Code Predictor (5 layers, 15 groups)    ~440 MB
vocoder.onnx                  Vocoder decoder (24 kHz output)         ~2.7 MB
embeddings/                   Text/codec embeddings as .npy + config  ~1.4 GB
tokenizer/                    BPE tokenizer (vocab.json, merges.txt)  ~4 MB
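
A quick completeness check after downloading can catch a partial copy before any session is created. The sketch below is illustrative only: it assumes every item from the Files table sits directly under one local directory (the Source table mentions an onnx/ folder, so the real layout may differ), and `find_missing` and `EXPECTED_PATHS` are hypothetical names, not part of this repo.

```python
from pathlib import Path

# Names taken from the Files table. The ".data" external-weight sidecars are
# left out because their exact file names are not spelled out in this card.
# ASSUMPTION: a flat layout under model_dir; adjust the paths to the real one.
EXPECTED_PATHS = (
    "speaker_encoder.onnx",
    "talker_prefill.onnx",
    "talker_decode.onnx",
    "code_predictor.onnx",
    "vocoder.onnx",
    "embeddings",
    "tokenizer/vocab.json",
    "tokenizer/merges.txt",
)


def find_missing(model_dir, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under model_dir."""
    root = Path(model_dir)
    return [rel for rel in expected if not (root / rel).exists()]
```

An empty list means every listed path was found. It says nothing about whether the files are complete or uncorrupted.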

Architecture

  • Speaker Encoder: ECAPA-TDNN, 128 mel bins input, 1024-dim speaker embedding output
  • Talker: 28 transformer layers, 16 attn heads, 8 KV heads, hidden=1024
  • Code Predictor: 5 layers, generates codebook groups 1-15 from Talker output
  • Vocoder: RVQ dequantize -> transformer -> BigVGAN decoder, 12 Hz codec -> 24 kHz audio
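
The Talker figures above imply grouped-query attention (16 query heads sharing 8 key/value heads), which determines how fast the decode-step KV cache grows. The sketch below works through that arithmetic. The per-head dimension is not stated in this card, so `head_dim=128` is an assumption to be replaced with the value from the upstream config or the exported graph. `TalkerShape` and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TalkerShape:
    """Talker dimensions as listed in the Architecture section."""
    layers: int = 28
    attn_heads: int = 16
    kv_heads: int = 8
    hidden: int = 1024
    # ASSUMPTION: head_dim is not given in this card. Read the real value from
    # the upstream config or the exported graph's KV-cache input shapes.
    head_dim: int = 128


def queries_per_kv_head(shape: TalkerShape) -> int:
    """Number of query heads that share one key/value head."""
    return shape.attn_heads // shape.kv_heads


def kv_cache_bytes_per_token(shape: TalkerShape, bytes_per_value: int = 4) -> int:
    """Bytes of key + value cache added per position across all layers."""
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_value


# ASSUMPTION: group 0 comes from the Talker; groups 1-15 from the Code Predictor.
CODEBOOK_GROUPS_PER_FRAME = 1 + 15

if __name__ == "__main__":
    shape = TalkerShape()
    print("query heads per KV head:", queries_per_kv_head(shape))
    print("FP32 cache bytes/token:", kv_cache_bytes_per_token(shape))
    print("FP16 cache bytes/token:", kv_cache_bytes_per_token(shape, bytes_per_value=2))
```

Multiplying the per-token figure by the expected sequence length gives a rough cache budget for one stream. Actual memory use depends on the runtime and the exported cache layout.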

Intended Use

Multilingual text-to-speech for local inference via ONNX Runtime. The Base variant synthesizes speech conditioned on a speaker embedding extracted from a short reference clip, enabling voice-consistent synthesis. The 1.7B Base variant is in tonythethompson/Qwen3-TTS-12Hz-1.7B-Base-ONNX; predefined-speaker variants are in the CustomVoice repos.

Standalone usage (external project)

The snippet below uses the external ElBruno/QwenTTS C# wrapper and references that project's ONNX repo, not this one. It is included only as a reference for standalone C# use.

dotnet add package ElBruno.QwenTTS.VoiceCloning

using ElBruno.QwenTTS.VoiceCloning.Pipeline;

var cloner = await VoiceClonePipeline.CreateAsync();
await cloner.SynthesizeAsync("Hello world!", "reference.wav", "output.wav", "english");

Runtime Notes

  • Designed for ONNX Runtime-compatible runtimes.
  • Output sample rate: 24 kHz.
  • Voice cloning reference: ~3 seconds of reference audio recommended.
  • Validate on the target execution provider before production use.
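
The reference-length and sample-rate notes above can be checked with the standard library before running the pipeline. This is a minimal sketch, assuming the reference clip is an uncompressed PCM WAV file (the only format the `wave` module reads). Treating "~3 seconds" as a hard minimum is this sketch's choice, not a documented requirement, and the function names are hypothetical.

```python
import wave

OUTPUT_SAMPLE_RATE = 24_000          # Hz, from the Runtime Notes
RECOMMENDED_REFERENCE_SECONDS = 3.0  # "~3 seconds" per this card


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed from its frame count and rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_is_long_enough(path, minimum=RECOMMENDED_REFERENCE_SECONDS):
    """True if the clip meets the recommended reference length.

    ASSUMPTION: the recommendation is treated as a lower bound here.
    """
    return wav_duration_seconds(path) >= minimum


def output_duration_seconds(num_samples, sample_rate=OUTPUT_SAMPLE_RATE):
    """Length of a synthesized waveform given its sample count."""
    return num_samples / sample_rate
```

For example, a vocoder output of 48,000 samples corresponds to `output_duration_seconds(48_000)`, i.e. two seconds at 24 kHz.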

Precision and Packaging

Export tooling, precision, and quantization are recorded in the Source table above. This packaging mirror does not publish independent parity benchmarks; validate on your target execution provider before production use.
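
One way to start that validation is to confirm which execution provider a session actually binds to and what inputs the exported graph expects. The sketch below uses the `onnxruntime` Python package (`get_available_providers`, `InferenceSession`, `get_inputs`, `get_providers`). The preference order, the helper names, and any model path passed in are assumptions for illustration, not something this repo prescribes.

```python
DEFAULT_PREFERENCE = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_providers(available, preferred=DEFAULT_PREFERENCE):
    """Keep the preferred providers that are installed, in preference order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {list(preferred)} found in {list(available)}")
    return chosen


def describe_session(model_path, preferred=DEFAULT_PREFERENCE):
    """Load one exported graph and report its bound providers and inputs.

    Requires the third-party onnxruntime package; imported lazily so that
    pick_providers stays usable without it.
    """
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers(), preferred)
    session = ort.InferenceSession(str(model_path), providers=providers)
    inputs = [(arg.name, arg.type, arg.shape) for arg in session.get_inputs()]
    return session.get_providers(), inputs
```

Calling `describe_session` on a local copy of one of the graphs (the vocoder is the smallest) gives a cheap first check. Comparing outputs against the upstream PyTorch model is still needed for numerical parity.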

Limitations

  • Voice cloning quality depends on reference audio quality; noisy or very short clips may degrade results.
  • Multilingual capability: check the upstream model card for the supported languages and per-language quality.
  • No repository-specific audio quality evaluation is documented here.

Safety and Responsible Use

Qwen3-TTS is a voice synthesis and voice cloning model capable of producing realistic speech closely matching a target speaker.

  • Do not use to impersonate real individuals without their explicit consent.
  • Do not generate synthetic speech intended to deceive listeners about a speaker's identity.
  • Disclose AI-generated audio where listeners would reasonably expect a human voice.
  • Users are responsible for compliance with applicable laws governing synthetic media and voice cloning in their jurisdiction.

License

Apache 2.0 — same as Qwen/Qwen3-TTS-12Hz-0.6B-Base. This packaging repo adds no new license terms.
