---
license: apache-2.0
tags:
- text-to-speech
- onnx
- voice-cloning
- cpu-inference
- qwen3-tts
pipeline_tag: text-to-speech
library_name: onnxruntime
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
---

# 🎙️ Qwen3-TTS-12Hz-1.7B-Base (ONNX)

## 🚀 Overview

**Qwen3-TTS-12Hz-1.7B-Base-ONNX** is an ONNX conversion of the Qwen3-TTS-12Hz-1.7B-Base model. It implements a discrete multi-codec Language Model (LM) architecture capable of **3-second rapid voice cloning** with enhanced prosody and vocal fidelity. The ONNX conversion enables low-latency, cross-platform deployment on both high-end CPUs and NVIDIA GPUs.

## 💎 Key Features

* **Zero-Shot Voice Cloning**: High-similarity cloning (>97%) using only 3 seconds of reference audio.
* **Ultra-Low Latency**: End-to-end streaming generation with latency as low as **97ms**.
* **Decoupled Architecture**: Separate components for text processing, token generation, and speech synthesis.
* **Multilingual Excellence**: Native-level pronunciation for 10 major global languages.
* **Vocal Richness**: 2048-dimensional speaker embeddings for superior similarity.

## 🏗️ Model Architecture

The model is a modular pipeline consisting of:

* **Talker (Transformer)**: 28 layers (Hidden Size: 2048, 8 KV Heads).
* **Code Predictor**: 5-layer Transformer for multi-codec resolution.
* **Vocoder**: BigVGAN-based high-fidelity speech decoder.
* **Speaker Encoder**: ECAPA-TDNN for embedding extraction.

## 📦 Model Components (Modular Specs)

| Component | File | Description | Output |
| :--- | :--- | :--- | :--- |
| **Talker Prefill** | `talker_prefill.onnx` | Initial text processing & KV Cache setup. | Logits & Hidden states. |
| **Talker Decode** | `talker_decode.onnx` | Iterative token generation logic. | New KV Cache. |
| **Code Predictor** | `code_predictor.onnx` | Multi-codec prediction (12Hz). | Multi-codebook codes. |
| **Vocoder** | `vocoder.onnx` | Final waveform synthesis. | 24kHz Audio. |
| **Speaker Enc.** | `speaker_encoder.onnx` | Reference audio analysis. | 2048-dim Embedding. |

## 🛠️ Installation

```bash
pip install onnxruntime-gpu librosa soundfile numpy torch transformers