---
license: cc-by-nc-sa-4.0
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
pipeline_tag: text-to-speech
library_name: transformers
language:
- en
tags:
- tts
- qwen
- qwen3
- qwen3-tts
- voice-design
- lora
- fine-tuned
- audio
- expressive
---

# Qwen3-TTS VoiceDesign - T5

A fine-tune of `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` focused on **expressive prompt following**: emotion, pace, and affect controllability under free-form English voice descriptions.

The model trades a small amount of intelligibility headroom for substantially more expressive output. Prompts that ask for *sad*, *whispered*, *projected*, *sarcastic*, *bedtime-storyteller*, etc. land noticeably closer to the description than the base model's output does.

- **Base:** Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
- **Method:** LoRA on the Talker's attention + MLP projections, merged back into the base weights
- **Training data:** EARS (rich-style multi-speaker reads) + Expresso (high-quality expressive performances) with free-form natural-language captions
- **Output:** 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec

This repo is **self-contained**: it ships the merged transformer weights, the audio codec (`speech_tokenizer/`), the tokenizer, and all configs. No other HF repo needs to be downloaded at inference time.

## Quick start

Install the Qwen3-TTS inference package (it registers the custom `Qwen3TTSForConditionalGeneration` model class with `transformers`):

```bash
pip install qwen-tts transformers torch soundfile
```

Generate a clip:

```python
from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the merged model; PEFT is not needed at inference time.
wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t5")

# `text` is what gets spoken; `instruct` describes the voice.
wavs, sr = wrap.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.05,
    max_new_tokens=600,
)

# Save the first generated clip at the returned sample rate.
sf.write("out.wav", wavs[0], sr)
```

A ready-to-run version with three example prompts is provided at [`example_inference.py`](example_inference.py).

### The `instruct` prompt format

The `instruct` field is free-form English describing the voice. The training distribution covers:

- **gender**: *"a male/female speaker"*, *"a deep-voiced narrator"*
- **pitch**: *"high/medium/low pitched"*, *"deep"*, *"thin and high"*
- **speed**: *"slowly"*, *"at a brisk pace"*, *"at a moderate tempo"*
- **affect / emotion**: *"happy"*, *"angry"*, *"sad"*, *"whispered"*, *"sarcastic"*, *"projected"*
- **scene / persona**: *"a bedtime storyteller"*, *"a news anchor"*, *"a sports announcer at the climax of a play"*

Example prompts:

```
A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks softly with a sad tone, low energy, almost whispering.
An older male narrator reads a bedtime story slowly, with warmth.
A high-pitched announcer projects an exciting headline at a fast pace.
```

## How the adapter was trained

This adapter follows a corrected training protocol designed to fix four silent issues common in earlier naive recipes for VoiceDesign:

1. **Dual-track input layout.** Training-time `inputs_embeds` is built by the exact element-wise sum of text-track and codec-track embeddings used by `Qwen3TTSForConditionalGeneration.generate`'s VoiceDesign path, including the 5-position English think-prefix on the codec track. This matches inference exactly, instead of approximating it with a chat-templated prompt + boundary switch.
2. **Single-shift loss.** Labels are computed manually as `F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100)`. The `labels=` argument is never passed into the wrapped forward, which avoids the double shift that occurs when PEFT's wrapped CausalLMLoss adds its own internal shift on top of the collator's.
3. **Conservative LR for LoRA on a 1.7 B base.** Peak LR `2.0e-5`, cosine schedule, with `min_lr_ratio=0.2` so the late-training LR stays high enough to keep learning rather than plateauing.
4. **No sub-talker loss with a frozen Code Predictor.** The sub-talker auxiliary loss is disabled (`weight=0.0`) when the Code Predictor isn't part of the LoRA scope; this combination is known to corrupt training.

The adapter is LoRA `r=16, α=32, dropout=0.05` on the Talker's `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` projections only. The Code Predictor and audio codec are frozen end-to-end.

Training data combines EARS (clean multi-speaker reads with style descriptors) and Expresso (high-quality expressive performances at 48 kHz, downsampled to 24 kHz to match the base's native rate). Captions are free-form natural-language prose, one canonical caption per clip, with no templated descriptions.

The final adapter (~19 M parameters, ~77 MB at fp32) was permanently merged into the Talker weights for this repo, so inference does not require PEFT.

## Strengths

- **Better emotion + affect rendering** on prompts that ask for it (e.g. *whispered*, *sad*, *projected*, *bedtime-storyteller*) versus the base model.
- **Better persona / scene composition.** Prompts that combine an emotion with a scene (a stern parent, an excited sports announcer) come through more clearly.
- **No identity drift on neutral prompts.** Plain "a clear neutral voice" prompts produce output that is acoustically close to the base model; the adapter doesn't "color" everything.

## Known limitations

- **Gender drift on strong-emotion prompts.** Some `sad_male`, `sad_female`, and `fear_female` prompts can render with the wrong-gender timbre. Root cause: the training corpora's emotion-axis coverage is concentrated on a handful of speakers, so strongly emotional descriptions act partially as speaker-identity cues. Mitigation in the prompt: lead with the gender (*"A male speaker, sad and quiet, …"*) rather than the emotion.
- **Slight robotic tone on extreme prompts.** A small number of `fear_male_normal_slow` and similar prompts produce flatter prosody than the base. This trade-off was accepted in exchange for the broader expressive lift.
- **English only.** All training and evaluation used English prompts and English text. The base model supports 10 languages; they are untouched but have not been validated against this adapter's modified CB-0 (first codebook) distribution.
- **Research / non-commercial use only.** See the license section.

## License

- Base model weights (`Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign`): **Apache 2.0**.
- Training data:
  - **EARS:** CC BY-NC-SA 4.0 (research / non-commercial).
  - **Expresso:** CC BY-NC 4.0 (research / non-commercial).

Because both training corpora carry non-commercial restrictions, the derived model effectively inherits a **CC BY-NC-SA 4.0** constraint: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially-licensed corpus.

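## Sketch: gender-first prompts

The mitigation listed under known limitations (lead with the gender rather than the emotion) can be sketched as a small string helper. This is an illustration only: `build_instruct` and its parameters are hypothetical names that are not part of `qwen-tts`, the `instruct` field remains free-form prose, and nothing here guarantees that gender drift disappears for a given prompt.

```python
def build_instruct(gender, emotion=None, pace=None, extras=()):
    """Compose a voice description that names the gender before any emotion.

    Hypothetical helper for illustration; the model accepts any free-form
    English description, so this ordering is a convention, not a requirement.
    """
    if gender not in ("male", "female"):
        raise ValueError("gender must be 'male' or 'female'")

    # The gender cue comes first so a strong emotion word does not lead the prompt.
    parts = [f"A {gender} speaker"]
    if emotion:
        parts.append(emotion)
    if pace:
        parts.append(pace)
    # Any further free-form descriptors are appended last.
    parts.extend(extras)
    return ", ".join(parts) + "."


prompt = build_instruct("male", emotion="sad and quiet", pace="speaking slowly")
print(prompt)
```

The resulting string would be passed as the `instruct=` argument of `generate_voice_design` shown in the quick start.
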
## References

- Base model: [Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign)
- Inference library: [`qwen-tts` on PyPI](https://pypi.org/project/qwen-tts/)
- EARS dataset: [Effortless and Realistic Speech Dataset](https://github.com/facebookresearch/ears_dataset)
- Expresso dataset: [`ylacombe/expresso`](https://huggingface.co/datasets/ylacombe/expresso)
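
## Sketch: single-shift loss and LR floor

Items 2 and 3 of the training protocol can be illustrated with a dependency-free sketch. It is not the training code: the function names are hypothetical, the loss is a pure-Python stand-in for the `F.cross_entropy` call on a single sequence, and the schedule assumes no warmup and one common way of applying `min_lr_ratio` (the cosine decays from the peak LR down to `peak_lr * min_lr_ratio`).

```python
import math

IGNORE_INDEX = -100


def single_shift_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean next-token cross-entropy for one sequence, shifted exactly once.

    logits: list of T rows, each a list of V floats.
    labels: list of T ints aligned position-for-position with the inputs.
    """
    total, count = 0.0, 0
    # logits[t] is scored against labels[t + 1]; this is the only shift.
    for row, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # masked positions contribute nothing
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


def cosine_lr(step, total_steps, peak_lr=2.0e-5, min_lr_ratio=0.2):
    """Cosine decay from peak_lr to peak_lr * min_lr_ratio (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# Row 0 strongly predicts class 2, and labels[1] == 2, so the loss is near zero.
print(single_shift_cross_entropy([[0.0, 0.0, 10.0], [0.0, 0.0, 0.0]], [0, 2]))
print(cosine_lr(0, 100), cosine_lr(100, 100))
```

If a wrapper shifted the labels a second time, `logits[t]` would be scored against `labels[t + 2]`, which is the double shift the protocol avoids by never passing `labels=` into the wrapped forward.
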