---
license: cc-by-nc-sa-4.0
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
pipeline_tag: text-to-speech
library_name: transformers
language:
  - en
tags:
  - tts
  - qwen
  - qwen3
  - qwen3-tts
  - voice-design
  - lora
  - fine-tuned
  - audio
  - expressive
---

# Qwen3-TTS VoiceDesign — T5

A fine-tune of `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` focused on **expressive prompt following**: controllability of emotion, pace, and affect under free-form English voice descriptions. The model trades a small amount of intelligibility headroom for substantially more expressive output. For prompts that ask for *sad*, *whispered*, *projected*, *sarcastic*, *bedtime-storyteller*, and similar styles, its output follows the description noticeably more closely than the base model's.

- **Base:** Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
- **Method:** LoRA on the Talker's attention + MLP projections, merged back into the base weights
- **Training data:** EARS (multi-speaker reads in a wide range of styles) + Expresso (high-quality expressive performances), each paired with free-form natural-language captions
- **Output:** 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec

This repo is **self-contained** — it ships the merged transformer weights, the audio codec (`speech_tokenizer/`), the tokenizer, and all configs. No other HF repo needs to be downloaded at inference time.

## Quick start

Install the Qwen3-TTS inference package (it registers the custom `Qwen3TTSForConditionalGeneration` model class with `transformers`):

```bash
pip install qwen-tts transformers torch soundfile
```

Generate a clip:

```python
from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the merged weights, codec, and tokenizer from this repo.
wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t5")

# `text` is what gets spoken; `instruct` is the free-form voice description.
wavs, sr = wrap.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)

# Write the first generated waveform at the returned sample rate.
sf.write("out.wav", wavs[0], sr)
```

A ready-to-run version with three example prompts is provided at [`example_inference.py`](example_inference.py).
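
When picking `max_new_tokens`, it can help to translate the token budget into seconds of audio. The sketch below assumes one generated token per codec frame and takes a nominal 12 Hz frame rate from the model name. Both are assumptions rather than documented guarantees of `qwen-tts`, and the helper names are hypothetical.

```python
import math

# Assumptions (not confirmed by this card): one generated token per codec
# frame, and a nominal frame rate of 12 Hz taken from the model name.
NOMINAL_FRAME_RATE_HZ = 12.0


def max_duration_seconds(max_new_tokens: int,
                         frame_rate_hz: float = NOMINAL_FRAME_RATE_HZ) -> float:
    """Upper bound on clip length, in seconds, for a given token budget."""
    return max_new_tokens / frame_rate_hz


def tokens_for_duration(seconds: float,
                        frame_rate_hz: float = NOMINAL_FRAME_RATE_HZ) -> int:
    """Smallest token budget that covers `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


print(max_duration_seconds(600))  # budget used in the quick start
print(tokens_for_duration(20.0))  # budget for a 20-second clip
```

If the codec's actual frame rate differs from the nominal value, pass it explicitly through `frame_rate_hz`.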

### The `instruct` prompt format

The `instruct` field is free-form English describing the voice. The training distribution covers:

- **gender** — *"a male/female speaker"*, *"a deep-voiced narrator"*
- **pitch** — *"high/medium/low pitched"*, *"deep"*, *"thin and high"*
- **speed** — *"slowly"*, *"at a brisk pace"*, *"at a moderate tempo"*
- **affect / emotion** — *"happy"*, *"angry"*, *"sad"*, *"whispered"*, *"sarcastic"*, *"projected"*
- **scene / persona** — *"a bedtime storyteller"*, *"a news anchor"*, *"a sports announcer at the climax of a play"*

Example prompts:

```
A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks softly with a sad tone, low energy, almost whispering.
An older male narrator reads a bedtime story slowly, with warmth.
A high-pitched announcer projects an exciting headline at a fast pace.
```
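
Prompts like these can also be assembled programmatically. The helper below is a hypothetical convenience, not part of `qwen-tts`, and the model does not require this exact sentence shape. It simply puts the gender first, which is the ordering suggested as a mitigation under Known limitations.

```python
def build_instruct(gender: str, emotion: str,
                   pace: str = "moderate", extra: str = "") -> str:
    """Compose a free-form voice description that leads with the gender.

    Hypothetical helper: the sentence template is an illustration only.
    """
    sentence = f"A {gender} speaker, {emotion}, speaking at a {pace} pace"
    if extra:
        sentence += f", {extra}"
    return sentence + "."


prompts = [
    build_instruct("male", "sad and quiet", pace="slow"),
    build_instruct("female", "excited", pace="fast",
                   extra="projecting like a sports announcer"),
]
for prompt in prompts:
    print(prompt)
```

Each resulting string can be passed as the `instruct` argument shown in the quick start.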

## How the adapter was trained

This adapter follows a corrected training protocol designed to fix four issues that fail silently in earlier, naive fine-tuning recipes for VoiceDesign:

1. **Dual-track input layout.** Training-time `inputs_embeds` is built by the exact element-wise sum of text-track and codec-track embeddings used by `Qwen3TTSForConditionalGeneration.generate`'s VoiceDesign path — including the 5-position English think-prefix on the codec track. This matches inference exactly, instead of approximating it with a chat-templated prompt + boundary switch.
2. **Single-shift loss.** The loss is computed manually as `F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100)`. The `labels=` argument is never passed into the wrapped forward, avoiding the double-shift that occurs when PEFT's wrapped CausalLMLoss adds its own internal shift on top of the collator's.
3. **Conservative LR for LoRA on a 1.7 B base.** Peak LR `2.0e-5`, cosine schedule, with `min_lr_ratio=0.2` so the late-training LR stays high enough to keep learning rather than plateauing.
4. **No sub-talker loss with a frozen Code Predictor.** The sub-talker auxiliary loss is disabled (`weight=0.0`) when the Code Predictor isn't part of the LoRA scope — this combination is known to corrupt training.
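
Items 2 and 3 are the easiest to pin down numerically. The following is a pure-Python sketch, not the training code: `single_shift_loss` mirrors the single shift in the `F.cross_entropy` shorthand for one sequence, and `cosine_lr` reads `min_lr_ratio` as the ratio of the final LR to the peak LR, which is an assumption about how the schedule was parameterised. Warmup is not modelled because the card does not describe one. In PyTorch itself, the logits would also need to be flattened or transposed so the class dimension sits where `F.cross_entropy` expects it.

```python
import math

IGNORE_INDEX = -100


def single_shift_loss(logits, labels):
    """Mean next-token cross-entropy with exactly one shift.

    logits: list of T per-position score lists (one score per class).
    labels: list of T target ids; IGNORE_INDEX marks masked positions.
    The logits at position t are scored against the label at t + 1.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # masked positions contribute nothing
        # Numerically stable log-sum-exp for the softmax normaliser.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


def cosine_lr(step, total_steps, peak_lr=2.0e-5, min_lr_ratio=0.2):
    """Cosine decay from peak_lr to peak_lr * min_lr_ratio (no warmup).

    Assumes min_lr_ratio means final LR divided by peak LR.
    """
    progress = min(step, total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# Uniform scores over 4 classes give a loss of log(4) on the one unmasked target.
print(single_shift_loss([[0.0] * 4] * 3, [IGNORE_INDEX, 2, IGNORE_INDEX]))
print(cosine_lr(0, 1000), cosine_lr(1000, 1000))
```

Under this reading, the schedule bottoms out at roughly `4.0e-6` instead of decaying to zero.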

The adapter is LoRA `r=16, α=32, dropout=0.05` on the Talker's `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` projections only. The Code Predictor and audio codec are frozen end-to-end. Training data combines EARS (clean multi-speaker reads with style descriptors) and Expresso (high-quality expressive performances at 48 kHz, downsampled to 24 kHz to match the base's native rate). Captions are free-form natural-language prose, one canonical caption per clip — no templated descriptions.

The final adapter (~19 M parameters, ~77 MB at fp32) was permanently merged into the Talker weights for this repo so inference does not require PEFT.
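
The relationship between parameter count and on-disk size is simple arithmetic, sketched below. The layer shape in the example is hypothetical and is not read from the Talker config; only the rank `r=16` and the ~19 M total come from this card.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one d_in -> d_out linear layer.

    The low-rank update uses A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)


def size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Storage in megabytes (10^6 bytes); fp32 uses 4 bytes per parameter."""
    return num_params * bytes_per_param / 1e6


# Hypothetical 2048 -> 2048 projection, for illustration only.
print(lora_params(d_in=2048, d_out=2048, r=16))
# Size of roughly 19 M parameters stored at fp32.
print(size_mb(19_000_000))
```

At 4 bytes per parameter, about 19 M parameters come to about 76 MB, in line with the rounded figures above.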

## Strengths

- **Better emotion + affect rendering** on prompts that ask for it (e.g. *whispered*, *sad*, *projected*, *bedtime-storyteller*) versus the base model.
- **Better persona / scene composition** — prompts that combine an emotion with a scene (a stern parent, an excited sports announcer) come through more clearly.
- **No identity drift on neutral prompts.** Plain "a clear neutral voice" prompts produce output that is acoustically close to the base model — the adapter doesn't "color" everything.

## Known limitations

- **Gender drift on strong-emotion prompts.** Some `sad_male`, `sad_female`, and `fear_female` prompts can render with the wrong-gender timbre. Root cause: the training corpora's emotion-axis coverage is concentrated on a handful of speakers, so strongly emotional descriptions act partially as speaker-identity cues. Mitigation in the prompt: lead with the gender (*"A male speaker, sad and quiet, …"*) rather than the emotion.
- **Slight robotic tone on extreme prompts.** A small number of `fear_male_normal_slow` and similar prompts produce flatter prosody than the base. This trade-off was accepted in exchange for the broader expressive lift.
- **English only.** All training and evaluation used English prompts and English text. The base model supports 10 languages; the other nine were not part of training and have not been validated against this adapter's modified CB-0 (first-codebook) distribution.
- **Research / non-commercial use only** — see license.

## License

- Base model weights (`Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign`): **Apache 2.0**.
- Training data:
  - **EARS:** CC BY-NC-SA 4.0 (research / non-commercial).
  - **Expresso:** CC BY-NC 4.0 (research / non-commercial).

Because both training corpora carry non-commercial restrictions, the derived model effectively inherits a **CC BY-NC-SA 4.0** constraint: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially licensed corpus.

## References

- Base model: [Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign)
- Inference library: [`qwen-tts` on PyPI](https://pypi.org/project/qwen-tts/)
- EARS dataset: [Expressive Anechoic Recordings of Speech](https://github.com/facebookresearch/ears_dataset)
- Expresso dataset: [`ylacombe/expresso`](https://huggingface.co/datasets/ylacombe/expresso)