---
license: mit
language:
- ja
pipeline_tag: text-to-speech
tags:
- speech
- voice
- tts
base_model:
- Aratako/Irodori-TTS-500M-v2
---

# Irodori-TTS-600M-v3-VoiceDesign

[![Code](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Aratako/Irodori-TTS) [![WandB](https://img.shields.io/badge/Training%20Log-WandB-orange)](https://api.wandb.ai/links/aratako-lm/2ctrvcim) [![Demo Space](https://img.shields.io/badge/Demo-HuggingFace%20Space-red)](https://huggingface.co/spaces/Aratako/Irodori-TTS-600M-v3-VoiceDesign-Demo)

**Irodori-TTS-600M-v3-VoiceDesign** is a Japanese Text-to-Speech model based on a Rectified Flow Diffusion Transformer (RF-DiT) architecture. It combines the architectural improvements of the v3 series with the caption-driven control introduced in v2, resulting in a flexible **Multi-modal Voice Design** system.

You can now generate and control speech using any combination of three core elements: **Text (Input) + Reference Speech + Caption Text**. This allows you to retain a specific speaker's vocal identity (via reference audio) while fully directing their emotion, speaking style, and delivery using a descriptive caption and emoji annotations.

## 🌟 Key Features

  * **Multi-modal Voice Design:** Simultaneously condition the generation on a reference audio clip (for voice cloning) and a text caption (for style/emotion control). 
  * **Flow Matching TTS:** Rectified Flow Diffusion Transformer over continuous DACVAE latents for high-quality Japanese speech synthesis.
  * **Emoji-based Style Control:** Embed emojis directly in the input text for granular control over the delivery and sound effects (e.g., laughter, coughing, sighs). See [`EMOJI_ANNOTATIONS.md`](EMOJI_ANNOTATIONS.md) for details.
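
As a rough illustration of how sampling from a rectified-flow model works, the sketch below integrates a velocity field from noise to data with fixed Euler steps. It is not the repository's sampler: `euler_sample`, `make_toy_velocity`, the step count, and the convention that `t=0` is noise and `t=1` is data are all assumptions, and the toy velocity function merely stands in for the trained Diffusion Transformer.

```python
import random


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        # One explicit Euler step along the predicted velocity.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for the trained model.

    Returns the ideal straight-line velocity toward a single target point,
    which is what a rectified flow would learn if the data were one point.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# A fixed seed gives the same starting noise on every run, so any change
# in the output comes only from the conditioning (here: the target).
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
target = [0.5, -1.0, 2.0, 0.0]
sample = euler_sample(make_toy_velocity(target), noise, num_steps=8)
```

Because the toy path is a straight line, Euler integration reaches the target regardless of the step count; a real model's learned field is only approximately straight, which is why the number of steps affects quality.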

## ✨ What's New in v3 VoiceDesign

This version integrates the architectural improvements of v3 with an evolved Voice Design capability:

  * **3-Factor Control (Text + Ref Voice + Caption):** Previously, Voice Design completely replaced the reference audio with a caption. Now, you can use *both*. Clone a voice and dictate *how* they speak via text captions.
  * **Variable-length Training & Duration Predictor:** Trains on variable-length audio and uses a Duration Predictor, improving training efficiency and the Real-Time Factor (RTF) at inference.
  * **Expanded Training Data:** Trained on a larger dataset, resulting in more natural speech synthesis and improved robustness across complex styling combinations.
  * **Integrated Watermarking:** Applies robust, inaudible audio watermarks to generated outputs using [SilentCipher](https://github.com/sony/silentcipher), promoting responsible AI usage.
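
For readers unfamiliar with RTF, here is a minimal sketch of how it is commonly computed. The model card does not define the term, so the convention below (synthesis time divided by audio duration, lower is faster) is an assumption, and both function names are hypothetical. The 48 kHz default matches the output sample rate stated in this card.

```python
def audio_seconds(num_samples, sample_rate=48_000):
    """Duration of a mono waveform in seconds."""
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds, generated_audio_seconds):
    """Assumed convention: time spent synthesizing / duration of the audio.

    Values below 1.0 mean the audio is generated faster than it plays back.
    """
    if generated_audio_seconds <= 0:
        raise ValueError("generated audio must have a positive duration")
    return synthesis_seconds / generated_audio_seconds


# Illustrative numbers only, not measurements of this model:
# 2.5 s of compute for 480,000 samples (10 s of 48 kHz audio).
duration = audio_seconds(480_000)
rtf = real_time_factor(2.5, duration)
```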

---

## 🏗️ Architecture

The model (approximately 600M parameters) consists of five main components:

1.  **Text Encoder:** Token embeddings initialized from [llm-jp/llm-jp-3-150m](https://huggingface.co/llm-jp/llm-jp-3-150m), followed by self-attention + SwiGLU transformer layers with RoPE.
2.  **Reference Latent Encoder:** Encodes patched reference audio latents for speaker identity conditioning.
3.  **Caption Encoder:** Encodes the style-control text (captions) to define the emotion, tone, and acoustic environment.
4.  **Diffusion Transformer:** Joint-attention DiT blocks combining text, reference, and caption conditioning with Low-Rank AdaLN, half-RoPE, and SwiGLU MLPs.
5.  **Duration Predictor:** Predicts audio duration from encoded text and conditioning vectors using stacked SwiGLU MLP blocks.
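
To make the "Low-Rank AdaLN" term in item 4 more concrete, the following sketch shows one common reading of it: the scale and shift that modulate a normalized hidden vector are predicted from the conditioning vector through a small bottleneck instead of a full projection. This is an assumption about the general technique, not a description of this model's exact layer (which may, for example, also predict a gate). All names and sizes are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def layer_norm(x, eps=1e-5):
    """Normalize x to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def low_rank_adaln(x, cond, down, up_scale, up_shift):
    """Modulate normalized x with a scale and shift predicted from cond.

    down:     rank x cond_dim  (shared bottleneck projection)
    up_scale: hidden x rank
    up_shift: hidden x rank
    """
    bottleneck = matvec(down, cond)
    scale = matvec(up_scale, bottleneck)
    shift = matvec(up_shift, bottleneck)
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


def param_counts(hidden, cond_dim, rank):
    """Weight counts for a full projection vs. the low-rank factorization."""
    full = cond_dim * 2 * hidden
    low_rank = rank * cond_dim + 2 * hidden * rank
    return full, low_rank


# Toy sizes chosen only for illustration: hidden=4, cond_dim=3, rank=2.
x = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5, 1.0]
down = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]]
zeros = [[0.0, 0.0] for _ in range(4)]
out = low_rank_adaln(x, cond, down, zeros, zeros)
```

With zero-initialized up-projections the layer reduces to a plain layer norm, and `param_counts` shows why the factorization is attractive when the rank is much smaller than the hidden size.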

Audio is represented as continuous 32-dimensional latent sequences produced by the [Aratako/Semantic-DACVAE-Japanese-32dim](https://huggingface.co/Aratako/Semantic-DACVAE-Japanese-32dim) codec, enabling high-quality 48 kHz waveform reconstruction.

---

## 🎧 Audio Samples

*Note: To clearly demonstrate the effect of captions, the samples within each group below were generated using the **exact same random seed**. The variations in delivery are purely the result of the changed prompts.*

### 1. Pure Voice Design (Text + Caption)
Generate diverse voices and styles purely through descriptive text captions without any reference audio.

| Text (Input) | Caption (Voice Design) | Generated Audio |
| :--- | :--- | :--- |
| 本日はお越しいただき、誠にありがとうございます。どうぞごゆっくりお過ごしください。 | 落ち着いた大人の男性。フォーマルな場で、深く響く声で丁寧かつ歓迎の意を込めて話している。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/sample1_1.wav"></audio> |
| 本日はお越しいただき、誠にありがとうございます。どうぞごゆっくりお過ごしください。 | 若く元気な女性の声。カフェの店員のように、明るくハキハキとした少し高めのトーンで話している。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/sample1_2.wav"></audio> |
| すみません！この近くにコンビニってありますか？ちょっと急いでて、道に迷っちゃったみたいで | 低めの声の男性が、丁寧に道を尋ねている。穏やかで礼儀正しく、余裕のある口調。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/sample2_1.wav"></audio> |
| すみません！この近くにコンビニってありますか？ちょっと急いでて、道に迷っちゃったみたいで | 若い女性が、慌てた様子で早口に話している。焦りと不安が声ににじんでいる。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/sample2_2.wav"></audio> |

### 2. Style-Controlled Voice Cloning (Text + Caption + Ref Speech)
Clone a voice using reference audio, and dictate the specific emotion or delivery style using a caption.

| Text (Input) | Ref Audio | Caption (Voice Design) | Generated Audio |
| :--- | :--- | :--- | :--- |
| どうしてもっと早く教えてくれなかったの？私、ずっと待ってたのに。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/clone_ref1.wav"></audio> | 深く傷つき、今にも泣き出しそうな様子。声が震えており、悲痛なトーンで弱々しく話す。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/clone_gen_1.wav"></audio> |
| どうしてもっと早く教えてくれなかったの？私、ずっと待ってたのに。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/clone_ref1.wav"></audio> | 激しい怒りを感じており、声を荒らげている。相手を責め立てるような強い口調で、感情的なトーン。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/clone_gen_2.wav"></audio> |
| どうしてもっと早く教えてくれなかったの？私、ずっと待ってたのに。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/clone_ref1.wav"></audio> | 完全に呆れ返っている様子。感情の起伏が乏しく、冷たいトーンで静かに突き放すように話す。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/clone_gen_3.wav"></audio> |

### 3. Fully Controlled Generation (Text + Caption + Ref Speech + Emoji)
Combine all control inputs for maximum expressiveness, using emojis to add specific non-verbal sounds (sighs, coughs) or distinct nuances on top of the cloned and styled voice.

| Text (with Emoji) | Ref Audio | Caption (Voice Design) | Generated Audio |
| :--- | :--- | :--- | :--- |
| あははっ🤭、それ本当に言ってるの？…😮‍💨まぁ、君らしいけどね。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/clone_ref2.wav"></audio> | 余裕のある大人の男性。親しい相手に対して、くだけた雰囲気で呆れながらも楽しそうに話している。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/emoji_gen1.wav"></audio> |
| ゲホッ、ゲホッ🤧…ごめん、少し休ませて。😭今日はもう無理みたい。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/clone_ref1.wav"></audio> | 体調が悪く、非常に苦しそうな若い女性。息も絶え絶えに、申し訳なさそうに弱々しい声で話している。 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign/resolve/main/samples/emoji_gen2.wav"></audio> |

---

## 🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

👉 **[GitHub: Aratako/Irodori-TTS](https://github.com/Aratako/Irodori-TTS)**

## 📊 Training Data & Annotation

The model was trained on an expanded, high-quality Japanese speech dataset. To enable the multi-modal Voice Design functionality, the training data was enriched with comprehensive text captions describing the audio characteristics. 

The emoji annotations and initial text captions were generated and labeled using a fine-tuned model based on [Qwen/Qwen3-Omni-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct). Subsequently, the text captions were rephrased and refined using [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B).

## ⚠️ Limitations

  - **Japanese Only:** This model currently supports Japanese text input only.
  - **Conditioning Conflicts:** When using *both* Reference Audio and a Text Caption, providing contradictory instructions (e.g., providing a deep male reference voice but captioning "a high-pitched young girl") may result in unstable audio quality, unnatural artifacts, or one condition overriding the other. For optimal results, use the caption to guide the *emotion, style, or environment*, while keeping the base voice characteristics aligned with the reference audio.
  - **Prompt Adherence:** While the model generally follows the caption's instructions, highly complex or contradictory descriptions might result in inconsistent voice generation.
  - **Emoji Control:** While emoji-based style control adds expressiveness, the effect may vary depending on context and is not always perfectly consistent.
  - **Kanji Reading Accuracy:** The model's ability to accurately read Kanji is relatively weak compared to other TTS models of a similar size. You may need to convert complex Kanji into Hiragana or Katakana beforehand.
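
One simple way to apply the Kanji workaround is to keep a small table of reading overrides and substitute them before passing text to the model. The sketch below is a minimal illustration using only the standard library; `READING_OVERRIDES` and `apply_reading_overrides` are hypothetical names, and the table would need to be filled with whichever words are misread in practice.

```python
# Hypothetical override table: written form -> Hiragana reading.
READING_OVERRIDES = {
    "雰囲気": "ふんいき",
    "呆れ": "あきれ",
}


def apply_reading_overrides(text, overrides):
    """Replace listed words with their kana readings.

    Longer keys are applied first so that a long entry is not broken up
    by a shorter entry that matches part of it.
    """
    for word in sorted(overrides, key=len, reverse=True):
        text = text.replace(word, overrides[word])
    return text


converted = apply_reading_overrides("くだけた雰囲気で呆れながら話す。", READING_OVERRIDES)
```

A dictionary-based approach only covers the words you list; for broad coverage a morphological analyzer would be needed, which is outside the scope of this sketch.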

## 📜 License & Ethical Restrictions

### License

This model is released under **[MIT](https://choosealicense.com/licenses/mit/)**.

### Ethical Restrictions

In addition to the license terms, the following ethical restrictions apply:

1.  **No Impersonation:** Do not use this model to clone or impersonate the voice of any individual (e.g., voice actors, celebrities, public figures) without their explicit consent.
2.  **No Misinformation:** Do not use this model to generate deepfakes or synthetic speech intended to mislead others or spread misinformation.
3.  **Voice Generation Disclaimer:** When generating speech purely from text or captions without using a reference audio, it is possible that the generated voice may coincidentally resemble that of a real person. This is strictly a probabilistic artifact within the latent space. The model was not trained with the intent of reproducing specific individuals.
4.  **Liability Disclaimer:** The developers assume no liability for any misuse of this model. Users are solely responsible for ensuring their use of the generated content complies with applicable laws and regulations in their jurisdiction.

## 🙏 Acknowledgments

This project builds upon the following works:

  - [Echo-TTS](https://jordandarefsky.com/blog/2025/echo/): Architecture and training design reference
  - [DACVAE](https://github.com/facebookresearch/dacvae): Audio VAE
  - [llm-jp/llm-jp-3-150m](https://huggingface.co/llm-jp/llm-jp-3-150m): Tokenizer and embedding weight initialization
  - [SilentCipher](https://github.com/sony/silentcipher): Audio watermarking integration

We would also like to extend our special thanks to **[Respair](https://huggingface.co/Respair)** for the inspiration behind the emoji annotation feature, and to [gabrielclark3330](https://huggingface.co/gabrielclark3330) for supporting this project.

## 🖊️ Citation

If you use Irodori-TTS in your research or project, please cite it as follows:

```bibtex
@misc{irodori-tts-v3-voicedesign,
  author = {Chihiro Arata},
  title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign}}
}
```