---
license: mit
language:
- ja
pipeline_tag: text-to-speech
tags:
- speech
- voice
- tts
base_model:
- Aratako/Irodori-TTS-500M-v2
---

# Irodori-TTS-600M-v3-VoiceDesign

[![Code](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Aratako/Irodori-TTS)
[![WandB](https://img.shields.io/badge/Training%20Log-WandB-orange)](https://api.wandb.ai/links/aratako-lm/2ctrvcim)
[![Demo Space](https://img.shields.io/badge/Demo-HuggingFace%20Space-red)](https://huggingface.co/spaces/Aratako/Irodori-TTS-600M-v3-VoiceDesign-Demo)

**Irodori-TTS-600M-v3-VoiceDesign** is a Japanese Text-to-Speech model based on a Rectified Flow Diffusion Transformer (RF-DiT) architecture. It combines the architectural enhancements of the v3 series with the caption-driven control concept from v2 and introduces a flexible **Multi-modal Voice Design** system.

You can now generate and control speech using any combination of three core elements: **Text (Input) + Reference Speech + Caption Text**. This lets you retain a specific speaker's vocal identity (via reference audio) while directing their emotion, speaking style, and delivery with a descriptive caption and emoji annotations.

## 🌟 Key Features

* **Multi-modal Voice Design:** Simultaneously condition the generation on a reference audio clip (for voice cloning) and a text caption (for style/emotion control).
* **Flow Matching TTS:** Rectified Flow Diffusion Transformer over continuous DACVAE latents for high-quality Japanese speech synthesis.
* **Emoji-based Style Control:** Embed emojis directly in the input text for granular control over the delivery and sound effects (e.g., laughter, coughing, sighs). See [`EMOJI_ANNOTATIONS.md`](EMOJI_ANNOTATIONS.md) for details.
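
The three inputs described above can be thought of as one request with a required text field and two optional conditioning fields. The following is a minimal sketch of that idea only. `VoiceDesignRequest`, its field names, the file name, and the mode labels are hypothetical and are not the API of the Irodori-TTS repository; refer to the GitHub repository for the actual inference interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceDesignRequest:
    """Hypothetical container for the three inputs (not the real API)."""

    text: str                              # text to be spoken; may contain emoji annotations
    reference_audio: Optional[str] = None  # path to a reference clip for voice cloning
    caption: Optional[str] = None          # free-text description of style and emotion

    def __post_init__(self) -> None:
        # The spoken text is the only required input in this sketch.
        if not self.text.strip():
            raise ValueError("text must not be empty")

    def mode(self) -> str:
        """Name the conditioning combination; the labels are illustrative only."""
        has_ref = self.reference_audio is not None
        has_caption = self.caption is not None
        if has_ref and has_caption:
            return "style-controlled voice cloning"
        if has_caption:
            return "pure voice design"
        if has_ref:
            return "voice cloning"
        return "text only"


request = VoiceDesignRequest(
    text="こんにちは。",
    reference_audio="speaker.wav",  # hypothetical file name
    caption="落ち着いたトーンで話している。",
)
print(request.mode())  # style-controlled voice cloning
```

Keeping the reference audio and the caption as independent optional fields mirrors the "any combination" wording above: dropping either one changes the mode without changing the text.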
## ✨ What's New in v3 VoiceDesign

This version integrates the architectural improvements of v3 with an evolved Voice Design capability:

* **3-Factor Control (Text + Ref Voice + Caption):** Previously, Voice Design completely replaced the reference audio with a caption. Now you can use *both*: clone a voice and dictate *how* it speaks via text captions.
* **Variable-length Training & Duration Predictor:** Uses variable-length training and a Duration Predictor, improving training efficiency and the Real-Time Factor (RTF) during inference.
* **Expanded Training Data:** Trained on a larger dataset, resulting in more natural speech synthesis and improved robustness across complex styling combinations.
* **Integrated Watermarking:** Integrates [SilentCipher](https://github.com/sony/silentcipher) to apply robust, inaudible audio watermarks directly to the generated outputs, promoting responsible AI usage.

---

## 🏗️ Architecture

The model (approximately 600M parameters) consists of five main components:

1. **Text Encoder:** Token embeddings initialized from [llm-jp/llm-jp-3-150m](https://huggingface.co/llm-jp/llm-jp-3-150m), followed by self-attention + SwiGLU transformer layers with RoPE.
2. **Reference Latent Encoder:** Encodes patched reference audio latents for speaker identity conditioning.
3. **Caption Encoder:** Encodes the style-control text (captions) to define the emotion, tone, and acoustic environment.
4. **Diffusion Transformer:** Joint-attention DiT blocks combining text, reference, and caption conditioning with Low-Rank AdaLN, half-RoPE, and SwiGLU MLPs.
5. **Duration Predictor:** Predicts audio duration from encoded text and conditioning vectors using stacked SwiGLU MLP blocks.

Audio is represented as continuous latent sequences via the [Aratako/Semantic-DACVAE-Japanese-32dim](https://huggingface.co/Aratako/Semantic-DACVAE-Japanese-32dim) codec (32-dim), enabling high-quality 48kHz waveform reconstruction.

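
To give a feel for how a rectified-flow model turns noise into a latent frame, here is a toy sketch of fixed-step Euler integration along a velocity field. It is not the model's sampler. The time convention (noise at t = 0, data at t = 1), the step count, the 3-dim `target`, and the `oracle_velocity` function standing in for the Diffusion Transformer are all assumptions made for illustration.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int = 8,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1 inside the loop
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Pretend this is one (tiny) latent frame the model should produce.
target = [0.5, -1.0, 2.0]


def oracle_velocity(x: Vector, t: float) -> Vector:
    """Stand-in for the network: exact velocity of the straight path
    x_t = (1 - t) * noise + t * target, expressed through the current x."""
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(oracle_velocity, noise)

# Because the oracle path is straight, Euler steps follow it exactly and the
# sample matches `target` up to floating-point error.
print([round(s, 6) for s in sample])
```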

---

## 🎧 Audio Samples

*Note: To clearly demonstrate the effect of captions, the samples within each group below were generated using the **exact same random seed**. The variations in delivery are purely the result of the changed prompts.*

### 1. Pure Voice Design (Text + Caption)

Generate diverse voices and styles purely through descriptive text captions without any reference audio.

| Text (Input) | Caption (Voice Design) | Generated Audio |
| :--- | :--- | :--- |
| 本日はお越しいただき、誠にありがとうございます。どうぞごゆっくりお過ごしください。 (Thank you very much for coming today. Please make yourself at home.) | 落ち着いた大人の男性。フォーマルな場で、深く響く声で丁寧かつ歓迎の意を込めて話している。 (A calm adult man. In a formal setting, he speaks politely and welcomingly in a deep, resonant voice.) | |
| 本日はお越しいただき、誠にありがとうございます。どうぞごゆっくりお過ごしください。 (Thank you very much for coming today. Please make yourself at home.) | 若く元気な女性の声。カフェの店員のように、明るくハキハキとした少し高めのトーンで話している。 (A young, energetic female voice. She speaks like a café clerk, in a bright, crisp, slightly high-pitched tone.) | |
| すみません!この近くにコンビニってありますか?ちょっと急いでて、道に迷っちゃったみたいで (Excuse me! Is there a convenience store near here? I'm in a bit of a hurry, and I think I've gotten lost) | 低めの声の男性が、丁寧に道を尋ねている。穏やかで礼儀正しく、余裕のある口調。 (A man with a low voice politely asking for directions. Calm, courteous, and unhurried.) | |
| すみません!この近くにコンビニってありますか?ちょっと急いでて、道に迷っちゃったみたいで (Excuse me! Is there a convenience store near here? I'm in a bit of a hurry, and I think I've gotten lost) | 若い女性が、慌てた様子で早口に話している。焦りと不安が声ににじんでいる。 (A young woman speaking quickly and in a fluster. Impatience and anxiety show in her voice.) | |

### 2. Style-Controlled Voice Cloning (Text + Caption + Ref Speech)

Clone a voice using reference audio, and dictate the specific emotion or delivery style using a caption.

| Text (Input) | Ref Audio | Caption (Voice Design) | Generated Audio |
| :--- | :--- | :--- | :--- |
| どうしてもっと早く教えてくれなかったの?私、ずっと待ってたのに。 (Why didn't you tell me sooner? I was waiting the whole time.) | | 深く傷つき、今にも泣き出しそうな様子。声が震えており、悲痛なトーンで弱々しく話す。 (Deeply hurt and on the verge of tears. The voice trembles, and the delivery is weak and sorrowful.) | |
| どうしてもっと早く教えてくれなかったの?私、ずっと待ってたのに。 (Why didn't you tell me sooner? I was waiting the whole time.) | | 激しい怒りを感じており、声を荒らげている。相手を責め立てるような強い口調で、感情的なトーン。 (Feeling intense anger and raising the voice. A forceful, accusatory manner with an emotional tone.) | |
| どうしてもっと早く教えてくれなかったの?私、ずっと待ってたのに。 (Why didn't you tell me sooner? I was waiting the whole time.) | | 完全に呆れ返っている様子。感情の起伏が乏しく、冷たいトーンで静かに突き放すように話す。 (Completely exasperated. Little emotional variation; speaks quietly in a cold, dismissive tone.) | |

### 3. Fully Controlled Generation (Text + Caption + Ref Speech + Emoji)

Combine all control inputs for maximum expressiveness, adding specific physiological sounds (sighs, coughs) or distinct nuances via emojis on top of the cloned and styled voice.

| Text (with Emoji) | Ref Audio | Caption (Voice Design) | Generated Audio |
| :--- | :--- | :--- | :--- |
| あははっ🤭、それ本当に言ってるの?…😮‍💨まぁ、君らしいけどね。 (Ahaha 🤭, are you serious? … 😮‍💨 Well, that's just like you.) | | 余裕のある大人の男性。親しい相手に対して、くだけた雰囲気で呆れながらも楽しそうに話している。 (A relaxed adult man. Speaking casually to someone close, exasperated but enjoying himself.) | |
| ゲホッ、ゲホッ🤧…ごめん、少し休ませて。😭今日はもう無理みたい。 (Cough, cough 🤧… Sorry, let me rest a little. 😭 I don't think I can manage any more today.) | | 体調が悪く、非常に苦しそうな若い女性。息も絶え絶えに、申し訳なさそうに弱々しい声で話している。 (A young woman who is unwell and in great distress. Short of breath, she speaks apologetically in a weak voice.) | |

---

## 🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

👉 **[GitHub: Aratako/Irodori-TTS](https://github.com/Aratako/Irodori-TTS)**

## 📊 Training Data & Annotation

The model was trained on an expanded, high-quality Japanese speech dataset.
To enable the multi-modal Voice Design functionality, the training data was enriched with text captions describing the audio characteristics. The emoji annotations and initial text captions were generated using a fine-tuned model based on [Qwen/Qwen3-Omni-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct). The text captions were then rephrased and refined using [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B).

## ⚠️ Limitations

- **Japanese Only:** This model currently supports Japanese text input only.
- **Conditioning Conflicts:** When using *both* Reference Audio and a Text Caption, contradictory instructions (e.g., a deep male reference voice with the caption "a high-pitched young girl") may result in unstable audio quality, unnatural artifacts, or one condition overriding the other. For best results, use the caption to guide the *emotion, style, or environment*, while keeping the base voice characteristics aligned with the reference audio.
- **Prompt Adherence:** While the model generally follows the caption's instructions, highly complex or contradictory descriptions might result in inconsistent voice generation.
- **Emoji Control:** While emoji-based style control adds expressiveness, the effect may vary depending on context and is not always consistent.
- **Kanji Reading Accuracy:** The model's ability to read Kanji accurately is relatively weak compared to other TTS models of a similar size. You may need to convert complex Kanji into Hiragana or Katakana beforehand.

## 📜 License & Ethical Restrictions

### License

This model is released under **[MIT](https://choosealicense.com/licenses/mit/)**.

### Ethical Restrictions

In addition to the license terms, the following ethical restrictions apply:

1. **No Impersonation:** Do not use this model to clone or impersonate the voice of any individual (e.g., voice actors, celebrities, public figures) without their explicit consent.
2. **No Misinformation:** Do not use this model to generate deepfakes or synthetic speech intended to mislead others or spread misinformation.
3. **Voice Generation Disclaimer:** When generating speech purely from text or captions without reference audio, the generated voice may coincidentally resemble that of a real person. This is strictly a probabilistic artifact within the latent space. The model was not trained with the intent of reproducing specific individuals.
4. **Liability Disclaimer:** The developers assume no liability for any misuse of this model. Users are solely responsible for ensuring their use of the generated content complies with applicable laws and regulations in their jurisdiction.

## 🙏 Acknowledgments

This project builds upon the following works:

- [Echo-TTS](https://jordandarefsky.com/blog/2025/echo/): Architecture and training design reference
- [DACVAE](https://github.com/facebookresearch/dacvae): Audio VAE
- [llm-jp/llm-jp-3-150m](https://huggingface.co/llm-jp/llm-jp-3-150m): Tokenizer and embedding weight initialization
- [SilentCipher](https://github.com/sony/silentcipher): Audio watermarking integration

We would also like to extend our special thanks to **[Respair](https://huggingface.co/Respair)** for the inspiration behind the emoji annotation feature, and to [gabrielclark3330](https://huggingface.co/gabrielclark3330) for supporting this project.

## 🖊️ Citation

If you use Irodori-TTS in your research or project, please cite it as follows:

```bibtex
@misc{irodori-tts-v3-voicedesign,
  author = {Chihiro Arata},
  title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign}}
}
```