kyutai/tts-1.6b-en_fr

@@ -10,8 +10,9 @@ tags:
 ---
 # Model Card for Kyutai TTS
-See also the [project page](https://kyutai.org/next/tts)
-and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).
 This is a model for streaming text-to-speech (TTS).
 Unlike offline text-to-speech, where the model needs the entire text to produce the audio,
@@ -21,8 +22,9 @@ our model starts to output audio as soon as the first few words from the text ha
 The model architecture is a hierarchical Transformer that consumes tokenized text and generates audio tokenized by Mimi,
 see [the Moshi paper](https://arxiv.org/abs/2410.00037).
-The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens.
 ## Model Description
@@ -40,7 +42,17 @@ The text stream is shifted w.r.t. the audio stream to allow the model to predict
 ### Direct Use
-TODO
 ## How to Get Started with the Model
@@ -48,16 +60,18 @@ See the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-model
 ## Training Details
 ### Training Data
 Pretraining stage: we use an audio collection of 2.5 million hours of publicly available audio content.
-For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped).
-- Finetuning stage: TODO
 ### Compute Infrastructure
-Pretraining and finetuning were done with 48 and 16 H100 Nvidia GPUs, respectively.
 ## Model Card Authors

 ---
 # Model Card for Kyutai TTS
+See also the [project page](https://kyutai.org/next/tts),
+the [Colab example](https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/tts_pytorch.ipynb),
+and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/). A pre-print research paper is coming soon!
 This is a model for streaming text-to-speech (TTS).
 Unlike offline text-to-speech, where the model needs the entire text to produce the audio,
 The model architecture is a hierarchical Transformer that consumes tokenized text and generates audio tokenized by Mimi,
 see [the Moshi paper](https://arxiv.org/abs/2410.00037).
+The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens, although you can use fewer tokens at inference time for faster generation.
+The backbone model is 1B parameters, and the depth transformer is 600M parameters and uses partial weight sharing similar to [Hibiki](https://arxiv.org/abs/2502.03382).
+The audio is shifted by 16 steps (1.28 sec.) with respect to the text, and the model uses an acoustic/semantic delay of 2.
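The timing figures above follow directly from the frame rate. The sketch below only reproduces that arithmetic; the function and constant names are hypothetical and are not part of the Kyutai codebase.

```python
# Minimal sketch of the timing arithmetic stated in the model card.
# All names here are illustrative assumptions, not a Kyutai API.

FRAME_RATE_HZ = 12.5      # audio frames per second (from the model card)
TOKENS_PER_FRAME = 32     # audio tokens per frame (from the model card)
AUDIO_SHIFT_STEPS = 16    # audio delay w.r.t. text, in frames (from the model card)


def frame_duration_s(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration of one audio frame in seconds."""
    return 1.0 / frame_rate_hz


def audio_shift_s(steps: int = AUDIO_SHIFT_STEPS,
                  frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Delay between the text stream and the audio stream in seconds."""
    return steps / frame_rate_hz


def audio_tokens_per_second(tokens_per_frame: int = TOKENS_PER_FRAME,
                            frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Number of audio tokens produced per second of generated audio."""
    return tokens_per_frame * frame_rate_hz


if __name__ == "__main__":
    print(f"frame duration: {frame_duration_s() * 1000:.0f} ms")
    print(f"audio shift:    {audio_shift_s():.2f} s")
    print(f"tokens/second:  {audio_tokens_per_second():.0f}")
```

Passing a smaller `tokens_per_frame` to `audio_tokens_per_second` shows how using fewer tokens at inference time reduces the number of tokens generated per second of audio.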
 ## Model Description
 ### Direct Use
+This model is able to perform streaming text-to-speech generation, including dialogs.
+The model supports voice conditioning through cross-attention over pre-computed embeddings,
+which are provided for a number of voices in our [tts-voices](https://huggingface.co/kyutai/tts-voices) repository.
+This model does not support Classifier Free Guidance (CFG) directly, but was trained with *CFG distillation* for improved speed (no need to double the batch size).
+It is easy to batch and can reach a throughput of 75x real time, that is, 75 seconds of generated audio per second of compute.
+This model does not perform watermarking for two reasons:
+- watermarking can easily be deactivated for open source models,
+- our early experiments show that all watermark systems used by existing TTS are removed by simply encoding and decoding the audio with Mimi.
+Instead, we preferred to restrict the voice cloning ability to the use of pre-computed voice embeddings.
 ## How to Get Started with the Model
 ## Training Details
+The model was trained for 750k steps, with a batch size of 64, and a segment duration of 120 seconds.
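For a sense of scale, the training figures can be turned into an upper bound on the amount of audio seen during training. This is a rough sketch that assumes the batch size counts segments and that every segment is a full 120 seconds; neither assumption is confirmed by the model card, and the names below are hypothetical.

```python
# Rough upper bound on audio processed during training.
# Assumptions (not confirmed by the model card): the batch size counts
# segments, and every segment lasts the full 120 seconds.

TRAINING_STEPS = 750_000   # from the model card
BATCH_SIZE = 64            # from the model card
SEGMENT_SECONDS = 120      # from the model card


def total_audio_hours(steps: int = TRAINING_STEPS,
                      batch_size: int = BATCH_SIZE,
                      segment_seconds: int = SEGMENT_SECONDS) -> float:
    """Hours of audio seen if every batch holds full-length segments."""
    total_seconds = steps * batch_size * segment_seconds
    return total_seconds / 3600


if __name__ == "__main__":
    print(f"at most {total_audio_hours():,.0f} hours of audio seen")
```

Under these assumptions the bound is 1.6 million hours, which is the same order of magnitude as the 2.5 million hour pretraining collection.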
 ### Training Data
 Pretraining stage: we use an audio collection of 2.5 million hours of publicly available audio content.
+For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped)
+with `whisper-medium`.
 ### Compute Infrastructure
+Pretraining and finetuning were done with 32 H100 Nvidia GPUs.
 ## Model Card Authors