Instructions to use kyutai/tts-1.6b-en_fr with libraries, inference providers, notebooks, and local apps.

How to use kyutai/tts-1.6b-en_fr with Moshi:

```shell
# pip install moshi

# Run the interactive web server
python -m moshi.server --hf-repo "kyutai/tts-1.6b-en_fr"
# Then open https://localhost:8998 in your browser
```

```python
# pip install moshi
import torch
from moshi.models import loaders

# Load checkpoint info from HuggingFace
checkpoint = loaders.CheckpointInfo.from_hf_repo("kyutai/tts-1.6b-en_fr")

# Load the Mimi audio codec
mimi = checkpoint.get_mimi(device="cuda")
mimi.set_num_codebooks(8)

# Encode audio (24kHz, mono)
wav = torch.randn(1, 1, 24000 * 10)  # [batch, channels, samples]
with torch.no_grad():
    codes = mimi.encode(wav.cuda())
    decoded = mimi.decode(codes)
```

Update README.md
README.md CHANGED

@@ -10,8 +10,9 @@ tags:
 ---
 # Model Card for Kyutai TTS

-See also the [project page](https://kyutai.org/next/tts)
-

 This is a model for streaming text-to-speech (TTS).
 Unlike offline text-to-speech, where the model needs the entire text to produce the audio,

@@ -21,8 +22,9 @@ our model starts to output audio as soon as the first few words from the text ha

 The model architecture is a hierarchical Transformer that consumes tokenized text and generates audio tokenized by Mimi,
 see [the Moshi paper](https://arxiv.org/abs/2410.00037).
-The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens.
-

 ## Model Description

@@ -40,7 +42,17 @@ The text stream is shifted w.r.t. the audio stream to allow the model to predict

 ### Direct Use

-

 ## How to Get Started with the Model

@@ -48,16 +60,18 @@ See the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-model

 ## Training Details

 ### Training Data

 Pretraining stage: we use an audio collection of 2.5 million hours of publicly available audio content.
-For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped)

-- Finetuning stage: TODO

 ### Compute Infrastructure

-Pretraining and finetuning was done with

 ## Model Card Authors

 ---
 # Model Card for Kyutai TTS

+See also the [project page](https://kyutai.org/next/tts),
+the [Colab example](https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/tts_pytorch.ipynb),
+and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/). A pre-print research paper is coming soon!

 This is a model for streaming text-to-speech (TTS).
 Unlike offline text-to-speech, where the model needs the entire text to produce the audio,


 The model architecture is a hierarchical Transformer that consumes tokenized text and generates audio tokenized by Mimi,
 see [the Moshi paper](https://arxiv.org/abs/2410.00037).
+The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens, although you can use fewer tokens at inference time for faster generation.
+The backbone model has 1B parameters, and the depth transformer has 600M parameters and uses partial weight sharing similar to [Hibiki](https://arxiv.org/abs/2502.03382).
+The audio is shifted by 16 steps (1.28 s) with respect to the text, and the model uses an acoustic/semantic delay of 2.
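
The figures above imply a few derived quantities. The short sketch below works them out; it assumes 24 kHz audio (the rate Mimi operates on), and the variable and function names are illustrative rather than part of any library API.

```python
# Derived quantities from the stated frame rate, token count, and text/audio shift.
FRAME_RATE_HZ = 12.5      # audio frames per second (from the model card)
TOKENS_PER_FRAME = 32     # audio tokens per frame (from the model card)
AUDIO_SHIFT_STEPS = 16    # audio delay relative to the text, in frames
SAMPLE_RATE_HZ = 24_000   # assumption: 24 kHz mono audio, as used by Mimi

frame_duration_s = 1 / FRAME_RATE_HZ                 # seconds of audio per frame
shift_seconds = AUDIO_SHIFT_STEPS / FRAME_RATE_HZ    # delay between text and audio
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ   # waveform samples per frame


def audio_tokens_per_second(tokens_per_frame: int = TOKENS_PER_FRAME) -> float:
    """Audio tokens generated per second of speech for a given tokens-per-frame setting."""
    return FRAME_RATE_HZ * tokens_per_frame


print(f"frame duration: {frame_duration_s * 1000:.0f} ms")
print(f"text-to-audio shift: {shift_seconds:.2f} s")
print(f"samples per frame: {samples_per_frame:.0f}")
print(f"audio tokens per second (32 per frame): {audio_tokens_per_second():.0f}")
# Hypothetical reduced setting, to illustrate the "fewer tokens" option:
print(f"audio tokens per second (16 per frame): {audio_tokens_per_second(16):.0f}")
```

Using fewer tokens per frame lowers the number of audio tokens generated per second of speech proportionally, which is where the faster generation comes from.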

 ## Model Description


 ### Direct Use

+This model can perform streaming text-to-speech generation, including dialogs.
+The model supports voice conditioning through pre-computed embeddings consumed via cross-attention,
+which are provided for a number of voices in our [tts-voices](https://huggingface.co/kyutai/tts-voices) repository.
+This model does not support classifier-free guidance (CFG) directly, but was trained with *CFG distillation* for improved speed (no need to double the batch size).
+It is easy to batch and can reach a throughput of 75x real time, i.e. 75 seconds of generated audio per second of compute.
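
For context on the CFG remark: standard classifier-free guidance evaluates the model twice per step, once with and once without conditioning, and extrapolates between the two sets of logits. That second pass is why plain CFG doubles the effective batch size. The sketch below shows the generic combination rule on plain Python lists; `cfg_logits`, the toy logits, and the guidance scale are illustrative and are not taken from the Kyutai code.

```python
def cfg_logits(cond, uncond, scale):
    """Generic classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy logits over a 3-token vocabulary (made-up numbers).
cond = [2.0, 0.5, -1.0]    # forward pass with the conditioning
uncond = [1.0, 0.5, 0.0]   # forward pass without the conditioning

guided = cfg_logits(cond, uncond, scale=2.0)
print(guided)

# A scale of 1.0 recovers the conditional logits unchanged.
assert cfg_logits(cond, uncond, scale=1.0) == cond
```

A CFG-distilled model is trained to produce the guided prediction directly, so inference needs only a single forward pass per step.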
+
+This model does not perform watermarking for two reasons:
+- watermarking can easily be deactivated for open source models,
+- our early experiments show that all watermark systems used by existing TTS systems are removed by simply encoding and decoding the audio with Mimi.
+
+Instead, we preferred to restrict the voice cloning ability to the use of pre-computed voice embeddings.

 ## How to Get Started with the Model


 ## Training Details

+The model was trained for 750k steps, with a batch size of 64, and a segment duration of 120 seconds.
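
As a rough sense of scale, the stated schedule can be converted into hours of audio seen during training. This is back-of-the-envelope arithmetic only: it assumes every step processes a full batch of full-length segments, which the model card does not state explicitly, and it reuses the 12.5 Hz frame rate given earlier.

```python
STEPS = 750_000          # training steps (from the model card)
BATCH_SIZE = 64          # segments per step (from the model card)
SEGMENT_SECONDS = 120    # duration of each segment (from the model card)
FRAME_RATE_HZ = 12.5     # audio frames per second (from the model card)

segments_seen = STEPS * BATCH_SIZE
audio_hours_seen = segments_seen * SEGMENT_SECONDS / 3600
frames_per_segment = SEGMENT_SECONDS * FRAME_RATE_HZ

print(f"segments seen: {segments_seen:,}")
print(f"audio seen: {audio_hours_seen:,.0f} hours")
print(f"frames per segment: {frames_per_segment:.0f}")
```

Under these assumptions the total comes to about 1.6 million hours of audio.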
+
 ### Training Data

 Pretraining stage: we use an audio collection of 2.5 million hours of publicly available audio content.
+For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped)
+with `whisper-medium`.


 ### Compute Infrastructure

+Pretraining and finetuning were done with 32 NVIDIA H100 GPUs.

 ## Model Card Authors
