Instructions to use kyutai/tts-1.6b-en_fr with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Moshi
How to use kyutai/tts-1.6b-en_fr with Moshi:
```shell
# pip install moshi

# Run the interactive web server
python -m moshi.server --hf-repo "kyutai/tts-1.6b-en_fr"
# Then open https://localhost:8998 in your browser
```

```python
# pip install moshi
import torch
from moshi.models import loaders

# Load checkpoint info from HuggingFace
checkpoint = loaders.CheckpointInfo.from_hf_repo("kyutai/tts-1.6b-en_fr")

# Load the Mimi audio codec
mimi = checkpoint.get_mimi(device="cuda")
mimi.set_num_codebooks(8)

# Encode audio (24kHz, mono)
wav = torch.randn(1, 1, 24000 * 10)  # [batch, channels, samples]
with torch.no_grad():
    codes = mimi.encode(wav.cuda())
    decoded = mimi.decode(codes)
```
- Notebooks
- Google Colab
- Kaggle
| license: cc-by-4.0 | |
| language: | |
| - en | |
| - fr | |
| library_name: moshi | |
| pipeline_tag: text-to-speech | |
| tags: | |
| - audio | |
| # Model Card for Kyutai TTS | |
| See also the [pre-print research paper](https://arxiv.org/abs/2509.08753), | |
| the [project page](https://kyutai.org/next/tts), | |
| the [Colab example](https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/tts_pytorch.ipynb), | |
| the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/), | |
| and the [repository of voices](https://huggingface.co/kyutai/tts-voices). | |
| This is a model for streaming text-to-speech (TTS). | |
| Unlike offline text-to-speech, where the model needs the entire text to produce the audio, | |
| our model starts to output audio as soon as the first few words from the text have been given as input. | |
| This model is actually 1.8B parameters, not 1.6B as the name might suggest. | |
| ## Model Details | |
| The model architecture is a hierarchical Transformer that consumes tokenized text and generates audio tokenized by Mimi, | |
| see [the Moshi paper](https://arxiv.org/abs/2410.00037). | |
| The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens, although you can use fewer tokens at inference time for faster generation. | |
| The backbone model has 1B parameters, and the depth transformer has 600M parameters and uses partial weight sharing, similar to [Hibiki](https://arxiv.org/abs/2502.03382). | |
| The audio is shifted by 16 steps (1.28 sec.) with respect to the text, and the model uses an acoustic/semantic delay of 2. | |
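The figures above can be checked with a few lines of arithmetic. This is a minimal sketch that only uses the numbers quoted in this section (12.5 Hz, 32 tokens per frame, a 16-step shift); the function names are hypothetical helpers and are not part of the `moshi` package.

```python
FRAME_RATE_HZ = 12.5     # audio frames per second, as stated in the model card
TOKENS_PER_FRAME = 32    # audio tokens per frame at full quality
AUDIO_DELAY_STEPS = 16   # shift of the audio stream relative to the text


def frame_duration_s(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration of one audio frame in seconds."""
    return 1.0 / frame_rate_hz


def tokens_per_second(tokens_per_frame: int = TOKENS_PER_FRAME,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Number of audio tokens generated per second of audio."""
    return tokens_per_frame * frame_rate_hz


def delay_seconds(steps: int = AUDIO_DELAY_STEPS,
                  frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Convert a delay expressed in frames into seconds."""
    return steps / frame_rate_hz


if __name__ == "__main__":
    print(f"frame duration: {frame_duration_s():.2f} s")
    print(f"tokens per second (32 per frame): {tokens_per_second():.0f}")
    print(f"tokens per second (8 per frame): {tokens_per_second(8):.0f}")
    print(f"audio delay: {delay_seconds():.2f} s")
```

Using fewer tokens per frame reduces the number of tokens to generate per second of audio proportionally, which is where the faster generation comes from.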
| ## Model Description | |
| Kyutai TTS is a decoder-only model for streaming text-to-speech. | |
| It leverages the multistream architecture of [Moshi](https://moshi.chat/) to model the audio stream conditioned on the text stream. | |
| The audio stream is shifted w.r.t. the text stream to allow the model to predict audio tokens based on the input text. | |
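A shift between two time-aligned streams can be pictured with a toy example. The sketch below is an illustration only, not the model's code: it delays the audio stream relative to the text stream, as stated in the Model Details section, so that each audio position sees text from earlier steps. The helper names, the `PAD` placeholder, and the delay of 2 (chosen for readability; the card states 16 steps) are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

PAD = None  # hypothetical placeholder for positions where a stream has no token


def delay_stream(stream: Sequence[str], delay: int) -> List[Optional[str]]:
    """Shift a stream to the right by `delay` steps, padding the start."""
    return [PAD] * delay + list(stream)


def align(text: Sequence[str], audio: Sequence[str],
          delay: int) -> List[Tuple[Optional[str], Optional[str]]]:
    """Pair text and delayed audio tokens step by step."""
    delayed_audio = delay_stream(audio, delay)
    length = max(len(text), len(delayed_audio))
    padded_text = list(text) + [PAD] * (length - len(text))
    padded_audio = delayed_audio + [PAD] * (length - len(delayed_audio))
    return list(zip(padded_text, padded_audio))


if __name__ == "__main__":
    steps = align(["t0", "t1", "t2", "t3"], ["a0", "a1", "a2", "a3"], delay=2)
    for step, (text_token, audio_token) in enumerate(steps):
        print(step, text_token, audio_token)
```

With a delay of 2, the audio token `a0` is emitted at step 2, after the text tokens `t0` and `t1` have already been consumed.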
| * Developed by: Kyutai | |
| * Model type: Streaming Text-To-Speech. | |
| * Language(s) (NLP): English and French | |
| * License: Model weights are licensed under CC-BY 4.0 | |
| * Repository: [GitHub](https://github.com/kyutai-labs/delayed-streams-modeling/) | |
| ## Uses | |
| ### Direct Use | |
| This model is able to perform streaming text-to-speech generation, including dialogs. | |
| The model supports voice conditioning through pre-computed embeddings that are consumed via cross-attention, | |
| which are provided for a number of voices in our [tts-voices](https://huggingface.co/kyutai/tts-voices) repository. | |
| This model does not support Classifier-Free Guidance (CFG) directly, but was trained with *CFG distillation* for improved speed (no need to double the batch size). | |
| It is easy to batch and can reach a throughput of 75 seconds of generated audio per second of compute. | |
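To see why plain CFG doubles the batch size, here is a minimal sketch of the general CFG combination rule. It describes the standard technique, not Kyutai's implementation: the function name, the toy logits, and the guidance coefficient are assumptions, and the distillation procedure itself is not described in this card.

```python
from typing import List, Sequence


def cfg_combine(cond_logits: Sequence[float],
                uncond_logits: Sequence[float],
                alpha: float) -> List[float]:
    """Standard CFG rule: uncond + alpha * (cond - uncond), element-wise.

    Producing both inputs requires two forward passes (or a doubled batch):
    one with the conditioning and one without.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + alpha * (c - u) for c, u in zip(cond_logits, uncond_logits)]


if __name__ == "__main__":
    cond = [2.0, 0.0, -1.0]    # toy logits with conditioning
    uncond = [1.0, 0.5, -1.0]  # toy logits without conditioning
    print(cfg_combine(cond, uncond, alpha=2.0))
```

A CFG-distilled model is trained to output the combined logits directly from a single conditioned pass, so inference needs only one forward pass per step.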
| This model does not perform watermarking for two reasons: | |
| - watermarking can easily be deactivated for open source models, | |
| - our early experiments show that all watermark systems used by existing TTS are removed by simply encoding and decoding the audio with Mimi. | |
| Instead, we preferred to restrict the voice cloning ability to the use of pre-computed voice embeddings. | |
| ## How to Get Started with the Model | |
| See the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/). | |
| ## Training Details | |
| The model was trained for 750k steps, with a batch size of 64, and a segment duration of 120 seconds. Then, CFG distillation was performed for 24k updates. | |
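As a rough order of magnitude, the pretraining numbers above translate into a total amount of audio seen. The sketch assumes that the batch size counts segments per step and that every segment lasts the full 120 seconds; neither assumption is confirmed by the card, and the function name is a hypothetical helper.

```python
def audio_seconds_seen(steps: int, batch_size: int, segment_s: int) -> int:
    """Total seconds of audio processed, assuming full-length segments."""
    return steps * batch_size * segment_s


if __name__ == "__main__":
    seconds = audio_seconds_seen(steps=750_000, batch_size=64, segment_s=120)
    hours = seconds / 3600
    print(f"{seconds:,} s of audio, about {hours:,.0f} hours")
```

Under these assumptions, pretraining processes about 1.6 million hours of audio.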
| ### Training Data | |
| Pretraining stage: we use an audio collection of 2.5 million hours of publicly available audio content. | |
| For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped) | |
| with `whisper-medium`. | |
| ### Compute Infrastructure | |
| Pretraining was done with 32 Nvidia H100 GPUs. CFG distillation was done on 8 such GPUs. | |
| ## Model Card Authors | |
| Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez |