Tags: Text-to-Speech, Moshi, English, French, tts, audio
Commit 172acde (verified) by adefossez · 1 parent: f2feada

Update README.md

Files changed (1): README.md (+22 -8)
@@ -10,8 +10,9 @@ tags:
  ---
  # Model Card for Kyutai TTS

- See also the [project page](https://kyutai.org/next/tts)
- and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).
+ See also the [project page](https://kyutai.org/next/tts),
+ the [Colab example](https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/tts_pytorch.ipynb),
+ and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/). Pre-print research paper is coming soon!

  This is a model for streaming text-to-speech (TTS).
  Unlike offline text-to-speech, where the model needs the entire text to produce the audio,
@@ -21,8 +22,9 @@ our model starts to output audio as soon as the first few words from the text ha

  The model architecture is a hierarchical Transformer that consumes tokenized text and generateds audio tokenized by Mimi,
  see [the Moshi paper](https://arxiv.org/abs/2410.00037).
- The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens.
-
+ The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens, although you can use less tokens at inference time for faster generation.
+ The backbone model is 1B parameters, and the depth transformer is 600M parameters and uses partial weight sharing similar to [Hibiki](https://arxiv.org/abs/2502.03382).
+ The audio is shifted by 16 steps (1.28 sec.) with respect to the text, and the model uses an acoustic/semantic delay of 2.

  ## Model Description

@@ -40,7 +42,17 @@ The text stream is shifted w.r.t. the audio stream to allow the model to predict

  ### Direct Use

- TODO
+ This model is able to perform streaming text-to-speech generation, including dialogs.
+ The model supports voice conditioning through cross-attention pre-computed embeddings,
+ which are provided for a number of voices in our [tts-voices](https://huggingface.co/kyutai/tts-voices) repository.
+ This model does not support Classifier Free Guidance (CFG) directly, but was trained with *CFG distillation* for improved speed (no need to double the batch size).
+ It is easy to batch and can reach a throughput of 75x generated audio per compute unit of time.
+
+ This model does not perform watermarking for two reasons:
+ - watermarking can easily be deactivated for open source models,
+ - our early experiments show that all watermark systems used by existing TTS are removed by simply encodeding and decoding the audio with Mimi.
+
+ Instead, we prefered to restrict the voice cloning ability to the use of pre-computed voice embeddings.

  ## How to Get Started with the Model

@@ -48,16 +60,18 @@ See the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-model

  ## Training Details

+ The model was trained for 750k steps, with a batch size of 64, and a segment duration of 120 seconds.
+
  ### Training Data

  Pretraining stage: we use an audio collection of 2.5 million hours of publicly available audio content.
- For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped).
+ For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped)
+ with `whisper-medium`.

- - Finetuning stage: TODO

  ### Compute Infrastructure

- Pretraining and finetuning was done with 48 and 16 H100 Nvidia GPUs, respectively.
+ Pretraining and finetuning was done with 32 H100 Nvidia GPUs.

  ## Model Card Authors

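
The figures this commit adds to the README can be cross-checked with a little arithmetic. The sketch below is not part of the commit or of the Kyutai codebase: the function names are hypothetical, and the training-audio total assumes that every item in every batch is a full 120-second segment, which the README does not state.

```python
# Sanity checks for the numbers quoted in the updated README.
# All function names are hypothetical helpers written for this illustration.

FRAME_RATE_HZ = 12.5          # frames per second, from the README
TOKENS_PER_FRAME = 32         # audio tokens per frame, from the README
AUDIO_SHIFT_STEPS = 16        # audio delay relative to text, in frames


def tokens_per_second(frame_rate_hz: float, tokens_per_frame: int) -> float:
    """Audio tokens needed to represent one second of audio."""
    return frame_rate_hz * tokens_per_frame


def shift_seconds(steps: int, frame_rate_hz: float) -> float:
    """Duration of a shift expressed in frames, converted to seconds."""
    return steps / frame_rate_hz


def training_audio_hours(steps: int, batch_size: int, segment_seconds: float) -> float:
    """Upper bound on audio seen in training.

    Assumption: every batch item is a full-length segment.
    """
    return steps * batch_size * segment_seconds / 3600.0


if __name__ == "__main__":
    print("tokens per second:", tokens_per_second(FRAME_RATE_HZ, TOKENS_PER_FRAME))
    print("text-to-audio shift (s):", shift_seconds(AUDIO_SHIFT_STEPS, FRAME_RATE_HZ))
    print("training audio (hours):", training_audio_hours(750_000, 64, 120.0))
```

The shift computation reproduces the 1.28 seconds quoted next to the 16-step delay, which is consistent with the stated 12.5 Hz frame rate (80 ms per frame).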