Update README.md
README.md
CHANGED
|
@@ -7,14 +7,12 @@ tags:
|
|
| 7 |
- acoustic modelling
|
| 8 |
- speech
|
| 9 |
- multispeaker
|
| 10 |
pipeline_tag: text-to-speech
|
| 11 |
license: cc-by-nc-4.0
|
| 12 |
-
datasets:
|
| 13 |
-
- projecte-aina/festcat_trimmed_denoised
|
| 14 |
-
- openslr
|
| 15 |
---
|
| 16 |
|
| 17 |
-
# Matcha-TTS Catalan Multiaccent
|
| 18 |
|
| 19 |
## Table of Contents
|
| 20 |
<details>
|
|
@@ -32,23 +30,27 @@ datasets:
|
|
| 32 |
|
| 33 |
## Model Description
|
| 34 |
|
| 35 |
-
**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS.
|
| 36 |
The encoder part is based on a text encoder and a phoneme duration predictor that together predict averaged acoustic features.
|
| 37 |
And the decoder has essentially a U-Net backbone inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), which is based on the Transformer architecture.
|
| 38 |
In the latter, by replacing 2D CNNs with 1D CNNs, a large reduction in memory consumption and fast synthesis are achieved.
|
| 39 |
|
| 40 |
-
**
|
| 41 |
This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps than models trained using score matching.
|
| 42 |
|
| 43 |
## Intended Uses and Limitations
|
| 44 |
|
| 45 |
This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language.
|
| 46 |
-
It has been finetuned using a Catalan phonemizer, therefore if the model is used for other languages it
|
| 47 |
its output into a speech waveform.
|
| 48 |
|
| 49 |
The quality of the samples can vary depending on the speaker.
|
| 50 |
This may be due to the sensitivity of the model in learning specific frequencies and also due to the quality of samples for each speaker.
|
| 51 |
|
| 52 |
## How to Get Started with the Model
|
| 53 |
|
| 54 |
### Installation
|
|
@@ -64,7 +66,7 @@ python -m venv /path/to/venv
|
|
| 64 |
source /path/to/venv/bin/activate
|
| 65 |
```
|
| 66 |
|
| 67 |
-
For training and
|
| 68 |
```bash
|
| 69 |
git clone https://github.com/projecte-aina/espeak-ng.git
|
| 70 |
|
|
@@ -97,8 +99,8 @@ pip install -e .
|
|
| 97 |
|
| 98 |
#### PyTorch
|
| 99 |
|
| 100 |
-
Speech end-to-end inference can be done together with **Catalan
|
| 101 |
-
Both models (Catalan
|
| 102 |
|
| 103 |
First, export the following environment variables to include the installed espeak-ng version:
|
| 104 |
|
|
@@ -142,7 +144,7 @@ The model was trained on a **Multiaccent Catalan** speech dataset
|
|
| 142 |
|
| 143 |
### Training procedure
|
| 144 |
|
| 145 |
-
***Multiaccent Catalan
|
| 146 |
|
| 147 |
The embedding layer was initialized with the number of Catalan speakers per accent (2) and the original hyperparameters were kept.
|
| 148 |
|
|
@@ -209,4 +211,4 @@ the voice artists. For further information, contact <langtech@bsc.es> and <lafre
|
|
| 209 |
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
|
| 210 |
|
| 211 |
Part of the training of the model was possible thanks to the compute time given by Galician Supercomputing Center CESGA
|
| 212 |
-
([Centro de Supercomputación de Galicia](https://www.cesga.es/)), and also by [Barcelona Supercomputing Center](https://www.bsc.es/) in MareNostrum 5.
|
|
|
|
| 7 |
- acoustic modelling
|
| 8 |
- speech
|
| 9 |
- multispeaker
|
| 10 |
+
- tts
|
| 11 |
pipeline_tag: text-to-speech
|
| 12 |
license: cc-by-nc-4.0
|
| 13 |
---
|
| 14 |
|
| 15 |
+
# Matxa-TTS (Matcha-TTS) Catalan Multiaccent
|
| 16 |
|
| 17 |
## Table of Contents
|
| 18 |
<details>
|
|
|
|
| 30 |
|
| 31 |
## Model Description
|
| 32 |
|
| 33 |
+
**Matxa-TTS** is based on **Matcha-TTS**, an encoder-decoder architecture designed for fast acoustic modelling in TTS.
|
| 34 |
The encoder part is based on a text encoder and a phoneme duration predictor that together predict averaged acoustic features.
|
| 35 |
And the decoder has essentially a U-Net backbone inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), which is based on the Transformer architecture.
|
| 36 |
In the latter, by replacing 2D CNNs with 1D CNNs, a large reduction in memory consumption and fast synthesis are achieved.
|
| 37 |
|
| 38 |
+
**Matxa-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
|
| 39 |
This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps than models trained using score matching.
|
| 40 |
|
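The ODE-based sampling mentioned in the Model Description can be illustrated with a minimal sketch. It integrates dx/dt = v(x, t) from t = 0 (noise) to t = 1 (output) with fixed-step Euler. Everything named here (`euler_sample`, `toy_vector_field`, the example vectors) is hypothetical and not part of the Matxa-TTS code base: the real decoder is a trained network conditioned on the encoder output, whereas the toy field below is the ideal straight-line field towards a single known target, assuming sigma_min = 0.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    vector_field: Callable[[Vector, float], Vector],
    x0: Vector,
    n_steps: int,
) -> Vector:
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_vector_field(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for the trained decoder.

    For a straight (optimal-transport) path from noise to one target point,
    the velocity at (x, t) is (target - x) / (1 - t).
    """
    def field(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return field


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]  # illustrative "acoustic features"
    noise = [1.2, 0.3, -0.7]   # illustrative starting sample
    for n_steps in (2, 10):
        out = euler_sample(toy_vector_field(target), noise, n_steps)
        error = max(abs(a - b) for a, b in zip(out, target))
        print(f"steps={n_steps} max_error={error:.2e}")
```

Because the toy path is a straight line, even two Euler steps land on the target up to floating-point error. This is only the intuition behind needing few synthesis steps; a learned vector field is an approximation and will not be exactly straight.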
| 41 |
## Intended Uses and Limitations
|
| 42 |
|
| 43 |
This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language.
|
| 44 |
+
It has been finetuned using a Catalan phonemizer; therefore, if the model is used for other languages, it will not produce intelligible samples after mapping
|
| 45 |
its output into a speech waveform.
|
| 46 |
|
| 47 |
The quality of the samples can vary depending on the speaker.
|
| 48 |
This may be due to the sensitivity of the model in learning specific frequencies and also due to the quality of samples for each speaker.
|
| 49 |
|
| 50 |
+
As explained in the licenses section, the models can be used only for non-commercial purposes. Any parties interested in using them
|
| 51 |
+
commercially need to contact the rights holders (the voice artists) to license their voices. For more information, see the licenses section
|
| 52 |
+
under [Additional information](#additional-information).
|
| 53 |
+
|
| 54 |
## How to Get Started with the Model
|
| 55 |
|
| 56 |
### Installation
|
|
|
|
| 66 |
source /path/to/venv/bin/activate
|
| 67 |
```
|
| 68 |
|
| 69 |
+
For training and synthesizing with Catalan Matxa-TTS you need to compile the provided espeak-ng with the Catalan phonemizer:
|
| 70 |
```bash
|
| 71 |
git clone https://github.com/projecte-aina/espeak-ng.git
|
| 72 |
|
|
|
|
| 99 |
|
| 100 |
#### PyTorch
|
| 101 |
|
| 102 |
+
Speech end-to-end inference can be done together with **Catalan Matxa-TTS**.
|
| 103 |
+
Both models (Catalan Matxa-TTS and alVoCat) are loaded remotely from the HF hub.
|
| 104 |
|
| 105 |
First, export the following environment variables to include the installed espeak-ng version:
|
| 106 |
|
|
|
|
| 144 |
|
| 145 |
### Training procedure
|
| 146 |
|
| 147 |
+
***Matxa Multiaccent Catalan*** was finetuned from a Central Catalan [multispeaker checkpoint](https://huggingface.co/BSC-LT/matcha-tts-cat-multispeaker), which was trained on 28 hours of data from multiple speakers.
|
| 148 |
|
| 149 |
The embedding layer was initialized with the number of Catalan speakers per accent (2) and the original hyperparameters were kept.
|
| 150 |
|
|
|
|
| 211 |
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
|
| 212 |
|
| 213 |
Part of the training of the model was possible thanks to the compute time given by Galician Supercomputing Center CESGA
|
| 214 |
+
([Centro de Supercomputación de Galicia](https://www.cesga.es/)), and also by [Barcelona Supercomputing Center](https://www.bsc.es/) in MareNostrum 5.
|