🍵 Matxa-TTS (Matcha-TTS) v2 Central Catalan Graphemes

Summary
Model description
Intended uses and limitations
How to Use
Training
Evaluation
Citation
Additional information

Summary

Here we present Matxa-TTS 🍵 v2: a Catalan multispeaker neural TTS model that works with graphemes. It is paired with a vocoder model to generate high-quality, expressive speech efficiently. You can use any compatible vocoder, such as 🥑 alVoCat or Wavenext.

Model Description

Matxa-TTS 🍵 is based on Matcha-TTS, an encoder-decoder architecture designed for fast acoustic modelling in TTS. The encoder consists of a text encoder and a phoneme duration predictor, which together predict averaged acoustic features. The decoder is essentially a U-Net backbone inspired by Grad-TTS, built with Transformer blocks. Replacing the 2D CNNs with 1D CNNs in the decoder greatly reduces memory consumption and speeds up synthesis.

Matxa-TTS is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of generating high-quality output in fewer synthesis steps than models trained using score matching.
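As a rough illustration of the OT-CFM objective, the sketch below builds one training pair following the formulation in the Matcha-TTS paper: a point x_t on the straight path between a noise sample x0 and a data sample x1, plus the target vector field u_t that the decoder is trained to regress. The function name `ot_cfm_pair` and the `sigma_min` default are illustrative assumptions and are not taken from the Matxa-TTS repository.

```python
import random

def ot_cfm_pair(x1, t, sigma_min=1e-4, rng=random):
    """Return (x_t, u_t) for one OT-CFM training example.

    x1: list of floats (a data sample, e.g. flattened mel values)
    t: flow time in [0, 1]
    sigma_min: small constant; this default is an assumption
    rng: any object exposing gauss(mu, sigma), e.g. random.Random(seed)
    """
    # Noise sample drawn from a standard normal distribution
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    # Point on the straight path from x0 (t=0) towards x1 (t=1)
    x_t = [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    # Target vector field: it does not depend on t, so the path is a straight line
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t
```

Because the target field is constant along each path, an ODE solver can follow it with few steps, which is the intuition behind the reduced number of synthesis steps.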

Modifications

The decoder's number of channels and attention head dimension were increased to 512 and 128, respectively.
Two timestep sampling techniques for diffusion were added: (1) stratified sampling, where each batch sample is assigned a different segment of [0,1) to guarantee uniform coverage of diffusion time; and (2) a cosine timestep scheduler.
A denoiser is applied to the generated spectrogram during inference.
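The two timestep sampling techniques can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, and the exact cosine mapping shown here (`1 - cos(pi * t / 2)`) is an assumption about the form of the scheduler.

```python
import math
import random

def stratified_timesteps(batch_size, rng=random):
    """Draw one timestep per batch item, each from its own slice of [0, 1).

    Item i receives a value in [i / batch_size, (i + 1) / batch_size),
    so every batch covers the whole time range evenly.
    """
    return [(i + rng.random()) / batch_size for i in range(batch_size)]

def cosine_warp(t):
    """Map a uniform t in [0, 1] through a cosine curve (assumed form).

    The mapping keeps 0 and 1 fixed and places more points near t = 0.
    """
    return 1.0 - math.cos(0.5 * math.pi * t)

# Example: 8 stratified timesteps for one batch, then warped
timesteps = stratified_timesteps(8, rng=random.Random(0))
warped = [cosine_warp(t) for t in timesteps]
```

Compared with drawing every timestep independently from a uniform distribution, the stratified version avoids batches in which all samples happen to cluster in one part of the time range.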

For more details, please check the v2 branch of our GitHub repository.

Intended Uses and Limitations

This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language. It has been fine-tuned on Catalan text only; if it is used for other languages, its output will not yield intelligible speech once it is converted into a waveform.

The quality of the samples can vary depending on the speaker. This may be due to the sensitivity of the model in learning specific frequencies and also due to the quality of samples for each speaker.

For more information see the licenses section under Additional information.

How to Use

Installation

Create a virtual environment:

python -m venv /path/to/venv
source /path/to/venv/bin/activate

Clone the repository:

git clone -b v2 --single-branch https://github.com/langtech-bsc/Matcha-TTS.git
cd Matcha-TTS

Install the package from source:

pip install -e .

Test on Colab (ONNX)

Try out Matxa-TTS v2 with graphemes with AlVoCat!

Note that any numbers in the input text must be normalized (written out as words) before synthesis!
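A simple pre-flight check can flag inputs that still contain digits before they reach the model. The helper below is a hypothetical sketch: it only detects digit characters and does not expand them into Catalan words, which still has to be done by a separate text normalizer.

```python
import re

# Matches any single digit character
_DIGITS = re.compile(r"\d")

def needs_normalization(text):
    """Return True if the text still contains digit characters."""
    return _DIGITS.search(text) is not None

# Example: the first sentence must be normalized, the second is ready
print(needs_normalization("Tinc 3 gats"))
print(needs_normalization("Tinc tres gats"))
```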

Training

This model has been trained in two stages:

A pre-training stage using a subset of Catalan data from CommonVoice (CV). More specifically, we used the annotated subset of CV 17, which includes the same Catalan accents as the final fine-tuning dataset. We created a smaller, balanced version in terms of accent distribution. Since the Northern accent had only around 2 hours of data, we retained all its samples. In total, we trained on 36.2 hours of data: Balearic (7.27h), Central (9.72h), Northern (1.49h), North-Western (9.21h), and Valencian (8.53h); with final hyperparameters:

Learning Rate → 1e-4
Scheduler → no
Batch Size → 32
GPUs → 2
Num. Steps → 150K (~12 hours with an H100)

A fine-tuning stage using Openslr69 and Festcat. The first two layers of Matxa's text encoder, together with the prenet and the embeddings, were frozen, as they had already learned accent-specific phoneme sequences during pre-training. Given the smaller dataset, this lets the model focus on the decoder and on refining the speaker embeddings for improved quality. The hyperparameters in this case are:

Learning Rate → 5e-4
Scheduler → Cosine Annealing with a minimum learning rate 5e-5
Batch Size → 32
GPUs → 2
Num. Steps → 12K (~2 hours with an H100)
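To make the fine-tuning schedule concrete, the sketch below evaluates the standard cosine annealing formula with the values listed above (5e-4 down to 5e-5). It assumes a single annealing period spanning all 12K steps with no warm restarts, which the hyperparameter list does not state; the function name is illustrative.

```python
import math

def cosine_annealing_lr(step, total_steps=12_000, lr_max=5e-4, lr_min=5e-5):
    """Learning rate at a given step under cosine annealing.

    Assumes one annealing period covering all training steps.
    """
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

# Learning rate at the start, midpoint, and end of fine-tuning
for s in (0, 6_000, 12_000):
    print(s, cosine_annealing_lr(s))
```

Under these assumptions the rate starts at 5e-4, passes through the midpoint value 2.75e-4 halfway through training, and ends at 5e-5.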

Evaluation

We evaluated the model in terms of quality and intelligibility using a test set of 350 sentences extracted from La creació d'Eva i altres contes by Josep Carner i Puig-Oriol, available on Project Gutenberg. Speech quality was assessed using ScoreQ, a recent approach for evaluating synthesized speech. Intelligibility was measured by computing Word Error Rate (WER) and Character Error Rate (CER) using the whisper-bsc-large-v3-cat model, which was trained and evaluated on multi-accented Catalan data.
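For readers unfamiliar with the intelligibility metrics, the sketch below computes WER and CER from a reference sentence and an ASR transcript using the Levenshtein edit distance. It is a minimal, dependency-free illustration and not the evaluation script used for the reported numbers; it applies no text normalization (casing, punctuation), which a real evaluation would typically include.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted word out of four gives a WER of 0.25
print(wer("el gat negre dorm", "el gat negra dorm"))
```

Both functions expect a non-empty reference; an empty reference would cause a division by zero.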

We used AlVoCat, already trained on Catalan data, as the vocoder. For comparison, we evaluated the previous model, Matxa-TTS-multispeaker, on the same set of sentences. Our updated model, Matxa-TTS-v2-ca-central-graphemes, achieves better scores in both quality and intelligibility.

Model	ScoreQ
Matxa-TTS-multispeaker	2.96
Matxa-TTS-v2-ca-central-graphemes	3.32

Model	Mean WER	Mean CER
Matxa-TTS-multispeaker	9.03	3.40
Matxa-TTS-v2-ca-central-graphemes	6.84	2.09

Citation

If this code contributes to your research, please cite the work:

@misc{LangtechVeu2025matxattsv2multispeaker,
      title={Catalan Multispeaker Text-to-Speech: Matxa-TTS v2}, 
      author={Jose Giraldo and Alex Peiró-Lilja},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/BSC-LT/matxa-tts-v2-ca-central-graphemes},
      year={2025}
}

Additional Information

Author

The Speech Team at the AI Institute, Barcelona Supercomputing Center.

Contact

For further information, please send an email to bsc-lt@bsc.es.

Copyright

License

GPL-3.0

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337.

The conversion of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.


Papers

SCOREQ: Speech Quality Assessment with Contrastive Regression. arXiv:2410.06675, published Oct 9, 2024.

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech. arXiv:2105.06337, published May 13, 2021.
