mozilla-foundation/common_voice_17_0
The point of this model was to see whether it was possible to train a multilingual model with very little data. The quality of the results varies from language to language: Spanish is fluent, Finnish and Swedish are accented but okay, Turkish is not great, and French is a total mess. The quality of Estonian and Hungarian is unknown.
Training was conducted using a subset of 3000 samples from each language in the Common Voice dataset.
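The per-language subsampling can be sketched roughly as follows. This is a minimal illustration, not the script used for this model: the `(language, clip_id)` row format, the function name `sample_per_language`, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_per_language(rows, per_language=3000, seed=0):
    """Pick up to `per_language` random clips for each language.

    `rows` is an iterable of (language, clip_id) pairs; this layout is
    hypothetical and stands in for whatever metadata the dataset provides.
    """
    by_language = defaultdict(list)
    for language, clip_id in rows:
        by_language[language].append(clip_id)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = {}
    for language in sorted(by_language):
        clips = by_language[language]
        # Languages with fewer clips than requested keep everything they have.
        k = min(per_language, len(clips))
        subset[language] = rng.sample(clips, k)
    return subset
```

Sampling the same number of clips per language keeps the languages balanced, at the cost of discarding most of the data for the larger ones.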
- --learning_rate
- "0.0001"
- --batch_size_per_gpu
- "2000"
- --batch_size_type
- frame
- --max_samples
- "96"
- --grad_accumulation_steps
- "16"
- --max_grad_norm
- "0.3"
- --epochs
- "200"
- --num_warmup_updates
- "5000"
- --save_per_updates
- "10000"
- --keep_last_n_checkpoints
- "-1"
- --last_per_updates
- "5000"
- --tokenizer
- custom
- --bnb_optimizer
{dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4}
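To put the batch settings above in perspective, the following sketch works out how much audio goes into one optimizer update. With `--batch_size_type frame`, the batch size counts mel frames rather than clips. The 24 kHz sample rate and hop length of 256 are assumptions taken from the usual F5-TTS defaults, and the number of GPUs is not stated here, so it is left as a parameter.

```python
# Values taken from the training arguments listed above.
BATCH_FRAMES_PER_GPU = 2000
GRAD_ACCUMULATION_STEPS = 16

# Assumed audio settings (typical F5-TTS defaults, not stated in this card).
SAMPLE_RATE_HZ = 24_000
HOP_LENGTH = 256


def frames_per_update(num_gpus=1):
    """Mel frames contributing to a single optimizer update."""
    return BATCH_FRAMES_PER_GPU * GRAD_ACCUMULATION_STEPS * num_gpus


def frames_to_seconds(frames):
    """Convert a mel-frame count to seconds of audio."""
    return frames * HOP_LENGTH / SAMPLE_RATE_HZ


if __name__ == "__main__":
    frames = frames_per_update(num_gpus=1)
    print(f"frames per update: {frames}")
    print(f"audio per update: {frames_to_seconds(frames):.1f} s")
```

Under these assumptions, one GPU batch holds a little over 21 seconds of audio, and gradient accumulation raises that to roughly 341 seconds per update on a single GPU.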
Thanks to Amos Wallgren, Calvin Guillot and Begüm Çelik for quality assurance.
Base model
SWivid/F5-TTS