XTTS v2 - Antonio Oliveira (European Portuguese / PT-PT)
Model Description
This is a fine-tuned XTTS v2 model trained specifically for European Portuguese (PT-PT), the Portuguese spoken in Portugal. The base XTTS v2 model has a strong Brazilian Portuguese bias, which makes it difficult to generate authentic European Portuguese speech. This fine-tuned version addresses that limitation.
Key Features
- Authentic PT-PT accent - Trained on European Portuguese speech, not Brazilian
- Male voice - Natural male voice characteristics
- High quality - 200 epochs of training with optimized loss weights
- Drop-in replacement - Works with standard Coqui TTS / XTTS v2 code
- Multilingual capable - Can still generate speech in other XTTS-supported languages
Why This Model?
European Portuguese (PT-PT) TTS voices are extremely rare in open source. Most Portuguese TTS models are trained on Brazilian Portuguese (PT-BR), which has significantly different pronunciation, intonation, and rhythm. This model fills that gap for:
- Portuguese companies and developers
- Language learning applications
- Accessibility tools for Portugal
- Voice assistants targeting European Portuguese speakers
- Game localization for the Portuguese market
- Audiobook generation in PT-PT
Usage
Installation
pip install TTS
Quick Start (Python)
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
from huggingface_hub import hf_hub_download
import torch
import os
# Download model files
model_path = "Martim-Ramos-Neural/xtts-v2-antonio-oliveira-pt-pt"
local_dir = "./xtts-antonio-oliveira"
for filename in ["model.pth", "config.json", "vocab.json", "dvae.pth", "mel_stats.pth"]:
    hf_hub_download(repo_id=model_path, filename=filename, local_dir=local_dir)
# Load model
config = XttsConfig()
config.load_json(os.path.join(local_dir, "config.json"))
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=local_dir,
    use_deepspeed=False
)
model.cuda()  # requires a CUDA-capable GPU
# Generate speech
outputs = model.synthesize(
    text="Olá, bem-vindo a Portugal! Este é um exemplo de síntese de voz em Português Europeu.",
    config=config,
    speaker_wav="/path/to/reference.wav",  # Any 6+ second audio sample
    language="pt",
    gpt_cond_len=6,
    temperature=0.7,
)
# Save audio
import scipy.io.wavfile as wav
wav.write("output.wav", 24000, outputs["wav"])
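The Quick Start saves `outputs["wav"]` with scipy, which stores floating-point arrays as floating-point WAV data. If a 16-bit PCM file is preferred (the bit depth listed for the training data), the following is a minimal standard-library sketch. It assumes `outputs["wav"]` is a one-dimensional sequence of floats in [-1, 1]; the helper name `write_pcm16_wav` is illustrative and not part of Coqui TTS.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Hypothetical helper for illustration; not part of Coqui TTS.
    """
    frames = bytearray()
    for s in samples:
        # Clip first so out-of-range values cannot overflow int16.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)  # 24000 Hz, as in the Quick Start
        wf.writeframes(bytes(frames))
```

Under those assumptions, `write_pcm16_wav("output.wav", outputs["wav"])` would replace the scipy call. The per-sample loop favors clarity over speed; for long clips a vectorized conversion would be faster.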
Using with Coqui TTS CLI
# Clone a voice with this model
tts --model_path ./xtts-antonio-oliveira/model.pth \
    --config_path ./xtts-antonio-oliveira/config.json \
    --text "Bom dia! Como está?" \
    --speaker_wav reference.wav \
    --language_idx pt \
    --out_path output.wav
Docker / API Usage
This model works with the speech-services Docker setup for self-hosted TTS:
# Set environment variable to use fine-tuned model
XTTS_FINETUNED_CHECKPOINT=/path/to/model.pth
# API call
curl -X POST http://localhost:52000/synthesize \
    -H "Content-Type: application/json" \
    -d '{"text": "Olá mundo!", "language": "pt"}' \
    --output speech.wav
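The same request can be issued from Python. The sketch below mirrors the curl call using only the standard library; the endpoint, port, and JSON fields are taken from the example above, while the function names are illustrative. It assumes the service returns raw WAV bytes in the response body, as the `--output speech.wav` flag suggests.

```python
import json
import urllib.request


def build_synthesize_request(text, language="pt",
                             base_url="http://localhost:52000"):
    """Build the POST request for the /synthesize endpoint (not yet sent)."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/synthesize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path="speech.wav", **kwargs):
    """Send the request and save the response body (assumed to be WAV bytes)."""
    request = build_synthesize_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the container running, `synthesize_to_file("Olá mundo!")` would write `speech.wav` to the current directory.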
Training Methodology
1. Problem Statement
The XTTS v2 base model exhibits a significant domain bias toward Brazilian Portuguese (PT-BR) phonology, prosody, and intonation patterns. This bias manifests in generated speech regardless of the reference speaker's accent, presenting a challenge for applications requiring authentic European Portuguese (PT-PT) synthesis. Our objective was to adapt the pre-trained model to produce native PT-PT speech characteristics while preserving the model's multilingual capabilities.
2. Dataset
| Attribute | Specification |
|---|---|
| Total Duration | 62 minutes 52 seconds |
| Number of Utterances | 503 |
| Mean Utterance Length | 7.5 ± 3.2 seconds |
| Sample Rate | 22,050 Hz |
| Bit Depth | 16-bit PCM |
| Speaker | Single male native PT-PT speaker |
| Recording Conditions | Studio environment, SNR > 40 dB |
The dataset was segmented using voice activity detection (VAD), with a minimum segment duration of 3 seconds and a maximum of 15 seconds. Audio was normalized to -3 dB peak amplitude and noise-reduced using spectral gating.
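As a quick consistency check on the table, 62 min 52 s is 3,772 seconds, and 3,772 / 503 ≈ 7.5 seconds per utterance, matching the reported mean. The two numeric preprocessing rules (segment length limits and peak normalization) can be sketched as follows. This is an illustration only, not the actual preprocessing code: the function names are hypothetical, samples are assumed to be floats relative to full scale, and VAD and spectral gating are not shown.

```python
def keep_segment(num_samples, sample_rate=22050, min_s=3.0, max_s=15.0):
    """Return True if a segment's duration lies within the 3-15 second window."""
    duration = num_samples / sample_rate
    return min_s <= duration <= max_s


def peak_normalize(samples, target_db=-3.0):
    """Scale samples so the largest absolute value sits at target_db."""
    if not samples:
        return []
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)                 # silent clip: nothing to scale
    target_peak = 10 ** (target_db / 20.0)   # -3 dB is roughly 0.708 of full scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Peak normalization applies one gain to the whole clip, so relative dynamics within an utterance are unchanged.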
3. Fine-tuning Approach
We employed full fine-tuning of the GPT-2-based autoregressive decoder within the XTTS v2 architecture. The GPT component comprises approximately 467M parameters, all of which were updated during training.
3.1 Base Model Architecture
XTTS v2 utilizes a hybrid architecture consisting of:
- Text Encoder: Transformer-based text processing with language embeddings
- GPT Decoder: 30-layer GPT-2 variant for autoregressive audio token prediction
- HiFi-GAN Vocoder: Neural vocoder for waveform synthesis from mel-spectrograms
- Speaker Encoder: Conditioning mechanism for voice characteristics
3.2 Training Configuration
| Hyperparameter | Value | Justification |
|---|---|---|
| Optimizer | AdamW | Standard for transformer fine-tuning |
| Learning Rate | 5 × 10⁻⁵ | Determined via ablation (see §4) |
| LR Schedule | Linear warmup + decay | 10% warmup steps |
| Batch Size | 1 | Memory constraints |
| Gradient Accumulation | 42 steps | Effective batch ≈ 42 |
| Epochs | 200 | Early stopping patience: 50 |
| Max Conditioning Length | 132,300 samples (~6s) | Speaker embedding extraction |
| Total Training Steps | 90,750 | ~3.5 hours on single GPU |
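To make the schedule row concrete, here is a minimal sketch of a linear warmup followed by linear decay, using the values from the table. Two points are assumptions not stated in the table: that the decay runs linearly to zero, and that "Total Training Steps" is the step count the scheduler sees. The function name `lr_at` is illustrative.

```python
PEAK_LR = 5e-5
TOTAL_STEPS = 90_750
WARMUP_STEPS = TOTAL_STEPS // 10        # 10% warmup -> 9,075 steps


def lr_at(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS,
          warmup_steps=WARMUP_STEPS):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Derived quantities from the same table.
EFFECTIVE_BATCH = 1 * 42                 # batch size x gradient accumulation steps
CONDITIONING_SECONDS = 132_300 / 22_050  # samples / sample rate = 6 seconds
```

The last line also confirms that the conditioning length of 132,300 samples corresponds to 6 seconds at the dataset's 22,050 Hz sample rate.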
3.3 Loss Function Modification
A critical insight from our experiments was that the default loss weighting prioritizes text intelligibility over acoustic fidelity to the target speaker. For accent adaptation, this balance must be inverted. We modified the composite loss function:
| Component | Default | Modified | Ratio Change |
|---|---|---|---|
| α (mel-spectrogram CE) | 1.0 | 3.0 | +200% |
| β (text alignment CE) | 0.01 | 0.001 | -90% |
| Weight Decay (λ) | 0.01 | 0.0001 | -99% |
The reduced weight decay (λ) permits greater deviation from pre-trained weights, essential for overcoming the PT-BR bias embedded in the base model's parameter space.
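The effect of the reweighting can be illustrated with a small sketch. It assumes the composite loss is a plain weighted sum of the mel and text cross-entropy terms, which the table implies but does not state outright; the class and function names are hypothetical and not taken from the training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    mel_ce: float   # weight on the mel-spectrogram cross-entropy term
    text_ce: float  # weight on the text alignment cross-entropy term


DEFAULT = LossWeights(mel_ce=1.0, text_ce=0.01)
MODIFIED = LossWeights(mel_ce=3.0, text_ce=0.001)


def combined_loss(mel_ce_loss, text_ce_loss, weights):
    """Weighted sum of the two loss terms (assumed form of the composite loss)."""
    return weights.mel_ce * mel_ce_loss + weights.text_ce * text_ce_loss


def mel_to_text_ratio(weights):
    """How strongly the mel term is weighted relative to the text term."""
    return weights.mel_ce / weights.text_ce
```

Under this assumption, the relative emphasis on the mel term rises from 100:1 with the default weights to 3000:1 with the modified ones, a 30-fold shift toward acoustic fidelity.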
4. Ablation Study
We conducted systematic experiments to determine optimal hyperparameters for accent transfer:
| Experiment | Epochs | η (LR) | Loss Weights | Observations |
|---|---|---|---|---|
| Exp. 1 (Baseline) | 10 | 5×10⁻⁶ | Default | Minimal adaptation; PT-BR prosody preserved |
| Exp. 2 | 50 | 1×10⁻⁵ | Default | Partial adaptation; hybrid accent |
| Exp. 3 | 100 | 2×10⁻⁵ | Default | Training instability; loss divergence |
| Exp. 4 | 100 | 1×10⁻⁵ | Default | Stable; moderate PT-PT characteristics |
| Exp. 5 | 200 | 5×10⁻⁵ | Modified | Optimal PT-PT adaptation |
Key Findings:
- Standard fine-tuning (Exp. 1-4) was insufficient for complete accent transfer
- Extended training alone was insufficient without loss reweighting
- Modified loss weights were essential for prioritizing acoustic characteristics
- A learning rate of 5×10⁻⁵ with 200 epochs achieved convergence without instability
5. Training Dynamics
Final training metrics:
- Final Training Loss: 0.002
- Evaluation Loss (mel-CE): 22.1
- Convergence: Achieved at epoch ~180
The elevated mel-CE evaluation loss (compared to base model) is expected and acceptable, as it reflects the model's deviation from the PT-BR-biased training distribution toward PT-PT acoustic patterns.
6. Computational Resources
| Resource | Specification |
|---|---|
| Hardware | NVIDIA DGX Spark |
| GPU | NVIDIA GB10 (Grace Blackwell) |
| GPU Memory | 128 GB unified |
| Training Time | 3 hours 25 minutes |
| Framework | PyTorch 2.6+ with Coqui TTS |
Model Files
| File | Size | Description |
|---|---|---|
| model.pth | 1.9 GB | Fine-tuned GPT weights (inference-optimized) |
| config.json | 4.3 KB | Model configuration |
| vocab.json | 353 KB | Tokenizer vocabulary |
| dvae.pth | 201 MB | Discrete VAE (from base) |
| mel_stats.pth | 1.1 KB | Mel normalization stats |
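Before loading the model, it can be useful to confirm that all five files listed above are present in the local directory. The following is a small sketch using only the standard library; the function name is illustrative.

```python
import os

# File names as listed in the Model Files table.
REQUIRED_FILES = ("model.pth", "config.json", "vocab.json",
                  "dvae.pth", "mel_stats.pth")


def missing_model_files(local_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in local_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(local_dir, name))]
```

For example, `missing_model_files("./xtts-antonio-oliveira")` returns an empty list once every file has been downloaded.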
Limitations
- Reference audio input: The XTTS architecture still requires a reference audio clip at inference time. Because fine-tuning bakes the voice characteristics into the model, any short audio clip works as the reference
- Single speaker: This model produces one specific voice (the Antonio Oliveira PT-PT male voice)
- Language mixing: Although the model is multilingual, mixing PT-PT with other languages in the same sentence may produce inconsistent results
Intended Use
Primary Use Cases
- Text-to-speech for European Portuguese content
- Voice assistants and chatbots for the Portuguese market
- Audiobook narration in PT-PT
- Accessibility applications (screen readers, etc.)
- Language learning tools
- Game and media localization
Out of Scope
- Real-time voice conversion
- Speaker verification/identification
- Impersonation or deceptive purposes
License
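This model is released under the Coqui Public Model License (CPML), the license of the base XTTS v2 model. Under the CPML:
- Non-commercial use of the model and its outputs is permitted
- Modification and distribution are permitted under the same license
- No royalties are required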
Please review the full license for complete terms.
Citation
If you use this model in your research or applications, please cite:
@misc{antonio-oliveira-xtts-pt-pt,
author = {Ramos, Martim},
title = {XTTS v2 Antonio Oliveira: A Fine-tuned Text-to-Speech Model for European Portuguese},
year = {2025},
publisher = {Hugging Face Hub},
url = {https://huggingface.co/Martim-Ramos-Neural/xtts-v2-antonio-oliveira-pt-pt},
note = {Fine-tuned from coqui/XTTS-v2 with modified loss weighting for accent adaptation}
}
Acknowledgments
- Coqui AI for the amazing XTTS v2 base model
- The open-source TTS community
Contact
For questions, issues, or collaboration:
- Hugging Face: @Martim-Ramos-Neural
Made with love in Portugal