XTTS v2 - Antonio Oliveira (European Portuguese / PT-PT)

Open-source fine-tuned XTTS v2 voice for European Portuguese (Portugal)

License: CPML | Language: PT-PT | Base Model: XTTS v2

Model Description

This is a fine-tuned XTTS v2 model trained specifically for European Portuguese (PT-PT), the variety of Portuguese spoken in Portugal. The base XTTS v2 model has a strong Brazilian Portuguese bias, which makes it difficult to generate speech that sounds authentically European Portuguese. This fine-tuned version addresses that limitation.

Key Features

  • Authentic PT-PT accent - Trained on European Portuguese speech, not Brazilian
  • Male voice - Natural male voice characteristics
  • High quality - 200 epochs of training with optimized loss weights
  • Drop-in replacement - Works with standard Coqui TTS / XTTS v2 code
  • Multilingual capable - Can still generate speech in other XTTS-supported languages

Why This Model?

European Portuguese (PT-PT) TTS voices are extremely rare in open source. Most Portuguese TTS models are trained on Brazilian Portuguese (PT-BR), which has significantly different pronunciation, intonation, and rhythm. This model fills that gap for:

  • Portuguese companies and developers
  • Language learning applications
  • Accessibility tools for Portugal
  • Voice assistants targeting European Portuguese speakers
  • Game localization for the Portuguese market
  • Audiobook generation in PT-PT

Usage

Installation

pip install TTS

Quick Start (Python)

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
from huggingface_hub import hf_hub_download
import torch
import os

# Download model files
model_path = "Martim-Ramos-Neural/xtts-v2-antonio-oliveira-pt-pt"
local_dir = "./xtts-antonio-oliveira"

for filename in ["model.pth", "config.json", "vocab.json", "dvae.pth", "mel_stats.pth"]:
    hf_hub_download(repo_id=model_path, filename=filename, local_dir=local_dir)

# Load model
config = XttsConfig()
config.load_json(os.path.join(local_dir, "config.json"))
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=local_dir,
    use_deepspeed=False
)
model.cuda()

# Generate speech
outputs = model.synthesize(
    text="Olá, bem-vindo a Portugal! Este é um exemplo de síntese de voz em Português Europeu.",
    config=config,
    speaker_wav="/path/to/reference.wav",  # Any 6+ second audio sample
    language="pt",
    gpt_cond_len=6,
    temperature=0.7,
)

# Save audio
import scipy.io.wavfile as wav
wav.write("output.wav", 24000, outputs["wav"])
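
The Quick Start above writes the waveform as returned by the model. As a minimal sketch, assuming outputs["wav"] is a mono sequence of floats in the range [-1.0, 1.0] (an assumption, not something this card states), the helper below writes 16-bit PCM using only the standard library. The name write_pcm16_wav is hypothetical and not part of Coqui TTS.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=24000):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM WAV."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip anything outside the assumed range
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)  # 24000 Hz, as in the Quick Start
        f.writeframes(bytes(frames))
```

It could be called as write_pcm16_wav("output_pcm16.wav", outputs["wav"]) in place of the scipy call when a player or tool expects integer PCM.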

Using with Coqui TTS CLI

# Clone a voice with this model
tts --model_path ./xtts-antonio-oliveira/model.pth \
    --config_path ./xtts-antonio-oliveira/config.json \
    --text "Bom dia! Como está?" \
    --speaker_wav reference.wav \
    --language_idx pt \
    --out_path output.wav

Docker / API Usage

This model works with the speech-services Docker setup for self-hosted TTS:

# Set environment variable to use fine-tuned model
XTTS_FINETUNED_CHECKPOINT=/path/to/model.pth

# API call
curl -X POST http://localhost:52000/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Olá mundo!", "language": "pt"}' \
  --output speech.wav
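
The same request can be sketched in Python with the standard library. The URL, JSON fields, and output file name are taken from the curl call above. That the response body is raw WAV data is an assumption based on the --output speech.wav flag, and both function names are hypothetical.

```python
import json
import urllib.request


def build_synthesize_request(text, language="pt",
                             url="http://localhost:52000/synthesize"):
    """Build the POST request equivalent to the curl example."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path="speech.wav", **kwargs):
    """Send the request and save the response body (assumed to be WAV bytes)."""
    req = build_synthesize_request(text, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the service running, synthesize_to_file("Olá mundo!") would mirror the curl call.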

Training Methodology

1. Problem Statement

The XTTS v2 base model exhibits a significant domain bias toward Brazilian Portuguese (PT-BR) phonology, prosody, and intonation patterns. This bias manifests in generated speech regardless of the reference speaker's accent, presenting a challenge for applications requiring authentic European Portuguese (PT-PT) synthesis. Our objective was to adapt the pre-trained model to produce native PT-PT speech characteristics while preserving the model's multilingual capabilities.

2. Dataset

Attribute             | Specification
Total Duration        | 62 minutes 52 seconds
Number of Utterances  | 503
Mean Utterance Length | 7.5 ± 3.2 seconds
Sample Rate           | 22,050 Hz
Bit Depth             | 16-bit PCM
Speaker               | Single male native PT-PT speaker
Recording Conditions  | Studio environment, SNR > 40 dB

The dataset was preprocessed using voice activity detection (VAD) segmentation with minimum segment duration of 3 seconds and maximum of 15 seconds. Audio was normalized to -3 dB peak amplitude and noise-reduced using spectral gating.
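
Two of the preprocessing steps described above, the 3 to 15 second duration filter and the -3 dB peak normalization, can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: samples are assumed to be floats with full scale at 1.0, and the function names are hypothetical. VAD and spectral gating are not shown.

```python
def keep_segment(duration_s, min_s=3.0, max_s=15.0):
    """Return True if a VAD segment falls inside the allowed duration window."""
    return min_s <= duration_s <= max_s


def peak_normalize(samples, target_db=-3.0):
    """Scale samples so the largest absolute value sits at target_db dBFS."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    target_peak = 10 ** (target_db / 20.0)  # -3 dB is roughly 0.708 of full scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```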

3. Fine-tuning Approach

We employed full fine-tuning of the GPT-2 based autoregressive decoder within the XTTS v2 architecture. The model comprises approximately 467M parameters in the GPT component, all of which were updated during training.

3.1 Base Model Architecture

XTTS v2 utilizes a hybrid architecture consisting of:

  • Text Encoder: Transformer-based text processing with language embeddings
  • GPT Decoder: 30-layer GPT-2 variant for autoregressive audio token prediction
  • HiFi-GAN Vocoder: Neural vocoder for waveform synthesis from the GPT decoder's latent outputs
  • Speaker Encoder: Conditioning mechanism for voice characteristics

3.2 Training Configuration

Hyperparameter          | Value                 | Justification
Optimizer               | AdamW                 | Standard for transformer fine-tuning
Learning Rate           | 5 × 10⁻⁵              | Determined via ablation (see §4)
LR Schedule             | Linear warmup + decay | 10% warmup steps
Batch Size              | 1                     | Memory constraints
Gradient Accumulation   | 42 steps              | Effective batch ≈ 42
Epochs                  | 200                   | Early stopping patience: 50
Max Conditioning Length | 132,300 samples (~6s) | Speaker embedding extraction
Total Training Steps    | 90,750                | ~3.5 hours on single GPU
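
The derived values in the configuration above can be checked with a few lines of arithmetic, and the schedule can be sketched as a function of the step index. The sketch assumes warmup rises linearly from zero, decay falls linearly to zero, and the schedule is indexed over the 90,750 total steps. None of these details are specified here, and lr_at is a hypothetical name.

```python
# Values taken from the training configuration and dataset description.
effective_batch = 1 * 42                  # batch size x accumulation steps
conditioning_seconds = 132_300 / 22_050   # samples / sample rate


def lr_at(step, total_steps, peak_lr=5e-5, warmup_frac=0.10):
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


if __name__ == "__main__":
    for step in (0, 4_537, 9_075, 50_000, 90_750):
        print(step, lr_at(step, 90_750))
```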

3.3 Loss Function Modification

A critical insight from our experiments was that, for accent adaptation, acoustic fidelity to the target speaker must be weighted more heavily relative to text intelligibility than it is by default. We therefore increased the mel weight and reduced the text weight in the composite loss function:

$$\mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{mel-ce}} + \beta \cdot \mathcal{L}_{\text{text-ce}}$$

Component               | Default | Modified | Change
α (mel-spectrogram CE)  | 1.0     | 3.0      | +200%
β (text alignment CE)   | 0.01    | 0.001    | -90%
Weight Decay (λ)        | 0.01    | 0.0001   | -99%

The reduced weight decay (λ) permits greater deviation from pre-trained weights, essential for overcoming the PT-BR bias embedded in the base model's parameter space.
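
To make the reweighting concrete, the sketch below evaluates the composite loss for the default and modified weights. The weights come from the table above, while the two input loss values are made-up placeholders rather than measurements from training. The function names are hypothetical.

```python
DEFAULT_WEIGHTS = {"alpha": 1.0, "beta": 0.01}
MODIFIED_WEIGHTS = {"alpha": 3.0, "beta": 0.001}


def composite_loss(mel_ce, text_ce, alpha, beta):
    """L_total = alpha * L_mel-ce + beta * L_text-ce."""
    return alpha * mel_ce + beta * text_ce


def mel_to_text_ratio(alpha, beta):
    """How many times more the mel term is weighted than the text term."""
    return alpha / beta


if __name__ == "__main__":
    mel_ce, text_ce = 2.0, 4.0  # placeholder values for illustration only
    for name, w in (("default", DEFAULT_WEIGHTS), ("modified", MODIFIED_WEIGHTS)):
        print(name, composite_loss(mel_ce, text_ce, **w), mel_to_text_ratio(**w))
```

The relative emphasis on the mel term moves from about 100:1 with the default weights to about 3000:1 with the modified ones.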

4. Ablation Study

We conducted systematic experiments to determine optimal hyperparameters for accent transfer:

Experiment | Epochs | η (LR) | Loss Weights | Observations
Baseline   | 10     | 5×10⁻⁶ | Default      | Minimal adaptation; PT-BR prosody preserved
Exp. 2     | 50     | 1×10⁻⁵ | Default      | Partial adaptation; hybrid accent
Exp. 3     | 100    | 2×10⁻⁵ | Default      | Training instability; loss divergence
Exp. 4     | 100    | 1×10⁻⁵ | Default      | Stable; moderate PT-PT characteristics
Exp. 5     | 200    | 5×10⁻⁵ | Modified     | Optimal PT-PT adaptation

Key Findings:

  1. Standard fine-tuning with default loss weights (Baseline and Exp. 2-4) was insufficient for complete accent transfer
  2. Extended training alone was insufficient without loss reweighting
  3. Modified loss weights were essential for prioritizing acoustic characteristics
  4. A learning rate of 5×10⁻⁵ with 200 epochs achieved convergence without instability

5. Training Dynamics

Final training metrics:

  • Final Training Loss: 0.002
  • Evaluation Loss (mel-CE): 22.1
  • Convergence: Achieved at epoch ~180

The elevated mel-CE evaluation loss (compared to base model) is expected and acceptable, as it reflects the model's deviation from the PT-BR-biased training distribution toward PT-PT acoustic patterns.

6. Computational Resources

Resource      | Specification
Hardware      | NVIDIA DGX Spark
GPU           | NVIDIA GB10 (Grace Blackwell)
GPU Memory    | 128 GB unified
Training Time | 3 hours 25 minutes
Framework     | PyTorch 2.6+ with Coqui TTS

Model Files

File          | Size   | Description
model.pth     | 1.9 GB | Fine-tuned GPT weights (inference-optimized)
config.json   | 4.3 KB | Model configuration
vocab.json    | 353 KB | Tokenizer vocabulary
dvae.pth      | 201 MB | Discrete VAE (from base)
mel_stats.pth | 1.1 KB | Mel normalization stats
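
Before loading the checkpoint, it can help to confirm that all five files listed above are present in the download directory. The check below is a small sketch, with missing_files as a hypothetical helper name. The directory in the usage note is the local_dir from the Quick Start.

```python
import os

REQUIRED_FILES = ["model.pth", "config.json", "vocab.json", "dvae.pth", "mel_stats.pth"]


def missing_files(local_dir, required=REQUIRED_FILES):
    """Return the names of required model files not found in local_dir."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(local_dir, name))
    ]
```

For example, missing_files("./xtts-antonio-oliveira") returns an empty list once every file has been downloaded.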

Limitations

  • Reference audio input: The XTTS architecture still requires a reference audio clip at inference time. After fine-tuning, however, the voice characteristics are baked into the model, so any short clip works as the reference (the Quick Start uses a sample of 6 seconds or more)
  • Single speaker: This model produces one specific voice (Antonio Oliveira PT-PT male voice)
  • Language mixing: Although the model remains multilingual, mixing PT-PT with other languages in the same sentence may produce inconsistent results

Intended Use

Primary Use Cases

  • Text-to-speech for European Portuguese content
  • Voice assistants and chatbots for the Portuguese market
  • Audiobook narration in PT-PT
  • Accessibility applications (screen readers, etc.)
  • Language learning tools
  • Game and media localization

Out of Scope

  • Real-time voice conversion
  • Speaker verification/identification
  • Impersonation or deceptive purposes

License

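This model is released under the Coqui Public Model License (CPML), the same license as the XTTS v2 base model. In summary, the CPML:

  • Permits non-commercial use of the model and its outputs
  • Permits modification and redistribution under the same license terms
  • Does not permit commercial use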

Please review the full license for complete terms.

Citation

If you use this model in your research or applications, please cite:

@misc{antonio-oliveira-xtts-pt-pt,
  author = {Ramos, Martim},
  title = {XTTS v2 Antonio Oliveira: A Fine-tuned Text-to-Speech Model for European Portuguese},
  year = {2025},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/Martim-Ramos-Neural/xtts-v2-antonio-oliveira-pt-pt},
  note = {Fine-tuned from coqui/XTTS-v2 with modified loss weighting for accent adaptation}
}

Acknowledgments

  • Coqui AI for the amazing XTTS v2 base model
  • The open-source TTS community

Contact

For questions, issues, or collaboration:


Made with love in Portugal
