XTTS v2 - Antonio Oliveira (European Portuguese / PT-PT)

Open-source fine-tuned XTTS v2 voice for European Portuguese (Portugal)

Model Description

This is a fine-tuned XTTS v2 model trained specifically for European Portuguese (PT-PT), the Portuguese spoken in Portugal. The base XTTS v2 model has a strong Brazilian Portuguese bias, which makes it difficult to generate authentic European Portuguese speech. This fine-tuned version addresses that limitation.

Key Features

Authentic PT-PT accent - Trained on European Portuguese speech, not Brazilian
Male voice - Natural male voice characteristics
High quality - 200 epochs of training with optimized loss weights
Drop-in replacement - Works with standard Coqui TTS / XTTS v2 code
Multilingual capable - Can still generate speech in other XTTS-supported languages

Why This Model?

European Portuguese (PT-PT) TTS voices are extremely rare in open source. Most Portuguese TTS models are trained on Brazilian Portuguese (PT-BR), which has significantly different pronunciation, intonation, and rhythm. This model fills that gap for:

Portuguese companies and developers
Language learning applications
Accessibility tools for Portugal
Voice assistants targeting European Portuguese speakers
Game localization for the Portuguese market
Audiobook generation in PT-PT

Usage

Installation

pip install TTS

Quick Start (Python)

import os

import scipy.io.wavfile as wav
import torch
from huggingface_hub import hf_hub_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download model files
model_path = "Martim-Ramos-Neural/xtts-v2-antonio-oliveira-pt-pt"
local_dir = "./xtts-antonio-oliveira"

for filename in ["model.pth", "config.json", "vocab.json", "dvae.pth", "mel_stats.pth"]:
    hf_hub_download(repo_id=model_path, filename=filename, local_dir=local_dir)

# Load model
config = XttsConfig()
config.load_json(os.path.join(local_dir, "config.json"))
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=local_dir,
    use_deepspeed=False
)
# Use the GPU when available; otherwise fall back to (much slower) CPU inference
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Generate speech
outputs = model.synthesize(
    text="Olá, bem-vindo a Portugal! Este é um exemplo de síntese de voz em Português Europeu.",
    config=config,
    speaker_wav="/path/to/reference.wav",  # Any 6+ second audio sample
    language="pt",
    gpt_cond_len=6,
    temperature=0.7,
)

# Save audio (the sample rate passed here is 24 kHz)
wav.write("output.wav", 24000, outputs["wav"])
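
The reference clip passed as speaker_wav should be at least about 6 seconds long. A minimal sketch for checking that before synthesis, using only the standard library. The helper names and the threshold constant are illustrative (they are not part of Coqui TTS), and the sketch assumes an uncompressed PCM WAV file:

```python
import wave

# Assumption: mirrors the "6+ second" guidance in the Quick Start comment
MIN_REFERENCE_SECONDS = 6.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path, min_seconds=MIN_REFERENCE_SECONDS):
    """Return True if the reference clip meets the minimum duration."""
    return wav_duration_seconds(path) >= min_seconds
```

Calling is_long_enough("/path/to/reference.wav") before model.synthesize gives an early, readable failure point instead of a poor-quality result.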

Using with Coqui TTS CLI

# Clone a voice with this model
tts --model_path ./xtts-antonio-oliveira/model.pth \
    --config_path ./xtts-antonio-oliveira/config.json \
    --text "Bom dia! Como está?" \
    --speaker_wav reference.wav \
    --language_idx pt \
    --out_path output.wav

Docker / API Usage

This model works with the speech-services Docker setup for self-hosted TTS:

# Set environment variable to use fine-tuned model
XTTS_FINETUNED_CHECKPOINT=/path/to/model.pth

# API call
curl -X POST http://localhost:52000/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Olá mundo!", "language": "pt"}' \
  --output speech.wav
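
The same API call can be made from Python. This is a sketch using only the standard library; the URL and JSON fields are copied from the curl example above, while the function names are hypothetical and the response is assumed to be raw WAV bytes:

```python
import json
import urllib.request

# Assumption: endpoint taken from the curl example; adjust host/port as needed
API_URL = "http://localhost:52000/synthesize"


def build_request(text, language="pt", url=API_URL):
    """Build the POST request without sending it."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path="speech.wav"):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_request(text)) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the service running, synthesize_to_file("Olá mundo!") would write speech.wav to the current directory.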

Training Methodology

1. Problem Statement

The XTTS v2 base model exhibits a significant domain bias toward Brazilian Portuguese (PT-BR) phonology, prosody, and intonation patterns. This bias manifests in generated speech regardless of the reference speaker's accent, presenting a challenge for applications requiring authentic European Portuguese (PT-PT) synthesis. Our objective was to adapt the pre-trained model to produce native PT-PT speech characteristics while preserving the model's multilingual capabilities.

2. Dataset

Attribute	Specification
Total Duration	62 minutes 52 seconds
Number of Utterances	503
Mean Utterance Length	7.5 ± 3.2 seconds
Sample Rate	22,050 Hz
Bit Depth	16-bit PCM
Speaker	Single male native PT-PT speaker
Recording Conditions	Studio environment, SNR > 40 dB
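
As a quick consistency check on the table above, the utterance count multiplied by the mean utterance length reproduces the reported total duration (to within rounding of the mean):

```python
utterances = 503
mean_length_s = 7.5

total_s = utterances * mean_length_s  # 3772.5 seconds
minutes, seconds = divmod(total_s, 60)

print(f"{int(minutes)} min {seconds:.1f} s")  # prints "62 min 52.5 s"
```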

The dataset was preprocessed using voice activity detection (VAD) segmentation with a minimum segment duration of 3 seconds and a maximum of 15 seconds. Audio was normalized to -3 dB peak amplitude and noise-reduced using spectral gating.
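
Two of the preprocessing steps above, the duration filter and the peak normalization, can be sketched in a few lines. This is an illustration rather than the pipeline actually used: the function names are hypothetical, samples are assumed to be floats in [-1, 1], the -3 dB target is interpreted relative to full scale, and VAD and spectral gating are not shown.

```python
MIN_SEGMENT_S = 3.0
MAX_SEGMENT_S = 15.0
TARGET_PEAK_DB = -3.0


def keep_segment(duration_s):
    """Keep only segments within the 3 to 15 second window."""
    return MIN_SEGMENT_S <= duration_s <= MAX_SEGMENT_S


def peak_normalize(samples, target_db=TARGET_PEAK_DB):
    """Scale samples so the largest magnitude sits at target_db dBFS."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    target_peak = 10 ** (target_db / 20)  # -3 dB is roughly 0.708 of full scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```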

3. Fine-tuning Approach

We employed full fine-tuning of the GPT-2-based autoregressive decoder within the XTTS v2 architecture. The GPT component comprises approximately 467M parameters, all of which were updated during training.

3.1 Base Model Architecture

XTTS v2 utilizes a hybrid architecture consisting of:

Text Encoder: Transformer-based text processing with language embeddings
GPT Decoder: 30-layer GPT-2 variant for autoregressive audio token prediction
HiFi-GAN Vocoder: Neural vocoder for waveform synthesis from the GPT decoder's latent representations
Speaker Encoder: Conditioning mechanism for voice characteristics

3.2 Training Configuration

Hyperparameter	Value	Justification
Optimizer	AdamW	Standard for transformer fine-tuning
Learning Rate	5 × 10⁻⁵	Determined via ablation (see §4)
LR Schedule	Linear warmup + decay	10% warmup steps
Batch Size	1	Memory constraints
Gradient Accumulation	42 steps	Effective batch size = 42
Epochs	200	Early stopping patience: 50
Max Conditioning Length	132,300 samples (~6s)	Speaker embedding extraction
Total Training Steps	90,750	~3.5 hours on a single GPU
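
The schedule and derived quantities in the table can be sketched as follows. This is an illustration under stated assumptions, not the training code: "10% warmup steps" is read as 10% of the total schedule, the decay is assumed to reach zero at the final step, and lr_at is a hypothetical helper name.

```python
PEAK_LR = 5e-5
WARMUP_FRACTION = 0.10


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_fraction=WARMUP_FRACTION):
    """Learning rate at a given 0-indexed step: linear warmup, then linear decay."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate
        return peak_lr * step / warmup_steps
    # Linear decay from the peak down to 0 at the final step (assumed endpoint)
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


# Derived values from the table
effective_batch = 1 * 42                   # batch size x accumulation steps
conditioning_seconds = 132_300 / 22_050    # samples / sample rate = 6.0 s
```

For a 1,000-step schedule, lr_at(50, 1000) is half the peak during warmup and lr_at(1000, 1000) is zero.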

3.3 Loss Function Modification

A critical insight from our experiments was that the default loss weighting does not place enough emphasis on acoustic fidelity to the target speaker for accent adaptation. Although the default already weights the mel term above the text term, we shifted the balance further toward the mel term by modifying the composite loss function:

$\mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{mel-ce}} + \beta \cdot \mathcal{L}_{\text{text-ce}}$

Parameter	Default	Modified	Relative Change
α (mel-spectrogram CE)	1.0	3.0	+200%
β (text alignment CE)	0.01	0.001	-90%
Weight Decay (λ)	0.01	0.0001	-99%

The reduced weight decay (λ) permits greater deviation from pre-trained weights, essential for overcoming the PT-BR bias embedded in the base model's parameter space.
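
The effect of the reweighting can be seen by evaluating the composite loss with both sets of weights. In this sketch the weights come from the table above, while the per-term loss values are made up for illustration and the function names are hypothetical:

```python
def total_loss(mel_ce, text_ce, alpha, beta):
    """Composite loss: alpha * mel-CE + beta * text-CE."""
    return alpha * mel_ce + beta * text_ce


def mel_to_text_ratio(weights):
    """How much more the mel term is weighted than the text term."""
    return weights["alpha"] / weights["beta"]


DEFAULT = {"alpha": 1.0, "beta": 0.01}
MODIFIED = {"alpha": 3.0, "beta": 0.001}

# Illustrative loss values only, not measurements from training
mel_ce, text_ce = 2.0, 0.5

print(total_loss(mel_ce, text_ce, **DEFAULT))
print(total_loss(mel_ce, text_ce, **MODIFIED))
print(mel_to_text_ratio(DEFAULT), mel_to_text_ratio(MODIFIED))
```

The mel-to-text weight ratio rises from roughly 100 with the default weights to roughly 3000 with the modified ones, so gradients from the mel term dominate the update even more strongly.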

4. Ablation Study

We conducted systematic experiments to determine optimal hyperparameters for accent transfer:

Experiment	Epochs	η (LR)	Loss Weights	Observations
Exp. 1 (Baseline)	10	5×10⁻⁶	Default	Minimal adaptation; PT-BR prosody preserved
Exp. 2	50	1×10⁻⁵	Default	Partial adaptation; hybrid accent
Exp. 3	100	2×10⁻⁵	Default	Training instability; loss divergence
Exp. 4	100	1×10⁻⁵	Default	Stable; moderate PT-PT characteristics
Exp. 5	200	5×10⁻⁵	Modified	Optimal PT-PT adaptation

Key Findings:

Standard fine-tuning (Exp. 1-4) was insufficient for complete accent transfer
Extended training alone was insufficient; loss reweighting was also required
Modified loss weights were essential for prioritizing acoustic characteristics
A learning rate of 5×10⁻⁵ over 200 epochs, combined with the modified loss weights, converged without instability

5. Training Dynamics

Final training metrics:

Final Training Loss: 0.002
Evaluation Loss (mel-CE): 22.1
Convergence: Achieved at epoch ~180

The elevated mel-CE evaluation loss (compared to the base model) is expected and acceptable, as it reflects the model's deviation from the PT-BR-biased training distribution toward PT-PT acoustic patterns.

6. Computational Resources

Resource	Specification
Hardware	NVIDIA DGX Spark
GPU	NVIDIA GB10 (Grace Blackwell)
GPU Memory	128 GB unified
Training Time	3 hours 25 minutes
Framework	PyTorch 2.6+ with Coqui TTS

Model Files

File	Size	Description
`model.pth`	1.9 GB	Fine-tuned GPT weights (inference-optimized)
`config.json`	4.3 KB	Model configuration
`vocab.json`	353 KB	Tokenizer vocabulary
`dvae.pth`	201 MB	Discrete VAE (from base)
`mel_stats.pth`	1.1 KB	Mel normalization stats
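
Before loading the model, it can help to confirm that every file in the table is present in the local directory. A small sketch, where the file list is taken from the table and missing_files is a hypothetical helper name:

```python
from pathlib import Path

EXPECTED_FILES = ["model.pth", "config.json", "vocab.json", "dvae.pth", "mel_stats.pth"]


def missing_files(local_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in local_dir."""
    root = Path(local_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list from missing_files("./xtts-antonio-oliveira") means the download step completed for all five files.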

Limitations

Reference audio input: The XTTS architecture still requires a reference audio clip at inference time. After fine-tuning, however, the voice characteristics are baked into the model, so any short audio clip works as the reference
Single speaker: This model produces one specific voice (the Antonio Oliveira PT-PT male voice)
Language mixing: Although the model is multilingual, mixing PT-PT with other languages in the same sentence may produce inconsistent results

Intended Use

Primary Use Cases

Text-to-speech for European Portuguese content
Voice assistants and chatbots for the Portuguese market
Audiobook narration in PT-PT
Accessibility applications (screen readers, etc.)
Language learning tools
Game and media localization

Out of Scope

Real-time voice conversion
Speaker verification/identification
Impersonation or deceptive purposes

License

This model is released under the Coqui Public Model License (CPML), which allows:

Non-commercial use of the model and its outputs (the CPML does not permit commercial use)
Modification and distribution
No royalties required

Please review the full license for complete terms.

Citation

If you use this model in your research or applications, please cite:

@misc{antonio-oliveira-xtts-pt-pt,
  author = {Ramos, Martim},
  title = {XTTS v2 Antonio Oliveira: A Fine-tuned Text-to-Speech Model for European Portuguese},
  year = {2025},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/Martim-Ramos-Neural/xtts-v2-antonio-oliveira-pt-pt},
  note = {Fine-tuned from coqui/XTTS-v2 with modified loss weighting for accent adaptation}
}

Acknowledgments

Coqui AI for the amazing XTTS v2 base model
The open-source TTS community

Contact

For questions, issues, or collaboration:

Hugging Face: @Martim-Ramos-Neural

Made with love in Portugal
