Fine-tuned Spark-TTS: German Emotional Speech Synthesis

This repository contains LoRA adapters for a fine-tuned version of the Spark-TTS (0.5B) model, specialized for German speech synthesis with support for emotional style tags and non-verbal audio tokens.

The model was fine-tuned using LoRA (Low-Rank Adaptation) on a curated German dataset containing high-quality audio with diverse emotional expressions and non-verbal cues.

🚀 Key Highlights

57.14% Lower Test Loss: Test loss fell from 10.0074 (base) to 4.2891 (fine-tuned).
Emotional Support: Handles stylistic tags like [happy], [angry], and [thoughtful].
Non-Verbal Tokens: Accurately synthesizes non-speech sounds like [sighs], [laughter], [yawn], and [growl].
Architecture: Spark-TTS (0.5B) with LoRA Adapters.
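
The bracketed tags listed above can be checked before a prompt is sent to the model. The following is a minimal sketch, assuming tags appear inline in the input text as `[name]`. The sets `EMOTION_TAGS` and `NONVERBAL_TAGS` contain only the tags named in this card, and the helpers `extract_tags` and `unknown_tags` are hypothetical, not part of Spark-TTS.

```python
import re
from typing import List

# Tags named in this model card; the full set the model supports may be larger.
EMOTION_TAGS = {"happy", "angry", "thoughtful"}
NONVERBAL_TAGS = {"sighs", "laughter", "yawn", "growl"}

# Matches any bracketed token such as [happy] or [sighs].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> List[str]:
    """Return all bracketed tag names in order of appearance."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> List[str]:
    """Return tags in the text that are not listed in this card."""
    known = EMOTION_TAGS | NONVERBAL_TAGS
    return [tag for tag in extract_tags(text) if tag not in known]


prompt = "[happy] Das ist ja wunderbar! [laughter] Ich freue mich sehr."
print(extract_tags(prompt))
print(unknown_tags(prompt))
```

A non-empty result from `unknown_tags` points to a tag this card does not mention, which may be a typo or a tag the model was not trained on.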

🏋️ Training Details

The model was trained with the following hyperparameters:

Learning Rate: 0.0005
LoRA Rank (R): 64
LoRA Alpha: 64
Quantization: 4-bit (bitsandbytes)
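
For reference, the values above can be collected into a small configuration object. This is a sketch only: the class `LoraHyperparams` and its field names are hypothetical, and the scaling formula `alpha / rank` is the standard LoRA convention, which this card does not confirm was used (rank-stabilized variants scale differently).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraHyperparams:
    """Hypothetical container for the hyperparameters listed in this card."""

    learning_rate: float = 5e-4  # 0.0005
    rank: int = 64               # LoRA rank (R)
    alpha: int = 64              # LoRA alpha
    quant_bits: int = 4          # 4-bit quantization via bitsandbytes

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank (assumption).
        return self.alpha / self.rank


params = LoraHyperparams()
print(params.scaling)
```

With rank and alpha both set to 64, the adapter update is applied at a scale of 1.0 under this convention.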

🔊 Inference Example

To use this model, you need to load the base Spark-TTS 0.5B model and apply these LoRA adapters.

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model (ensure you have the Spark-TTS architecture code)
base_model = AutoModelForCausalLM.from_pretrained("SparkAudio/Spark-TTS-0.5B", trust_remote_code=True)

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "Vishalshendge3198/spark-tts-german-emotional")

📊 Performance

Metric	Base Model (0.5B)	Fine-tuned (German)	Improvement
Test Loss	10.0074	4.2891	57.14%
German Prosody	Basic	Advanced	High
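
The improvement figure in the table is the relative reduction in test loss. A short sketch of that arithmetic follows; the function name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(base_loss: float, tuned_loss: float) -> float:
    """Return the percentage reduction from base_loss to tuned_loss."""
    return (base_loss - tuned_loss) / base_loss * 100.0


# Test-loss values reported in this card.
improvement = relative_improvement(10.0074, 4.2891)
print(f"{improvement:.2f}%")  # 57.14%
```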

📜 Credits

Developed as part of a German TTS fine-tuning project using the Spark-TTS architecture by SparkAudio.
