---
library_name: chatterbox
tags:
- chatterbox
- text-to-speech
- tts
- german
- kartoffel
- speech-generation
- voice-cloning
- turbo
language:
- de
base_model:
- ResembleAI/chatterbox-turbo
pipeline_tag: text-to-speech
license: cc-by-4.0
---

# ⚡ Kartoffelbox-Turbo
### German Text-to-Speech

![Kartoffelbox](https://huggingface.co/SebastianBodza/Kartoffelbox_Turbo/resolve/main/kartoffel.webp "Kartoffelbox")


**Kartoffelbox-Turbo** is a fine-tuned version of [Resemble AI's Chatterbox-Turbo](https://github.com/resemble-ai/chatterbox), optimized specifically for the German language. 

Built on the **350M parameter Turbo architecture**, this model delivers German speech generation with significantly lower compute requirements and reduced latency compared to previous 500M+ parameter versions.

<div style="display: flex; align-items: center; gap: 12px">
  <a href="https://huggingface.co/spaces/SebastianBodza/chatterbox-turbo-demo">
    <img src="https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg" alt="Open in HF Spaces" />
  </a>
</div>

## Key Features

* **⚡ Turbo Speed:** Built on the Chatterbox-Turbo architecture (350M parameters) for fast synthesis.
* **🇩🇪 German Optimized:** Fine-tuned specifically for natural German prosody and pronunciation.
* **Low Resource:** Runs efficiently with less VRAM than the standard 500M model.

## ⚠️ Limitations & Paralinguistic Tags  
**Current Status:** Experimental  
This model is an experimental release. During the final training phase, the loss diverged after 2.5 days.
* **Paralinguistic Tags:** I only introduced the paralinguistic tags (such as `[laugh]`, `[sigh]`, `[breath]`) during the final fine-tuning stage. Because that stage diverged, **these tags are likely not supported** in this version.


## Installation

You need the base `chatterbox-tts` library to run this model.

```bash
pip install chatterbox-tts
```
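
To confirm the install worked, you can check that the `chatterbox` package (the import name used in the usage example) can be located. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of `chatterbox-tts`.

```python
import importlib.util


def is_installed(package: str = "chatterbox") -> bool:
    """Return True if `package` can be located, without importing it."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    if is_installed():
        print("chatterbox is available")
    else:
        print("chatterbox not found; run: pip install chatterbox-tts")
```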

## Usage

Because this is a fine-tune of the Turbo model, you must load the base architecture first and then apply the Kartoffelbox weights to the `t3` module.

```python
import torch
import torchaudio
from chatterbox.tts_turbo import ChatterboxTurboTTS
from huggingface_hub import hf_hub_download

# 1. Define Model Repository
MODEL_REPO = "SebastianBodza/Kartoffelbox_Turbo"
MODEL_FILENAME = "model.pt"
device = "cuda" if torch.cuda.is_available() else "cpu"

# 2. Load the Base Chatterbox-Turbo Model
print("Loading base Turbo model...")
model = ChatterboxTurboTTS.from_pretrained(device)

# 3. Download and Load the Fine-Tuned German Weights
print(f"Downloading weights from {MODEL_REPO}...")
checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILENAME)
checkpoint_state = torch.load(checkpoint_path, map_location=device)

# Clean and apply state dict to the t3 module
cleaned_state_dict = {
    k.replace("_orig_mod.", ""): v for k, v in checkpoint_state.items()
}
model.t3.load_state_dict(cleaned_state_dict)
model.t3.eval()
print("✓ Kartoffel-Turbo weights loaded successfully.")

# 4. Generate Speech
text = "Elias blieb stehen. War es wirklich schon zehn Jahre her? Er musste leise lachen."

# You need a reference audio file (10-20s) for voice cloning
# Ensure the reference audio matches the tone you want
audio_prompt_path = "your_german_reference.wav" 

wav = model.generate(
    text,
    audio_prompt_path=audio_prompt_path,
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.95
)

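# 5. Save output
# torchaudio.save expects a 2-D (channels, samples) tensor, so keep a mono channel dimension
torchaudio.save("kartoffel_output.wav", wav.detach().cpu().reshape(1, -1), model.sr)
print("Saved to kartoffel_output.wav")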
```
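
The key-cleaning step above removes the `_orig_mod.` marker, which `torch.compile` adds to parameter names when the state dict of a compiled module is saved. The sketch below isolates that step on a plain dictionary so it can be inspected without downloading any weights. The helper name `clean_state_dict_keys` and the example keys are made up for illustration and are not the real `t3` parameter names.

```python
def clean_state_dict_keys(state_dict: dict, marker: str = "_orig_mod.") -> dict:
    """Return a copy of `state_dict` with `marker` removed from every key."""
    return {key.replace(marker, ""): value for key, value in state_dict.items()}


# Hypothetical keys; plain floats stand in for real tensors
compiled = {
    "_orig_mod.tfmr.layers.0.weight": 1.0,
    "_orig_mod.speech_head.bias": 2.0,
    "text_emb.weight": 3.0,  # keys without the marker pass through unchanged
}

cleaned = clean_state_dict_keys(compiled)
print(sorted(cleaned))
```

If `load_state_dict` reports missing or unexpected keys, printing a few keys before and after cleaning like this is a quick way to see whether a prefix mismatch is the cause.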

## Tips for Best Results

* **Reference Audio:** Use a clean, high-quality German reference clip (approx. 10-20 seconds). The model is zero-shot, so it will attempt to clone the voice provided.
* **Parameters:**
  * `temperature`: Controls randomness. `0.8` is a good default. Lower it for more stability, raise it for more variation.
  * `repetition_penalty`: If the model stutters, try increasing this slightly (e.g., `1.2`).
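
To check a reference clip against the 10-20 second guideline before generating, something like the following may help. It is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only; the helper names `wav_duration_seconds` and `in_recommended_range` are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def in_recommended_range(path: str, low: float = 10.0, high: float = 20.0) -> bool:
    """Check a clip against the 10-20 second guideline (inclusive bounds)."""
    return low <= wav_duration_seconds(path) <= high
```

For example, `in_recommended_range("your_german_reference.wav")` returns `False` for a clip that is too short or too long, which is a cue to trim or re-record it. The check covers length only, not audio quality.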

## Training Metrics

This model was an initial attempt at fine-tuning the Chatterbox Turbo architecture. As the pipeline utilizes online voice cloning, the training process is computationally intensive.

Below are the plots for the Training and Validation Loss before divergence:
![Train Speech Loss](https://huggingface.co/SebastianBodza/Kartoffelbox_Turbo/resolve/main/train.webp "Train Speech Loss")
![Validation Speech Loss](https://huggingface.co/SebastianBodza/Kartoffelbox_Turbo/resolve/main/validation.webp "Validation Speech Loss")


## Acknowledgements

* **Resemble AI** for the [Chatterbox-Turbo](https://github.com/resemble-ai/chatterbox) architecture.
* **FunAudioLLM** for CosyVoice.