---
library_name: chatterbox
tags:
- chatterbox
- text-to-speech
- tts
- german
- kartoffel
- speech-generation
- voice-cloning
- turbo
language:
- de
base_model:
- ResembleAI/chatterbox-turbo
pipeline_tag: text-to-speech
license: cc-by-4.0
---

# ⚡ Kartoffelbox-Turbo

### German Text-to-Speech

![Kartoffelbox](https://huggingface.co/SebastianBodza/Kartoffelbox_Turbo/resolve/main/kartoffel.webp "Kartoffelbox")

**Kartoffelbox-Turbo** is a fine-tuned version of [Resemble AI's Chatterbox-Turbo](https://github.com/resemble-ai/chatterbox), optimized specifically for the German language. Built on the **350M parameter Turbo architecture**, this model delivers German speech generation with significantly lower compute requirements and reduced latency compared to previous 500M+ parameter versions.

## Key Features

* **⚡ Turbo Speed:** Built on the Chatterbox-Turbo architecture (350M params) for fast synthesis.
* **🇩🇪 German Optimized:** Fine-tuned specifically for natural German prosody and pronunciation.
* **Low Resource:** Runs efficiently with less VRAM than the standard 500M model.

## ⚠️ Limitations & Paralinguistic Tags

**Current Status:** Experimental

Please note that this model is an experimental release. During the final training phase, the loss diverged after 2.5 days.

* **Paralinguistic Tags:** I only used the paralinguistic tags (such as `[laugh]`, `[sigh]`, `[breath]`) during the final fine-tuning stage. Due to the training divergence, **these tags are likely not supported** in this version.

## Installation

You need the base `chatterbox-tts` library to run this model.

```bash
pip install chatterbox-tts
```

## Usage

Because this is a fine-tune of the Turbo model, you must load the base architecture first and then apply the Kartoffelbox weights to the `t3` module.

```python
import torch
import torchaudio
from chatterbox.tts_turbo import ChatterboxTurboTTS
from huggingface_hub import hf_hub_download

# 1. Define Model Repository
MODEL_REPO = "SebastianBodza/Kartoffelbox_Turbo"
MODEL_FILENAME = "model.pt"

device = "cuda" if torch.cuda.is_available() else "cpu"

# 2. Load the Base Chatterbox-Turbo Model
print("Loading base Turbo model...")
model = ChatterboxTurboTTS.from_pretrained(device)

# 3. Download and Load the Fine-Tuned German Weights
print(f"Downloading weights from {MODEL_REPO}...")
checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILENAME)
checkpoint_state = torch.load(checkpoint_path, map_location=device)

# Clean and apply state dict to the t3 module.
# The "_orig_mod." key prefix is what torch.compile adds to a wrapped module's parameters.
cleaned_state_dict = {
    k.replace("_orig_mod.", ""): v for k, v in checkpoint_state.items()
}
model.t3.load_state_dict(cleaned_state_dict)
model.t3.eval()
print("✓ Kartoffel-Turbo weights loaded successfully.")

# 4. Generate Speech
text = "Elias blieb stehen. War es wirklich schon zehn Jahre her? Er musste leise lachen."

# You need a reference audio file (10-20s) for voice cloning
# Ensure the reference audio matches the tone you want
audio_prompt_path = "your_german_reference.wav"

wav = model.generate(
    text,
    audio_prompt_path=audio_prompt_path,
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.95
)

# 5. Save output
torchaudio.save("kartoffel_output.wav", wav.squeeze(0).cpu(), model.sr)
print("Saved to kartoffel_output.wav")
```

## Tips for Best Results

* **Reference Audio:** Use a clean, high-quality German reference clip (approx. 10-20 seconds). The model is zero-shot, so it will attempt to clone the voice provided.
* **Parameters:**
    * `temperature`: Controls randomness. `0.8` is a good default. Lower it for more stability, raise it for more variation.
    * `repetition_penalty`: If the model stutters, try increasing this slightly (e.g., `1.2`).

## Training Metrics

This model was an initial attempt at fine-tuning the Chatterbox Turbo architecture. Because the pipeline uses online voice cloning, the training process is computationally intensive.

Below are the plots for the training and validation loss before divergence:

![Train Speech Loss](https://huggingface.co/SebastianBodza/Kartoffelbox_Turbo/resolve/main/train.webp "Train Speech Loss")

![Validation Speech Loss](https://huggingface.co/SebastianBodza/Kartoffelbox_Turbo/resolve/main/validation.webp "Validation Speech Loss")

## Acknowledgements

* **Resemble AI** for the [Chatterbox-Turbo](https://github.com/resemble-ai/chatterbox) architecture.
* **FunAudioLLM** for CosyVoice.
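
## Note on the `_orig_mod.` Key Prefix

The Usage snippet removes `_orig_mod.` from the checkpoint keys before calling `load_state_dict`. That prefix is what `torch.compile` adds to the parameter names of a wrapped module, so a checkpoint saved from a compiled model will not match the key names of an uncompiled `t3` module until it is removed. The following is a minimal sketch of that step on a toy dictionary, so it can be tried without downloading any weights. The helper name `strip_compile_prefix` and the toy key names are assumptions made for illustration; they are not part of `chatterbox-tts`. Unlike the `replace` call in the Usage snippet, this variant only strips the prefix at the start of a key.

```python
from typing import Any, Dict

# Prefix that torch.compile adds to state-dict keys of a wrapped module.
COMPILE_PREFIX = "_orig_mod."


def strip_compile_prefix(state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `state` with a leading `_orig_mod.` removed from each key.

    Hypothetical helper for illustration; not part of chatterbox-tts.
    """
    cleaned = {}
    for key, value in state.items():
        if key.startswith(COMPILE_PREFIX):
            key = key[len(COMPILE_PREFIX):]
        cleaned[key] = value
    return cleaned


# Toy stand-in for a checkpoint; these key names are made up.
toy_state = {"_orig_mod.layer.weight": 1, "layer.bias": 2}
cleaned = strip_compile_prefix(toy_state)
print(sorted(cleaned))  # ['layer.bias', 'layer.weight']
```

If `load_state_dict` still reports missing or unexpected keys after cleaning, printing a few keys from both the checkpoint and `model.t3.state_dict()` is a quick way to see where the names differ.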