---
title: Boovore - Multi-Engine TTS Studio
emoji: 🎙
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# 🎙 Boovore - Multi-Engine TTS Studio

**Boovore** is a self-hosted, GPU-accelerated text-to-speech studio with 6 best-in-class engines and a built-in audiobook generator. Run it on any CUDA machine (tested on an RTX 3090) through a clean, dark-mode web UI.

> **Name**: Boovore = *Book* + *Devour*, built to devour books in audio.

![Boovore UI](screenshot.png)

---

## ✨ Engines

| Engine | Quality | Speed | Highlights |
|---|---|---|---|
| **Kokoro FR** | ★★★★ | ⚡⚡⚡ | Native French voices |
| **Chatterbox** | ★★★★★ | ⚡⚡ | Zero-shot voice cloning (ResembleAI) |
| **F5-TTS** | ★★★★ | ⚡⚡ | French voice cloning |
| **Fish-Speech 1.5** | ★★★★★ | ⚡⚡ | Multilingual voice cloning (fishaudio) |
| **Qwen3-TTS** | ★★★★★ | ⚡ | Clone · Custom · Voice Design |
| **Voxtral 4B** | ★★★★★ | ⚡⚡ | French-first, 68% win rate vs ElevenLabs (Mistral AI) |

> **Voxtral** uses vLLM-Omni (`mistralai/Voxtral-4B-TTS-2603`) with voice cloning via a reference WAV. Start it separately with `python3 voxtral_server.py`.

---

## ⚙️ CPU / GPU: HuggingFace Space Settings

In your Space → **Settings → Variables and secrets**, set:

| `ENABLED_ENGINES` | Hardware | Engines available |
|---|---|---|
| `kokoro,f5` | CPU (free tier) | Kokoro · F5-TTS |
| `kokoro,f5,chatterbox` | GPU T4 (~6 GB) | + Chatterbox |
| `all` | GPU A10G / A100 | All engines + Qwen3 |

> The default is `all`. On the free CPU tier, set `kokoro,f5` to avoid crashes.

For **Voxtral**, also set `VOXTRAL_URL` to point to your vLLM-Omni server (default: `http://localhost:8000`).

---

## 🚀 Quick Start (Vast.ai / GPU server)

### 1. Install dependencies

```bash
# PyTorch nightly CUDA 12.8 (required)
pip3 install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

# Register torch libs so torchaudio can find libtorch
echo "/usr/local/lib/python3.12/dist-packages/torch/lib" > /etc/ld.so.conf.d/torch.conf && ldconfig

# Core engines (the extra is quoted so the shell does not treat [] as a glob)
pip3 install faster-qwen3-tts kokoro f5-tts fastapi "uvicorn[standard]" python-multipart

# Chatterbox (Python 3.12 fix)
pip3 install conformer==0.3.2 --no-build-isolation
git clone https://github.com/resemble-ai/chatterbox /tmp/chatterbox
cd /tmp/chatterbox && pip3 install -e . --no-deps && cd /root

# Fish-Speech 1.5
git clone https://github.com/fishaudio/fish-speech /tmp/fish-speech
cd /tmp/fish-speech && git checkout v1.5.1
pip3 install -e . --no-deps
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir /root/fish-speech-model
```

### 2. (Optional) Start Voxtral TTS server

Voxtral requires a separate vLLM-Omni process (~8 GB VRAM). It needs a HuggingFace token; accept the CC BY-NC license at [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) first.

```bash
pip3 install "vllm[audio]>=0.18.0" httpx soundfile
export HF_TOKEN=hf_xxxx
nohup python3 voxtral_server.py >> /root/voxtral.log 2>&1 &
# Wait 5-10 min for model download + load (first run only)
```

Optionally generate a narrator reference WAV (for voice cloning):

```bash
# While the Qwen3 server is running:
python3 make_narrator_reference.py
# Output: /workspace/narrator_reference.wav
```

### 3. Start the main server

```bash
nohup python3 server.py --port 7860 >> /root/server.log 2>&1 &
```

### 4. Open the UI

```bash
# Local SSH tunnel (replace <port> and <host> with your instance's SSH port and address)
ssh -p <port> root@<host> -L 7860:localhost:7860 -N
# Then open http://localhost:7860
```

---

## 📖 Features

- **TTS Studio**: one-click engine selector (8 pills), single generate button
- **Audiobook Generator**: import `.txt` / `.pdf` / `.epub`, auto-detect chapters, batch generate with any engine, download per chapter or merge into one WAV
- **Voice Cloning**: upload a reference audio clip (Chatterbox, F5-TTS, Fish-Speech, Qwen3)
- **Real-time metrics**: TTFA, RTF, duration, buffer
- **Light / dark theme**
- **Streaming audio** (Qwen3) with CUDA Graphs

---

## 🗂 Project Structure

```
server.py                   - FastAPI backend (6 engines)
index.html                  - Single-page UI (vanilla JS, no frontend deps)
voxtral_server.py           - vLLM-Omni server manager (start/stop/status)
make_narrator_reference.py  - Generate narrator reference WAV via Qwen3
narrator_reference.wav      - (generated) voice clone reference for Voxtral
requirements.txt
Dockerfile
```

---

## ⚙️ Requirements

- Python 3.12+
- CUDA 12.8 (RTX 3090 or better recommended)
- PyTorch nightly cu128 (`2.12.0.dev+`)
- VRAM: 8 GB minimum, 24 GB to run all engines simultaneously

---

## 📦 Models (auto-downloaded)

| Model | Size | Engine |
|---|---|---|
| `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | ~1.2 GB | Qwen3-TTS |
| `hexgrad/Kokoro-82M` | ~300 MB | Kokoro FR |
| `SWivid/F5-TTS` | ~1.2 GB | F5-TTS |
| `resemble-ai/chatterbox` | ~1.5 GB | Chatterbox |
| `fishaudio/fish-speech-1.5` | ~1.4 GB | Fish-Speech |
| `mistralai/Voxtral-4B-TTS-2603` | ~8 GB (BF16) | Voxtral (gated, HF token required) |

---

## 🏷️ GitHub Topics

`text-to-speech` `tts` `voice-cloning` `audiobook` `french-tts` `kokoro` `f5-tts` `fish-speech` `chatterbox` `qwen3` `voxtral` `mistral` `vllm` `fastapi` `cuda` `self-hosted` `gpu` `french` `multilingual`

---

## Credits

- [faster-qwen3-tts](https://github.com/huggingfaceM4/faster-qwen3-tts): Qwen3-TTS engine
- [Fish-Speech](https://github.com/fishaudio/fish-speech): fishaudio
- [Chatterbox](https://github.com/resemble-ai/chatterbox): ResembleAI
- [F5-TTS](https://github.com/SWivid/F5-TTS): SWivid
- [Kokoro](https://github.com/hexgrad/kokoro): hexgrad
- [Voxtral](https://mistral.ai): Mistral AI (`mistralai/Voxtral-4B-TTS-2603`, CC BY-NC)
- French prosody preprocessing inspired by [arXiv:2508.17494](https://arxiv.org/abs/2508.17494)

---

MIT License
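
The `ENABLED_ENGINES` variable described in the Space settings section takes either `all` or a comma-separated list of engines. The sketch below shows one way such a value could be parsed; it is an illustration, not code taken from `server.py`. The function name `parse_enabled_engines` and the identifiers `fish`, `qwen3` and `voxtral` are assumptions, since only `kokoro`, `f5`, `chatterbox` and `all` appear in this README.

```python
import os
from typing import List, Optional

# Assumed engine identifiers: only `kokoro`, `f5` and `chatterbox` are
# documented in this README; `fish`, `qwen3` and `voxtral` are hypothetical.
KNOWN_ENGINES = ("kokoro", "f5", "chatterbox", "fish", "qwen3", "voxtral")


def parse_enabled_engines(raw: Optional[str]) -> List[str]:
    """Turn an ENABLED_ENGINES value into an ordered list of engine names."""
    # An unset or blank variable falls back to the documented default, `all`.
    value = (raw or "").strip().lower() or "all"
    if value == "all":
        return list(KNOWN_ENGINES)

    # Split on commas, ignoring surrounding whitespace and empty items.
    requested = [name.strip() for name in value.split(",") if name.strip()]

    # Fail early on typos instead of silently skipping an engine.
    unknown = [name for name in requested if name not in KNOWN_ENGINES]
    if unknown:
        raise ValueError(f"Unknown engine(s) in ENABLED_ENGINES: {unknown}")

    # Return engines in a fixed order, without duplicates.
    return [name for name in KNOWN_ENGINES if name in requested]


if __name__ == "__main__":
    print(parse_enabled_engines(os.environ.get("ENABLED_ENGINES")))
```

With `ENABLED_ENGINES=kokoro,f5` this returns `['kokoro', 'f5']`. Rejecting unknown names up front is a design choice of this sketch: a misspelled engine then surfaces at startup rather than as a missing pill in the UI.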