---
title: Qwen3-TTS Voice Clone
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
---

# Qwen3-TTS Voice Clone

A full-stack web application that extracts and packages the voice cloning functionality from Qwen3-TTS-12Hz-1.7B-Base. Clone any voice with just 3-10 seconds of reference audio through an intuitive web interface.

## Features

- **Fast Voice Cloning**: Generate high-quality voice clones from 3-10 second audio samples
- **Multi-language Support**: Supports 10+ languages including Chinese, English, Japanese, Korean
- **High Similarity**: Achieves up to 95% voice similarity with quality reference audio
- **Modern Web UI**: Professional React frontend with intuitive step-by-step workflow
- **Real-time Validation**: Instant feedback and smart error prevention
- **Docker Ready**: One-command deployment with automatic model download

## Demo

**Live Demo (CPU)**: [Hugging Face Spaces](https://huggingface.co/spaces/chienweichang/qwen3-tts-voice-clone-cpu)

> Note: The live demo runs in a CPU environment for accessibility. For faster generation, deploy locally with GPU support.

## Quick Start

### Using Docker (Recommended)

```bash
# Clone the repository
git clone https://github.com/ammosu/qwen3-tts-voice-clone.git
cd qwen3-tts-voice-clone

# Build and run with Docker
docker build -t qwen3-tts .
docker run -d -p 7860:7860 --name qwen3-tts qwen3-tts

# Alternatively, run with a local model (to avoid re-downloading)
docker run -d -p 7860:7860 \
  -v /path/to/models/Qwen3-TTS-12Hz-1.7B-Base:/app/models/Qwen3-TTS-12Hz-1.7B-Base:ro \
  --name qwen3-tts qwen3-tts
```

Access the application at http://localhost:7860

### Local Development

```bash
# Backend
cd backend
pip install -r requirements.txt
python -m uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Frontend (in another terminal)
cd frontend
yarn install
yarn dev
```

Access the frontend at http://localhost:3000 and the backend API at http://localhost:8000

## Usage

1. **Upload Reference Audio**: Upload a 3-10 second audio clip of the voice you want to clone
2. **Enter Reference Text**: Type the exact transcript of what's said in the audio
3. **Enter Target Text**: Type the text you want the cloned voice to say
4. **Select Language**: Choose the language for text-to-speech generation
5. **Generate**: Click "Generate Voice" and download the result

## Architecture

```
qwen3-tts-voice-clone/
├── backend/                 # FastAPI backend
│   ├── main.py              # API endpoints
│   ├── uploads/             # Temporary uploaded files
│   └── outputs/             # Generated audio files
├── frontend/                # React frontend
│   ├── src/
│   │   ├── App.tsx          # Main application
│   │   └── ...
│   └── dist/                # Built static files
├── models/                  # Model directory
│   └── Qwen3-TTS-12Hz-1.7B-Base/
├── Dockerfile               # Multi-stage Docker build
└── docker-entrypoint.sh     # Container startup script
```

## Technology Stack

### Frontend

- React 18 with TypeScript
- Tailwind CSS for styling
- Vite for build tooling
- Lucide React for icons

### Backend

- FastAPI for REST API
- Python 3.12+
- Qwen3-TTS for voice synthesis
- PyTorch for model inference
- Nginx for serving static files in production

### Deployment

- Docker multi-stage builds
- Nginx reverse proxy
- Automatic model download from Hugging Face Hub

## Performance

- **Processing Time**: 10-20 seconds per generation (CPU mode)
- **Memory Usage**: 8-12GB RAM required
- **Model Size**: ~4.3GB
- **Sample Rate**: 12kHz
- **RTF (Real-time Factor)**: ~0.5-0.6x

## API Documentation

When running locally, access API documentation at:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

### Key Endpoints

- `POST /upload` - Upload reference audio
- `POST /clone` - Generate cloned voice
- `GET /download/{audio_id}` - Download generated audio
- `GET /api/status` - Check service status

## Development

### Project Setup

```bash
# Install Python dependencies with uv
uv sync

# Install frontend dependencies
cd frontend && yarn install

# Download model (one-time setup)
./setup.sh
```

### Running Tests

```bash
# Test backend
python test_backend.py

# Test model import
python -c "from qwen_tts import Qwen3TTSModel; print('OK')"
```

## Configuration

### Environment Variables

Create a `.env` file in the project root:

```env
MODEL_PATH=models/Qwen3-TTS-12Hz-1.7B-Base
UPLOAD_DIR=backend/uploads
OUTPUT_DIR=backend/outputs
USE_CPU=false
CORS_ORIGINS=http://localhost:3000,http://localhost:5173
```

### Frontend Configuration

Set `VITE_API_URL` in `frontend/.env`:

```env
VITE_API_URL=http://localhost:8000
```

## Deployment

### Docker Deployment

The project includes a production-ready Dockerfile with:

- Multi-stage build for optimized image size
- Nginx serving the frontend and proxying requests to the backend
- Automatic model download on first run
- Health checks and logging

### Hugging Face Spaces

This project can be deployed directly to Hugging Face Spaces:

1. Create a new Space with Docker SDK
2. Push the repository
3. The Dockerfile will handle all setup automatically

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Acknowledgments

This project extracts and packages the voice cloning feature from:

- [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS) by Alibaba Qwen Team
- Model: [Qwen3-TTS-12Hz-1.7B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base)

All credit for the core TTS technology goes to the Qwen team.

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
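
## Example: Reading the Configuration

The environment variables listed under Configuration could be consumed along the following lines. This is only a sketch, not the project's actual loader: the `Settings` class and `load_settings` helper are hypothetical names, the defaults simply mirror the example `.env` file, and the set of strings accepted as "true" for `USE_CPU` is an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    model_path: Path
    upload_dir: Path
    output_dir: Path
    use_cpu: bool
    cors_origins: list[str]


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Fallback values mirror the example .env file; they are not confirmed defaults.
    origins = env.get("CORS_ORIGINS", "http://localhost:3000,http://localhost:5173")
    return Settings(
        model_path=Path(env.get("MODEL_PATH", "models/Qwen3-TTS-12Hz-1.7B-Base")),
        upload_dir=Path(env.get("UPLOAD_DIR", "backend/uploads")),
        output_dir=Path(env.get("OUTPUT_DIR", "backend/outputs")),
        # Assumption: "1", "true" and "yes" (case-insensitive) enable CPU mode.
        use_cpu=env.get("USE_CPU", "false").strip().lower() in {"1", "true", "yes"},
        # CORS_ORIGINS is a comma-separated list; drop blanks and stray spaces.
        cors_origins=[o.strip() for o in origins.split(",") if o.strip()],
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to exercise in isolation, for example `load_settings({"USE_CPU": "true"})`.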
## Support

- Issues: [GitHub Issues](https://github.com/ammosu/qwen3-tts-voice-clone/issues)
- Discussions: [GitHub Discussions](https://github.com/ammosu/qwen3-tts-voice-clone/discussions)

## Changelog

See [CLAUDE.md](CLAUDE.md) for detailed project documentation and development notes.
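
## Example: Calling the API from Python

For scripting against a locally running backend, the two `GET` endpoints listed under Key Endpoints can be reached with the standard library alone. The sketch below assumes the backend is at `http://localhost:8000` as in Local Development; the helper names are hypothetical, and the response format of `/api/status` is not documented here, so the body is returned as raw text. The request bodies for `POST /upload` and `POST /clone` are also not specified in this README, so they are left out; the Swagger UI at `/docs` should describe them.

```python
import urllib.request
from urllib.parse import quote

# Assumption: backend address from the Local Development section.
BASE_URL = "http://localhost:8000"


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def download_url(audio_id: str, base_url: str = BASE_URL) -> str:
    """URL for GET /download/{audio_id}; the id is percent-encoded."""
    return endpoint(f"/download/{quote(audio_id, safe='')}", base_url)


def service_status(base_url: str = BASE_URL, timeout: float = 10.0) -> str:
    """Return the raw body of GET /api/status (no response schema assumed)."""
    url = endpoint("/api/status", base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def download_audio(audio_id: str, dest: str, base_url: str = BASE_URL,
                   timeout: float = 60.0) -> int:
    """Save generated audio to dest and return the number of bytes written."""
    with urllib.request.urlopen(download_url(audio_id, base_url), timeout=timeout) as resp:
        data = resp.read()
    with open(dest, "wb") as fh:
        fh.write(data)
    return len(data)
```

With the backend running, `service_status()` is a quick way to confirm the service is reachable before `download_audio(audio_id, "result.wav")` fetches a finished clip. The `audio_id` is presumably the identifier handed back by `POST /clone`, and the output filename is only illustrative.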