CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
Full-stack Qwen3-TTS voice cloning application with:
- CLI tools (voice_clone.py, quick_clone.py) for command-line usage
- FastAPI backend (backend/) providing REST API
- React frontend (frontend/) with responsive design and clean UI
- Docker deployment ready for Hugging Face Spaces and local hosting
- Qwen3-TTS-12Hz-1.7B-Base model for voice synthesis (voice similarity up to 0.95 with a clean reference)
The project clones a voice from 3-10 seconds of reference audio and synthesizes speech in 10 languages (Chinese, English, Japanese, Korean, etc.).
Deployment Options:
- Local development (separate backend + frontend)
- Local Docker (unified container with Nginx)
- Hugging Face Spaces (public Docker deployment with automatic model download)
System Requirements
Local Development
- Python: 3.12+ (managed via uv)
- Node.js: v18+ (frontend uses Yarn)
- GPU (optional): NVIDIA GPU with CUDA 11.8+ (uses ~4GB VRAM)
- CPU mode: Works without GPU (8-12GB RAM, slower generation)
- Model: Qwen3-TTS-12Hz-1.7B-Base (~4.3 GB, auto-downloaded or symlinked)
Docker Deployment
- Docker: 20.10+
- RAM: 8-12GB minimum (for model loading)
- Disk: ~10GB (model + dependencies)
- Network: For Hugging Face model download on first run
Development Commands
Initial Setup
# Install Python dependencies
uv sync
# Install frontend dependencies
cd frontend && yarn install
# Download/link model (choose one)
./setup.sh # Full setup with model download
./link_model.sh # Link to existing model at /path/to/model
Running the Full Stack
Backend (Terminal 1):
./backend/start_server.sh
# or manually:
uv run uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
Frontend (Terminal 2):
cd frontend
yarn dev # Local only (localhost:3000)
yarn dev --host # Network accessible (0.0.0.0:3000)
Access:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
CLI Tools (Alternative to Web UI)
# Interactive mode
uv run python voice_clone.py
# Quick test with predefined config
uv run python quick_clone.py
Testing
# Test backend configuration
uv run python test_backend.py
# Test model import
uv run python -c "from qwen_tts import Qwen3TTSModel; print('Model import successful')"
# Test API endpoints
curl http://localhost:8000/api/status
Building for Production
# Build frontend
cd frontend
yarn build # Output: frontend/dist/
# Frontend build can be served with any static file server
Docker Deployment
Local Docker (Unified Container)
The project includes a complete Docker setup with Nginx serving both frontend and backend on port 7860.
Build and run:
# Build image
docker build -f Dockerfile -t qwen3-tts-hf .
# Run with model volume (avoids re-downloading)
docker run -d -p 7860:7860 \
-v /path/to/models/Qwen3-TTS-12Hz-1.7B-Base:/app/models/Qwen3-TTS-12Hz-1.7B-Base:ro \
--name qwen3-tts qwen3-tts-hf
# Or run without volume (auto-downloads model on first run)
docker run -d -p 7860:7860 --name qwen3-tts qwen3-tts-hf
Access:
- Frontend: http://localhost:7860
- Backend API: http://localhost:7860/api
Architecture:
- Nginx serves frontend from /app/frontend/dist
- Nginx reverse-proxies /api/* to backend on port 8000
- Backend runs with uvicorn on 127.0.0.1:8000
- Model auto-downloads from Hugging Face Hub if not present
Hugging Face Spaces Deployment
The project is ready for deployment to Hugging Face Spaces with Docker SDK.
Files for HF Spaces:
- Dockerfile - Multi-stage build (frontend + backend)
- docker-entrypoint.sh - Startup script with automatic model download
- .dockerignore - Excludes unnecessary files from build
- README-HF.md - Hugging Face Spaces documentation
Deploy to HF Spaces:
# 1. Create Space at https://huggingface.co/new-space
# - Choose Docker SDK
# - Select CPU basic (free) or CPU upgrade (16GB RAM)
# 2. Push to HF Spaces
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/SPACE_NAME
git push hf main
# 3. HF Spaces will automatically:
# - Build Docker image (~5-10 min)
# - Download model on first run (~5 min)
# - Start service on port 7860
Frontend Environment Detection: The frontend automatically detects Hugging Face Spaces environment and adjusts API URLs accordingly:
const API_URL = import.meta.env.VITE_API_URL || (
window.location.hostname.includes('hf.space') || window.location.hostname.includes('huggingface.co')
? '' // HF Spaces: relative path (Nginx proxy)
: 'http://localhost:8000' // Local development
)
Docker Build Process:
- Stage 1 (frontend-builder): Builds React app with yarn build
- Stage 2 (production):
  - Installs Python dependencies (CPU-only PyTorch)
  - Copies built frontend to /app/frontend/dist
  - Configures Nginx for unified serving
  - Sets up model auto-download via docker-entrypoint.sh
Architecture
Architecture Diagrams
Local Development (Separate Servers):
┌─────────────────────────────────────────┐
│ Frontend (Vite Dev Server)              │
│ - Port 3000 or 5173                     │
│ - Hot Module Replacement                │
└──────────────┬──────────────────────────┘
               │ HTTP/REST (CORS)
┌──────────────▼──────────────────────────┐
│ Backend (Uvicorn)                       │
│ - Port 8000                             │
│ - FastAPI + CORS middleware             │
└──────────────┬──────────────────────────┘
               │ Python API
┌──────────────▼──────────────────────────┐
│ Qwen3-TTS Model                         │
│ - GPU/CPU auto-detection                │
│ - FlashAttention 2 (optional)           │
└─────────────────────────────────────────┘
Docker Deployment (Unified Container):
┌─────────────────────────────────────────┐
│ Nginx (Port 7860)                       │
│ - Serves frontend static files          │
│ - Reverse proxy /api → backend          │
└──────┬──────────────────────┬───────────┘
       │                      │
       │ Static Files         │ Proxy /api/*
       │                      │
┌──────▼──────────┐    ┌──────▼──────────┐
│ Frontend Dist   │    │ Backend         │
│ /app/frontend/  │    │ 127.0.0.1:8000  │
│ dist/           │    │ (Uvicorn)       │
└─────────────────┘    └──────┬──────────┘
                              │
                       ┌──────▼──────────┐
                       │ Qwen3-TTS       │
                       │ CPU-only mode   │
                       │ /app/models/    │
                       └─────────────────┘
Backend API (backend/main.py)
FastAPI application with 7 endpoints:
- GET / - API info (version, model status, device)
- GET /api/status - Service status (ready/loading/error)
- POST /api/upload - Upload reference audio (returns audio_id)
- POST /api/clone - Generate cloned voice (requires ref_audio_id, ref_text, target_text)
- GET /api/download/{audio_id} - Download generated audio
- DELETE /api/audio/{audio_id} - Delete uploaded/generated audio
- GET /api/cleanup - Clean up old files (default: >24h)
Key Backend Patterns:
- Model singleton: Global model instance loaded once at startup
- UUID-based file management: All uploads/outputs use UUID filenames
- Automatic directory creation: backend/uploads/ and backend/outputs/ created on startup
- CORS: Pre-configured for localhost:3000 and localhost:5173 (Vite)
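The UUID-based file management and automatic directory creation described above could look roughly like the sketch below. The helper name `allocate_audio_path` and its signature are hypothetical and are not taken from backend/main.py; only the behavior (UUID filenames, directories created on demand) comes from this document.

```python
import uuid
from pathlib import Path


def allocate_audio_path(directory: Path, original_name: str) -> tuple[str, Path]:
    """Return (audio_id, path) for a new file stored under a UUID-based name.

    Hypothetical helper: illustrates the pattern, not the backend's actual code.
    """
    # Create the target directory if it does not exist yet.
    directory.mkdir(parents=True, exist_ok=True)
    audio_id = str(uuid.uuid4())
    # Keep only the extension of the client-supplied name, so user input
    # never becomes part of the stored path.
    suffix = Path(original_name).suffix.lower() or ".wav"
    return audio_id, directory / f"{audio_id}{suffix}"
```

Deriving the stored name from the UUID alone means a client-supplied filename such as `../../voice.wav` cannot influence where the file is written.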
API Request Example:
import requests

# 1. Upload reference audio
with open('voice.wav', 'rb') as f:
    response = requests.post('http://localhost:8000/api/upload', files={'file': f})
audio_id = response.json()['audio_id']
# 2. Generate cloned voice
response = requests.post('http://localhost:8000/api/clone', json={
    'ref_audio_id': audio_id,
    'ref_text': '參考音訊中的文字',  # transcript of the reference audio
    'target_text': '要生成的新文字',  # new text to synthesize
    'language': 'Chinese',
    'x_vector_only': False
})
output_id = response.json()['audio_id']
# 3. Download result
response = requests.get(f'http://localhost:8000/api/download/{output_id}')
with open('output.wav', 'wb') as f:
    f.write(response.content)
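Because the model is loaded at startup, a client may want to wait until GET /api/status reports that the service is ready before uploading. The following is a minimal polling sketch using only the standard library. It assumes the response is JSON with a top-level "status" field holding "ready", "loading", or "error"; the field name is an assumption, and both function names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status(base_url: str = "http://localhost:8000") -> str:
    """Read the service state from GET /api/status.

    Assumption: the JSON body has a top-level "status" field. Adjust this
    to the real response shape of the backend.
    """
    with urllib.request.urlopen(f"{base_url}/api/status", timeout=5) as resp:
        return json.load(resp).get("status", "error")


def wait_until_ready(
    get_status: Callable[[], str] = fetch_status,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll until the service reports "ready"; return False on "error" or timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_status()
        if state == "ready":
            return True
        if state == "error":
            return False
        time.sleep(interval_s)
    return False
```

Passing the status reader as a callable keeps the polling loop testable without a running server.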
Frontend (frontend/src/App.tsx)
Modern single-page React application with responsive design and clean UI:
Features:
- Responsive layout (mobile/tablet/desktop with Tailwind breakpoints)
- Increased font sizes for better readability
- File upload via drag-and-drop or click
- Audio preview for uploaded reference audio
- Real-time status updates during generation
- Clean interface with non-functional links removed
UI Improvements (Latest):
- Larger fonts across all components (base, lg, xl sizes)
- Responsive container with max-w-7xl and flexible columns
- Two-column layout on desktop, single column on mobile
- Simplified navigation and footer (removed dummy links)
- Enhanced spacing and padding for better UX
API Integration:
- Environment-aware API_URL configuration
- Automatic detection of HF Spaces vs local development
- Uses native fetch() for all API calls
- Blob URL management for audio preview with cleanup
Key Frontend Patterns:
- useState hooks: Form state (refAudioId, refText, targetText, language, etc.)
- useRef hooks: File input, audio players (ref + generated)
- useEffect hooks: Blob URL cleanup to prevent memory leaks
- Event handlers: handleFileSelect, handleGenerate, handleDrop
- Conditional rendering: Upload status, loading states, audio players
CLI Tools (voice_clone.py, quick_clone.py)
Direct Python scripts that bypass the web stack:
- Load model directly
- Read local files from reference_audios/
- Write output to outputs/
- Useful for batch processing or server-side automation
voice_clone.py modes:
- interactive_mode(): User prompts for audio selection, text input, language
- batch_mode(): Generate multiple texts with same voice prompt
Voice Cloning Workflow
Core Workflow (All Interfaces)
1. Create voice clone prompt from reference audio + text:
   voice_clone_prompt = model.create_voice_clone_prompt(
       ref_audio="path/to/audio.wav",
       ref_text="transcript of the audio",
       x_vector_only_mode=False,  # True = no ref_text needed but lower quality
   )
2. Generate cloned voice:
   wavs, sr = model.generate_voice_clone(
       text="text to synthesize",
       language="Chinese",  # or "English", "Japanese", "Korean"
       voice_clone_prompt=voice_clone_prompt,
   )
3. Save output:
   import soundfile as sf
   sf.write("output.wav", wavs[0], sr)
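When several texts share one reference voice, the prompt from the first step only needs to be built once. The sketch below wraps the first two steps in a hypothetical helper, `clone_many`. It assumes `model` exposes the two methods shown in the workflow with those keyword arguments, and it leaves saving to the caller.

```python
def clone_many(model, ref_audio, ref_text, texts, language="Chinese"):
    """Build the voice clone prompt once and reuse it for every text.

    Hypothetical helper: `model` is expected to provide
    create_voice_clone_prompt() and generate_voice_clone() as used in the
    core workflow. Returns a list of (waveform, sample_rate) pairs.
    """
    prompt = model.create_voice_clone_prompt(
        ref_audio=ref_audio,
        ref_text=ref_text,
        x_vector_only_mode=False,
    )
    results = []
    for text in texts:
        wavs, sr = model.generate_voice_clone(
            text=text,
            language=language,
            voice_clone_prompt=prompt,
        )
        # Keep the first waveform of each call, matching the save step.
        results.append((wavs[0], sr))
    return results
```

Each returned pair can then be written with `sf.write(path, wav, sr)` as in the save step.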
Model Loading Pattern
All components use consistent model loading with FlashAttention 2 fallback:
import torch
from qwen_tts import Qwen3TTSModel

try:
    model = Qwen3TTSModel.from_pretrained(
        "models/Qwen3-TTS-12Hz-1.7B-Base",
        device_map="cuda:0" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
except Exception:
    # Fall back to standard attention if FlashAttention 2 is unavailable
    model = Qwen3TTSModel.from_pretrained(
        "models/Qwen3-TTS-12Hz-1.7B-Base",
        device_map="cuda:0" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
    )
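The device choice in the loading pattern can be separated from the model call so that it is easy to test and to combine with the USE_CPU environment variable. This is a sketch under assumptions: how backend/main.py actually combines USE_CPU with CUDA detection is not shown in this document, and both function names are hypothetical.

```python
import os
from typing import Mapping


def use_cpu_from_env(env: Mapping[str, str] = os.environ) -> bool:
    """Interpret USE_CPU the same way as the backend configuration: "true" forces CPU."""
    return env.get("USE_CPU", "false").lower() == "true"


def select_device(use_cpu: bool, cuda_available: bool) -> str:
    """Return a device_map value: forced CPU wins, otherwise CUDA when present."""
    if use_cpu or not cuda_available:
        return "cpu"
    return "cuda:0"
```

A caller would pass `torch.cuda.is_available()` as `cuda_available` and hand the result to `from_pretrained(..., device_map=...)`.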
Directory Structure
qwen3clone/
├── backend/                  # FastAPI backend
│   ├── main.py               # API endpoints and model loading
│   ├── start_server.sh       # Backend startup script
│   ├── uploads/              # Temporary uploaded reference audio (created at runtime)
│   └── outputs/              # Generated audio files (created at runtime)
│
├── frontend/                 # React frontend
│   ├── src/
│   │   ├── App.tsx           # Main application (responsive UI, all logic)
│   │   ├── main.tsx          # React entry point
│   │   ├── index.css         # Global styles (Tailwind directives)
│   │   └── vite-env.d.ts     # Vite environment types
│   ├── package.json          # Frontend dependencies
│   ├── tailwind.config.js    # Tailwind color theme (custom palette)
│   ├── vite.config.ts        # Vite dev server config
│   └── dist/                 # Built static files (created by yarn build)
│
├── models/                   # TTS model directory
│   └── Qwen3-TTS-12Hz-1.7B-Base/  # Symlink or actual model files
│
├── reference_audios/         # Input: 3-10s reference audio for CLI tools
├── outputs/                  # Output: CLI-generated .wav files
│
├── voice_clone.py            # CLI interactive tool
├── quick_clone.py            # CLI quick test script
├── test_backend.py           # Backend configuration tests
│
├── Dockerfile                # Docker multi-stage build for HF Spaces
├── docker-entrypoint.sh      # Container startup script (auto model download)
├── .dockerignore             # Docker build exclusions
│
├── README-HF.md              # Hugging Face Spaces documentation
├── CLAUDE.md                 # This file (project guidance for Claude Code)
├── setup.sh                  # Full environment + model setup
├── link_model.sh             # Link to existing model
└── pyproject.toml            # Python dependencies (uv)
Configuration
Backend Configuration (backend/main.py)
import os
from pathlib import Path

# Environment variable support (with defaults)
MODEL_PATH = os.getenv("MODEL_PATH", "models/Qwen3-TTS-12Hz-1.7B-Base")
UPLOAD_DIR = Path(os.getenv("UPLOAD_DIR", "backend/uploads"))
OUTPUT_DIR = Path(os.getenv("OUTPUT_DIR", "backend/outputs"))
USE_CPU = os.getenv("USE_CPU", "false").lower() == "true"
# CORS origins (environment variable or defaults)
cors_origins = os.getenv(
    "CORS_ORIGINS",
    "http://localhost:3000,http://localhost:5173,http://localhost"
).split(",")
Docker Environment Variables:
- MODEL_PATH: Model directory path (default: /app/models/Qwen3-TTS-12Hz-1.7B-Base)
- UPLOAD_DIR: Upload directory (default: /app/backend/uploads)
- OUTPUT_DIR: Output directory (default: /app/backend/outputs)
- USE_CPU: Force CPU mode (default: true in Docker, auto-detect in dev)
- CORS_ORIGINS: Comma-separated allowed origins
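A plain `.split(",")` keeps any whitespace around the commas, so a value such as `"http://a, http://b"` yields an origin with a leading space that will not match a browser's Origin header. A slightly more tolerant parser is sketched below; `parse_cors_origins` is a hypothetical helper, not a function from backend/main.py, and the default string is copied from the configuration in this section.

```python
import os
from typing import Mapping

DEFAULT_ORIGINS = "http://localhost:3000,http://localhost:5173,http://localhost"


def parse_cors_origins(env: Mapping[str, str] = os.environ) -> list[str]:
    """Split CORS_ORIGINS on commas, dropping surrounding whitespace and empty entries."""
    raw = env.get("CORS_ORIGINS", DEFAULT_ORIGINS)
    return [origin.strip() for origin in raw.split(",") if origin.strip()]
```

For example, `parse_cors_origins({"CORS_ORIGINS": "http://10.0.0.85:3000, http://localhost"})` returns both origins without stray spaces.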
Frontend Configuration (frontend/src/App.tsx)
// Environment-aware configuration (automatic)
const API_URL = import.meta.env.VITE_API_URL || (
window.location.hostname.includes('hf.space') || window.location.hostname.includes('huggingface.co')
? '' // HF Spaces: use relative path
: 'http://localhost:8000' // Local development
)
Configuration Methods:
- Local Development: Uses http://localhost:8000 by default
- Docker/HF Spaces: Set VITE_API_URL=/api in Dockerfile (already configured)
- Custom Network: Set environment variable VITE_API_URL=http://your-server:8000
For network deployment:
- Add the frontend URL to the backend CORS allow_origins
- Set the VITE_API_URL environment variable or update the API_URL constant
- Run the frontend with yarn dev --host to bind to 0.0.0.0
CLI Configuration (quick_clone.py)
Edit variables at top of file:
REF_AUDIO = "reference_audios/ref_audio.wav"
REF_TEXT = "參考音訊的完整內容"  # full transcript of the reference audio
LANGUAGE = "Chinese"
TEST_TEXTS = ["要生成的第一句", "要生成的第二句"]  # first and second sentences to generate
Reference Audio Requirements
- Duration: 3-10 seconds (optimal balance of features vs noise)
- Content: Single-speaker, clear speech, minimal background noise
- Format: WAV, MP3, or FLAC
- Transcript: ref_text must exactly match spoken content for best quality
- Quality impact: Clean audio + accurate transcript = up to 0.95 similarity
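The duration requirement can be checked before a file is sent for cloning. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only; MP3 and FLAC would need an audio library such as soundfile. The function names are hypothetical and the 3-10 second bounds are the ones listed above.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 10.0  # recommended reference duration


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference(path: str) -> bool:
    """True when the WAV duration falls inside the recommended 3-10 s range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

This checks length only; speaker count, noise level, and transcript accuracy still have to be verified by listening.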
Performance Metrics
- RTF (real-time factor): ~0.5-0.6 (about 1s of compute per 2s of generated audio)
- Sample rate: 12kHz
- Voice similarity: Up to 0.95 with quality reference
- GPU memory: ~4GB VRAM
- Startup time: ~5-10s (model loading)
- Supported languages: Chinese, English, Japanese, Korean, + 6 more
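The real-time factor is generation time divided by the duration of the audio produced, so values below 1 mean faster than real time. A small sketch of the arithmetic follows; the function names are illustrative, and the default of 0.5 is simply the lower figure quoted above.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimated_generation_seconds(audio_seconds: float, rtf: float = 0.5) -> float:
    """Estimate how long generating `audio_seconds` of speech takes at a given RTF."""
    return audio_seconds * rtf
```

With the figures above, 2 s of audio generated in 1 s gives an RTF of 0.5, and 10 s of audio at an RTF of 0.6 would take roughly 6 s.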
Key Implementation Details
CORS and Network Access
Backend CORS is pre-configured for local development. For network deployment:
1. Update backend allow_origins in main.py:
   allow_origins=["http://10.0.0.85:3000"]  # Your server IP
2. Update frontend API_URL in App.tsx:
   const API_URL = 'http://10.0.0.85:8000'
3. Start backend with --host 0.0.0.0 (already default in start_server.sh)
4. Start frontend with yarn dev --host to expose on network
File Cleanup
Generated files persist indefinitely. Use cleanup endpoint or cron job:
# Manual cleanup via API
curl "http://localhost:8000/api/cleanup?max_age_hours=24"
# Or delete directories
rm -rf backend/uploads/* backend/outputs/*
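For the cron-job route, an age-based cleanup could look like the sketch below. It assumes file age is judged by modification time; whether the /api/cleanup endpoint uses the same criterion is not stated in this document, and `remove_old_files` is a hypothetical name.

```python
import time
from pathlib import Path


def remove_old_files(directory: Path, max_age_hours: float = 24.0) -> int:
    """Delete regular files in `directory` older than `max_age_hours`; return the count.

    Age is measured from each file's modification time (an assumption).
    Subdirectories are left untouched.
    """
    cutoff = time.time() - max_age_hours * 3600
    removed = 0
    for path in directory.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```

Unlike `rm -rf`, this keeps recent files, so audio that a user has just generated is not removed mid-session.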
Model Symlink vs Download
Two options for model setup:
- Download (setup.sh): Downloads the ~4.3GB model to models/
- Symlink (link_model.sh): Links to existing model elsewhere
  - Useful if model already downloaded in another project
  - Example: Links to /home/user/models/Qwen3-TTS-12Hz-1.7B-Base
FlashAttention 2 Behavior
- Automatically attempts to load FlashAttention 2 for faster inference
- Gracefully falls back to standard attention if unavailable
- No code changes needed - handled transparently
- Setup script installs flash-attn but may fail on some systems
Output Naming Conventions
- Backend API: UUID-based (a1b2c3d4-...-xyz.wav)
- CLI voice_clone.py: {ref_audio_stem}_clone_{count:03d}.wav
- CLI quick_clone.py: clone_{count:02d}.wav
- CLI batch_mode: batch_{count:03d}.wav
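The CLI patterns above are ordinary Python format specifications: `03d` zero-pads the counter to three digits and `02d` to two. A short sketch with hypothetical helper names follows; whether the scripts start counting at 0 or 1 is not stated here.

```python
from pathlib import Path


def voice_clone_name(ref_audio: str, count: int) -> str:
    """voice_clone.py pattern: {ref_audio_stem}_clone_{count:03d}.wav"""
    return f"{Path(ref_audio).stem}_clone_{count:03d}.wav"


def quick_clone_name(count: int) -> str:
    """quick_clone.py pattern: clone_{count:02d}.wav"""
    return f"clone_{count:02d}.wav"


def batch_name(count: int) -> str:
    """batch_mode pattern: batch_{count:03d}.wav"""
    return f"batch_{count:03d}.wav"
```

For instance, `voice_clone_name("reference_audios/ref_audio.wav", 1)` yields `ref_audio_clone_001.wav`.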
Dependencies
Python (pyproject.toml)
- qwen-tts: Core TTS library
- torch>=2.0.0: Deep learning framework
- fastapi>=0.109.0: Web framework
- uvicorn[standard]>=0.27.0: ASGI server
- python-multipart>=0.0.6: File upload support
- soundfile: Audio I/O
- flash-attn (optional): Accelerated attention
Frontend (package.json)
- react + react-dom: UI framework
- lucide-react: Icon library
- typescript: Type safety
- vite: Build tool and dev server
- tailwindcss: Utility-first CSS
Recent Improvements
Frontend UI Enhancements (Latest)
Increased Font Sizes:
- Navigation: text-xl (from text-lg)
- Hero title: text-5xl md:text-6xl (from text-5xl)
- Descriptions: text-xl (from text-lg)
- Buttons: text-lg (from text-[15px])
- Form labels: text-base (from text-[13px])
- Input/textarea: text-base (from text-[13px])
- Status messages: text-base (from text-[13px])
Responsive Design:
- Removed fixed width w-[1440px]
- Added responsive container: max-w-7xl mx-auto
- Two-column layout on large screens: lg:w-1/2 for each panel
- Mobile-first with breakpoints: md:, lg: prefixes
- Responsive padding: px-6 md:px-12 lg:px-24
Cleaned Interface:
- Removed non-functional navigation links (API docs, GitHub)
- Removed dummy footer links (features, pricing, tutorials)
- Removed social media icons (Twitter, LinkedIn, GitHub)
- Removed "Model Size" dropdown (non-functional)
- Removed redundant "Choose File" button (upload area is clickable)
- Simplified footer to logo + copyright only
Audio Preview:
- Added reference audio preview with play controls
- Blob URL management with proper cleanup (useEffect)
- Prevents memory leaks from unreleased object URLs
Troubleshooting
Docker Issues
Container won't start:
# Check logs
docker logs qwen3-tts
# Common issues:
# 1. Port 7860 already in use
docker ps | grep 7860
# 2. Model download failed (network issue)
# 3. Insufficient memory (need 8-12GB RAM)
Model not downloading:
- Check internet connection
- Verify Hugging Face Hub is accessible
- Try manual download: huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base
Frontend shows 404 on API calls:
- Verify the Nginx configuration is valid: docker exec qwen3-tts nginx -t
- Check backend is healthy: docker exec qwen3-tts curl http://127.0.0.1:8000/api/status
- Review API_URL configuration in frontend
Local Development Issues
Backend won't start:
- Check model exists: ls models/Qwen3-TTS-12Hz-1.7B-Base/
- If missing, run ./setup.sh or ./link_model.sh
- Verify Python version: python --version (need 3.12+)
Port already in use:
# Check what's using the port
lsof -i :8000 # Backend
lsof -i :3000 # Frontend (Yarn)
lsof -i :7860 # Docker
# Kill process or change port in configuration
Frontend can't connect to backend:
- Verify backend is running: curl http://localhost:8000/api/status
- Check CORS settings in backend/main.py
- Ensure API_URL in frontend/src/App.tsx matches backend address
- For network access, use yarn dev --host and update CORS origins
Model loading fails:
- Verify CUDA availability: python -c "import torch; print(torch.cuda.is_available())"
- Check GPU memory: Should have ~4GB free
- Try CPU mode: Set USE_CPU=true environment variable
- CPU mode is slower but works without a GPU (8-12GB RAM needed)