# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Full-stack Qwen3-TTS voice cloning application with:
- **CLI tools** (voice_clone.py, quick_clone.py) for command-line usage
- **FastAPI backend** (backend/) providing a REST API
- **React frontend** (frontend/) with responsive design and clean UI
- **Docker deployment** ready for Hugging Face Spaces and local hosting
- **Qwen3-TTS-12Hz-1.7B-Base model** for voice synthesis with up to 0.95 voice similarity

The project clones a voice from 3-10 seconds of reference audio and supports 10 languages (Chinese, English, Japanese, Korean, etc.).

**Deployment Options:**
1. Local development (separate backend + frontend)
2. Local Docker (unified container with Nginx)
3. Hugging Face Spaces (public Docker deployment with automatic model download)

## System Requirements

### Local Development

- **Python**: 3.12+ (managed via uv)
- **Node.js**: v18+ (frontend uses Yarn)
- **GPU** (optional): NVIDIA GPU with CUDA 11.8+ (uses ~4GB VRAM)
- **CPU mode**: Works without GPU (8-12GB RAM, slower generation)
- **Model**: Qwen3-TTS-12Hz-1.7B-Base (~4.3 GB, auto-downloaded or symlinked)

### Docker Deployment

- **Docker**: 20.10+
- **RAM**: 8-12GB minimum (for model loading)
- **Disk**: ~10GB (model + dependencies)
- **Network**: Required for the Hugging Face model download on first run

## Development Commands

### Initial Setup

```bash
# Install Python dependencies
uv sync

# Install frontend dependencies (subshell keeps you in the repo root)
(cd frontend && yarn install)

# Download/link model (choose one)
./setup.sh       # Full setup with model download
./link_model.sh  # Link to existing model at /path/to/model
```

### Running the Full Stack

**Backend (Terminal 1):**
```bash
./backend/start_server.sh
# or manually:
uv run uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
```

**Frontend (Terminal 2):**
```bash
cd frontend
yarn dev         # Local only (localhost:3000)
yarn dev --host  # Network accessible (0.0.0.0:3000)
```

Access:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs

### CLI Tools (Alternative to Web UI)

```bash
# Interactive mode
uv run python voice_clone.py

# Quick test with predefined config
uv run python quick_clone.py
```

### Testing

```bash
# Test backend configuration
uv run python test_backend.py

# Test model import
uv run python -c "from qwen_tts import Qwen3TTSModel; print('Model import successful')"

# Test API endpoints
curl http://localhost:8000/api/status
```

### Building for Production

```bash
# Build frontend
cd frontend
yarn build  # Output: frontend/dist/

# Frontend build can be served with any static file server
```

## Docker Deployment

### Local Docker (Unified Container)

The project includes a complete Docker setup in which Nginx serves both the frontend and the backend on port 7860.

**Build and run:**
```bash
# Build image
docker build -f Dockerfile -t qwen3-tts-hf .

# Run with model volume (avoids re-downloading)
docker run -d -p 7860:7860 \
  -v /path/to/models/Qwen3-TTS-12Hz-1.7B-Base:/app/models/Qwen3-TTS-12Hz-1.7B-Base:ro \
  --name qwen3-tts qwen3-tts-hf

# Or run without volume (auto-downloads model on first run)
docker run -d -p 7860:7860 --name qwen3-tts qwen3-tts-hf
```

**Access:**
- Frontend: http://localhost:7860
- Backend API: http://localhost:7860/api

**Architecture:**
- Nginx serves frontend from `/app/frontend/dist`
- Nginx reverse-proxies `/api/*` to backend on port 8000
- Backend runs with `uvicorn` on 127.0.0.1:8000
- Model auto-downloads from Hugging Face Hub if not present

### Hugging Face Spaces Deployment

The project is ready for deployment to Hugging Face Spaces with the Docker SDK.

**Files for HF Spaces:**
- `Dockerfile` - Multi-stage build (frontend + backend)
- `docker-entrypoint.sh` - Startup script with automatic model download
- `.dockerignore` - Excludes unnecessary files from build
- `README-HF.md` - Hugging Face Spaces documentation

**Deploy to HF Spaces:**
```bash
# 1. Create Space at https://huggingface.co/new-space
#    - Choose Docker SDK
#    - Select CPU basic (free) or CPU upgrade (16GB RAM)

# 2. Push to HF Spaces
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/SPACE_NAME
git push hf main

# 3. HF Spaces will automatically:
#    - Build Docker image (~5-10 min)
#    - Download model on first run (~5 min)
#    - Start service on port 7860
```

**Frontend Environment Detection:**

The frontend automatically detects the Hugging Face Spaces environment and adjusts API URLs accordingly:
```typescript
const API_URL = import.meta.env.VITE_API_URL || (
  window.location.hostname.includes('hf.space') ||
  window.location.hostname.includes('huggingface.co')
    ? ''                       // HF Spaces: relative path (Nginx proxy)
    : 'http://localhost:8000'  // Local development
)
```

**Docker Build Process:**
1. **Stage 1 (frontend-builder)**: Builds React app with `yarn build`
2. **Stage 2 (production)**:
   - Installs Python dependencies (CPU-only PyTorch)
   - Copies built frontend to `/app/frontend/dist`
   - Configures Nginx for unified serving
   - Sets up model auto-download via `docker-entrypoint.sh`

## Architecture

### Architecture Diagrams

**Local Development (Separate Servers):**
```
┌─────────────────────────────────────────┐
│  Frontend (Vite Dev Server)             │
│  - Port 3000 or 5173                    │
│  - Hot Module Replacement               │
└──────────────┬──────────────────────────┘
               │ HTTP/REST (CORS)
┌──────────────▼──────────────────────────┐
│  Backend (Uvicorn)                      │
│  - Port 8000                            │
│  - FastAPI + CORS middleware            │
└──────────────┬──────────────────────────┘
               │ Python API
┌──────────────▼──────────────────────────┐
│  Qwen3-TTS Model                        │
│  - GPU/CPU auto-detection               │
│  - FlashAttention 2 (optional)          │
└─────────────────────────────────────────┘
```

**Docker Deployment (Unified Container):**
```
┌─────────────────────────────────────────┐
│  Nginx (Port 7860)                      │
│  - Serves frontend static files         │
│  - Reverse proxy /api → backend         │
└──────┬──────────────────────┬───────────┘
       │                      │
       │ Static Files         │ Proxy /api/*
       │                      │
┌──────▼──────────┐    ┌──────▼──────────┐
│  Frontend Dist  │    │  Backend        │
│  /app/frontend/ │    │  127.0.0.1:8000 │
│  dist/          │    │  (Uvicorn)      │
└─────────────────┘    └──────┬──────────┘
                              │
                       ┌──────▼──────────┐
                       │  Qwen3-TTS      │
                       │  CPU-only mode  │
                       │  /app/models/   │
                       └─────────────────┘
```

### Backend API (backend/main.py)

FastAPI application with 7 endpoints:
- `GET /` - API info (version, model status, device)
- `GET /api/status` - Service status (ready/loading/error)
- `POST /api/upload` - Upload reference audio (returns audio_id)
- `POST /api/clone` - Generate cloned voice (requires ref_audio_id, ref_text, target_text)
- `GET /api/download/{audio_id}` - Download generated audio
- `DELETE /api/audio/{audio_id}` - Delete uploaded/generated audio
- `GET /api/cleanup` - Clean up old files (default: >24h)

**Key Backend Patterns:**
1. **Model singleton**: Global `model` instance loaded once at startup
2. **UUID-based file management**: All uploads/outputs use UUID filenames
3. **Automatic directory creation**: `backend/uploads/` and `backend/outputs/` created on startup
4. **CORS**: Pre-configured for localhost:3000 and localhost:5173 (Vite)

**API Request Example:**
```python
import requests

# 1. Upload reference audio (the with-block closes the file after upload)
with open('voice.wav', 'rb') as ref_file:
    response = requests.post('http://localhost:8000/api/upload',
                             files={'file': ref_file})
audio_id = response.json()['audio_id']

# 2. Generate cloned voice
response = requests.post('http://localhost:8000/api/clone', json={
    'ref_audio_id': audio_id,
    'ref_text': '參考音訊中的文字',    # "text spoken in the reference audio"
    'target_text': '要生成的新文字',  # "new text to generate"
    'language': 'Chinese',
    'x_vector_only': False
})
output_id = response.json()['audio_id']

# 3. Download result
response = requests.get(f'http://localhost:8000/api/download/{output_id}')
with open('output.wav', 'wb') as f:
    f.write(response.content)
```

### Frontend (frontend/src/App.tsx)

Modern single-page React application with responsive design and clean UI:

**Features:**
- Responsive layout (mobile/tablet/desktop with Tailwind breakpoints)
- Increased font sizes for better readability
- File upload via drag-and-drop or click
- Audio preview for uploaded reference audio
- Real-time status updates during generation
- Clean interface with non-functional links removed

**UI Improvements (Latest):**
- Larger fonts across all components (base, lg, xl sizes)
- Responsive container with `max-w-7xl` and flexible columns
- Two-column layout on desktop, single column on mobile
- Simplified navigation and footer (removed dummy links)
- Enhanced spacing and padding for better UX

**API Integration:**
- Environment-aware `API_URL` configuration
- Automatic detection of HF Spaces vs local development
- Uses native `fetch()` for all API calls
- Blob URL management for audio preview with cleanup

**Key Frontend Patterns:**
1. **useState hooks**: Form state (refAudioId, refText, targetText, language, etc.)
2. **useRef hooks**: File input, audio players (ref + generated)
3. **useEffect hooks**: Blob URL cleanup to prevent memory leaks
4. **Event handlers**: handleFileSelect, handleGenerate, handleDrop
5. **Conditional rendering**: Upload status, loading states, audio players

### CLI Tools (voice_clone.py, quick_clone.py)

Direct Python scripts that bypass the web stack:
- Load model directly
- Read local files from `reference_audios/`
- Write output to `outputs/`
- Useful for batch processing or server-side automation

**voice_clone.py modes:**
- `interactive_mode()`: User prompts for audio selection, text input, language
- `batch_mode()`: Generate multiple texts with same voice prompt

## Voice Cloning Workflow

### Core Workflow (All Interfaces)

1. **Create voice clone prompt** from reference audio + text:
   ```python
   voice_clone_prompt = model.create_voice_clone_prompt(
       ref_audio="path/to/audio.wav",
       ref_text="transcript of the audio",
       x_vector_only_mode=False,  # True = no ref_text needed but lower quality
   )
   ```

2. **Generate cloned voice**:
   ```python
   wavs, sr = model.generate_voice_clone(
       text="text to synthesize",
       language="Chinese",  # or "English", "Japanese", "Korean"
       voice_clone_prompt=voice_clone_prompt,
   )
   ```

3. **Save output**:
   ```python
   import soundfile as sf
   sf.write("output.wav", wavs[0], sr)
   ```

### Model Loading Pattern

All components use the same model loading pattern, with a fallback when FlashAttention 2 is unavailable:

```python
import torch
from qwen_tts import Qwen3TTSModel

try:
    model = Qwen3TTSModel.from_pretrained(
        "models/Qwen3-TTS-12Hz-1.7B-Base",
        device_map="cuda:0" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
except Exception:
    # Fallback to standard attention if FlashAttention 2 unavailable
    model = Qwen3TTSModel.from_pretrained(
        "models/Qwen3-TTS-12Hz-1.7B-Base",
        device_map="cuda:0" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
    )
```

## Directory Structure

```
qwen3clone/
├── backend/                    # FastAPI backend
│   ├── main.py                 # API endpoints and model loading
│   ├── start_server.sh         # Backend startup script
│   ├── uploads/                # Temporary uploaded reference audio (created at runtime)
│   └── outputs/                # Generated audio files (created at runtime)
│
├── frontend/                   # React frontend
│   ├── src/
│   │   ├── App.tsx             # Main application (responsive UI, all logic)
│   │   ├── main.tsx            # React entry point
│   │   ├── index.css           # Global styles (Tailwind directives)
│   │   └── vite-env.d.ts       # Vite environment types
│   ├── package.json            # Frontend dependencies
│   ├── tailwind.config.js      # Tailwind color theme (custom palette)
│   ├── vite.config.ts          # Vite dev server config
│   └── dist/                   # Built static files (created by yarn build)
│
├── models/                     # TTS model directory
│   └── Qwen3-TTS-12Hz-1.7B-Base/  # Symlink or actual model files
│
├── reference_audios/           # Input: 3-10s reference audio for CLI tools
├── outputs/                    # Output: CLI-generated .wav files
│
├── voice_clone.py              # CLI interactive tool
├── quick_clone.py              # CLI quick test script
├── test_backend.py             # Backend configuration tests
│
├── Dockerfile                  # Docker multi-stage build for HF Spaces
├── docker-entrypoint.sh        # Container startup script (auto model download)
├── .dockerignore               # Docker build exclusions
│
├── README-HF.md                # Hugging Face Spaces documentation
├── CLAUDE.md                   # This file (project guidance for Claude Code)
├── setup.sh                    # Full environment + model setup
├── link_model.sh               # Link to existing model
└── pyproject.toml              # Python dependencies (uv)
```

## Configuration

### Backend Configuration (backend/main.py)

```python
import os
from pathlib import Path

# Environment variable support (with defaults)
MODEL_PATH = os.getenv("MODEL_PATH", "models/Qwen3-TTS-12Hz-1.7B-Base")
UPLOAD_DIR = Path(os.getenv("UPLOAD_DIR", "backend/uploads"))
OUTPUT_DIR = Path(os.getenv("OUTPUT_DIR", "backend/outputs"))
USE_CPU = os.getenv("USE_CPU", "false").lower() == "true"

# CORS origins (environment variable or defaults)
cors_origins = os.getenv(
    "CORS_ORIGINS",
    "http://localhost:3000,http://localhost:5173,http://localhost"
).split(",")
```

**Docker Environment Variables:**
- `MODEL_PATH`: Model directory path (default: `/app/models/Qwen3-TTS-12Hz-1.7B-Base`)
- `UPLOAD_DIR`: Upload directory (default: `/app/backend/uploads`)
- `OUTPUT_DIR`: Output directory (default: `/app/backend/outputs`)
- `USE_CPU`: Force CPU mode (default: `true` in Docker, auto-detect in dev)
- `CORS_ORIGINS`: Comma-separated allowed origins

### Frontend Configuration (frontend/src/App.tsx)

```typescript
// Environment-aware configuration (automatic)
const API_URL = import.meta.env.VITE_API_URL || (
  window.location.hostname.includes('hf.space') ||
  window.location.hostname.includes('huggingface.co')
    ? ''                       // HF Spaces: use relative path
    : 'http://localhost:8000'  // Local development
)
```

**Configuration Methods:**
1. **Local Development**: Uses `http://localhost:8000` by default
2. **Docker/HF Spaces**: Set `VITE_API_URL=/api` in Dockerfile (already configured)
3. **Custom Network**: Set environment variable `VITE_API_URL=http://your-server:8000`

**For network deployment:**
1. Update backend CORS `allow_origins` to include the frontend URL
2. Set the `VITE_API_URL` environment variable or update the `API_URL` constant
3. Run the frontend with `yarn dev --host` to bind to 0.0.0.0

### CLI Configuration (quick_clone.py)

Edit the variables at the top of the file:
```python
REF_AUDIO = "reference_audios/ref_audio.wav"
REF_TEXT = "參考音訊的完整內容"  # "full transcript of the reference audio"
LANGUAGE = "Chinese"
# "first sentence to generate", "second sentence to generate"
TEST_TEXTS = ["要生成的第一句", "要生成的第二句"]
```

## Reference Audio Requirements

- **Duration**: 3-10 seconds (optimal balance of features vs noise)
- **Content**: Single-speaker, clear speech, minimal background noise
- **Format**: WAV, MP3, or FLAC
- **Transcript**: `ref_text` must **exactly match** spoken content for best quality
- **Quality impact**: Clean audio + accurate transcript = up to 0.95 similarity

## Performance Metrics

- **RTF**: ~0.5-0.6x (generates 2s audio in ~1s)
- **Sample rate**: 12kHz
- **Voice similarity**: Up to 0.95 with quality reference
- **GPU memory**: ~4GB VRAM
- **Startup time**: ~5-10s (model loading)
- **Supported languages**: Chinese, English, Japanese, Korean, + 6 more

## Key Implementation Details

### CORS and Network Access

Backend CORS is pre-configured for local development. For network deployment:

1. Update backend `allow_origins` in `main.py`:
   ```python
   allow_origins=["http://10.0.0.85:3000"]  # Your server IP
   ```

2. Update frontend `API_URL` in `App.tsx`:
   ```typescript
   const API_URL = 'http://10.0.0.85:8000'
   ```

3. Start backend with `--host 0.0.0.0` (already default in start_server.sh)
4. Start frontend with `yarn dev --host` to expose on network

### File Cleanup

Generated files persist indefinitely. Use the cleanup endpoint or a cron job:
```bash
# Manual cleanup via API
curl "http://localhost:8000/api/cleanup?max_age_hours=24"

# Or delete directories
rm -rf backend/uploads/* backend/outputs/*
```

### Model Symlink vs Download

Two options for model setup:
1. **Download** (setup.sh): Downloads ~4GB model to `models/`
2. **Symlink** (link_model.sh): Links to existing model elsewhere
   - Useful if model already downloaded in another project
   - Example: Links to `/home/user/models/Qwen3-TTS-12Hz-1.7B-Base`

### FlashAttention 2 Behavior

- Automatically attempts to load FlashAttention 2 for faster inference
- Gracefully falls back to standard attention if unavailable
- No code changes needed - handled transparently
- Setup script installs flash-attn, but the install may fail on some systems

### Output Naming Conventions

- **Backend API**: UUID-based (`a1b2c3d4-...-xyz.wav`)
- **CLI voice_clone.py**: `{ref_audio_stem}_clone_{count:03d}.wav`
- **CLI quick_clone.py**: `clone_{count:02d}.wav`
- **CLI batch_mode**: `batch_{count:03d}.wav`

## Dependencies

### Python (pyproject.toml)

- `qwen-tts`: Core TTS library
- `torch>=2.0.0`: Deep learning framework
- `fastapi>=0.109.0`: Web framework
- `uvicorn[standard]>=0.27.0`: ASGI server
- `python-multipart>=0.0.6`: File upload support
- `soundfile`: Audio I/O
- `flash-attn` (optional): Accelerated attention

### Frontend (package.json)

- `react` + `react-dom`: UI framework
- `lucide-react`: Icon library
- `typescript`: Type safety
- `vite`: Build tool and dev server
- `tailwindcss`: Utility-first CSS

## Recent Improvements

### Frontend UI Enhancements (Latest)

**Increased Font Sizes:**
- Navigation: `text-xl` (from `text-lg`)
- Hero title: `text-5xl md:text-6xl` (from `text-5xl`)
- Descriptions: `text-xl` (from `text-lg`)
- Buttons: `text-lg` (from `text-[15px]`)
- Form labels: `text-base` (from `text-[13px]`)
- Input/textarea: `text-base` (from `text-[13px]`)
- Status messages: `text-base` (from `text-[13px]`)

**Responsive Design:**
- Removed fixed width `w-[1440px]`
- Added responsive container: `max-w-7xl mx-auto`
- Two-column layout on large screens: `lg:w-1/2` for each panel
- Mobile-first with breakpoints: `md:`, `lg:` prefixes
- Responsive padding: `px-6 md:px-12 lg:px-24`

**Cleaned Interface:**
- Removed non-functional navigation links (API docs, GitHub)
- Removed dummy footer links (features, pricing, tutorials)
- Removed social media icons (Twitter, LinkedIn, GitHub)
- Removed "Model Size" dropdown (non-functional)
- Removed redundant "Choose File" button (upload area is clickable)
- Simplified footer to logo + copyright only

**Audio Preview:**
- Added reference audio preview with play controls
- Blob URL management with proper cleanup (useEffect)
- Prevents memory leaks from unreleased object URLs

## Troubleshooting

### Docker Issues

**Container won't start:**
```bash
# Check logs
docker logs qwen3-tts

# Common issues:
# 1. Port 7860 already in use
docker ps | grep 7860

# 2. Model download failed (network issue)
# 3. Insufficient memory (need 8-12GB RAM)
```

**Model not downloading:**
- Check internet connection
- Verify Hugging Face Hub is accessible
- Try manual download: `huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base`

**Frontend shows 404 on API calls:**
- Verify the Nginx configuration is valid: `docker exec qwen3-tts nginx -t`
- Check backend is healthy: `docker exec qwen3-tts curl http://127.0.0.1:8000/api/status`
- Review `API_URL` configuration in frontend

### Local Development Issues

**Backend won't start:**
- Check model exists: `ls models/Qwen3-TTS-12Hz-1.7B-Base/`
- If missing, run `./setup.sh` or `./link_model.sh`
- Verify Python version: `python --version` (need 3.12+)

**Port already in use:**
```bash
# Check what's using the port
lsof -i :8000  # Backend
lsof -i :3000  # Frontend (Yarn)
lsof -i :7860  # Docker

# Kill process or change port in configuration
```

**Frontend can't connect to backend:**
- Verify backend is running: `curl http://localhost:8000/api/status`
- Check CORS settings in `backend/main.py`
- Ensure `API_URL` in `frontend/src/App.tsx` matches backend address
- For network access, use `yarn dev --host` and update CORS origins

**Model loading fails:**
- Verify CUDA availability: `python -c "import torch; print(torch.cuda.is_available())"`
- Check GPU memory: Should have ~4GB free
- Try CPU mode: Set `USE_CPU=true` environment variable
- CPU mode is slower but works without a GPU (8-12GB RAM needed)
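
**Device check sketch:**

The model-loading checks above can be combined into one small diagnostic. This is a minimal sketch and not part of the repository: `resolve_device` and `cuda_is_available` are hypothetical helper names, and the `USE_CPU` handling only mirrors the convention described under Configuration (`true` forces CPU mode). `torch` is imported lazily so the script still reports a result when PyTorch is missing.

```python
import os


def resolve_device(use_cpu_env, cuda_available):
    """Pick a device string: USE_CPU=true forces CPU, otherwise prefer CUDA.

    Hypothetical helper for illustration; not taken from backend/main.py.
    """
    force_cpu = (use_cpu_env or "false").strip().lower() == "true"
    if force_cpu or not cuda_available:
        return "cpu"
    return "cuda:0"


def cuda_is_available():
    """Return True only if torch imports cleanly and reports a CUDA device."""
    try:
        import torch  # lazy import: a missing or broken install is reported, not fatal
    except (ImportError, OSError):
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    device = resolve_device(os.getenv("USE_CPU"), cuda_is_available())
    print(f"Model would load on: {device}")
```

Running the script once normally and once with `USE_CPU=true` set shows whether a load failure is related to GPU detection or to something else, such as a missing model directory.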