
Daryl Lim

docs: update CLAUDE.md test counts and setup commands

fc98a20 2 months ago

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

A Hugging Face Spaces app that translates between the 418 languages listed in Table 9 (Section A.1) of the MADLAD-400 paper, using Google's MADLAD-400 3B seq2seq model. Built with Gradio and deployed on HF Spaces. Falls back to CPU with a warning when no CUDA GPU is available.

Commands

# Setup
uv venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -r requirements-dev.txt

# Run (launches on http://localhost:7860)
python app.py

# Lint and format
ruff check .
ruff format .

# Type check
ty check

# Test
pytest                     # all 48 tests (slow tests require CUDA + model download)
pytest -m "not slow"       # 38 fast tests only
pytest -m slow             # 10 model tests only (CUDA only)

# Generate language mapping (dev only)
python scripts/generate_langmap.py <path-to-paper.pdf>

Architecture

app.py — Single-file application with a Google Translate-style layout. The top row has two symmetric, filterable, region-sorted language dropdowns (source defaults to "English (en)", target defaults to "French (fr)") with a swap button ("⇄") between them. Below that, an input textbox with an inline clear button and an output textbox with a copy button sit side by side. The Translate button spans the full width below both textboxes and shows "Translating..." during processing. Ctrl+Enter submits from the input. The model auto-detects the source language, so the source dropdown exists only for user reference and for the swap button. Uses @lru_cache for lazy loading of the google/madlad400-3b-mt tokenizer and model (no download on import). Uses float16 on CUDA and float32 on CPU. MPS is not supported (it produces garbage output with T5 models). Translation prepends a target-language token followed by a space to the input text (e.g., <2fr> Hello) before tokenization and generation. The @spaces.GPU decorator allocates a GPU on HF Spaces infrastructure.
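The loading and prompt-building behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: the function names `pick_device_and_dtype`, `build_prompt`, and `load_model` are hypothetical, and the loader body is a placeholder standing in for the real tokenizer and model download so that the sketch runs without any third-party packages.

```python
from functools import lru_cache

MODEL_ID = "google/madlad400-3b-mt"

# Records every real load so the caching behavior is observable.
LOAD_CALLS: list[str] = []


def pick_device_and_dtype(cuda_available: bool) -> tuple[str, str]:
    """Return (device, dtype name): float16 on CUDA, float32 on CPU."""
    if cuda_available:
        return "cuda", "float16"
    return "cpu", "float32"


def build_prompt(text: str, target_code: str) -> str:
    """Prepend the target-language token and a space, e.g. '<2fr> Hello'."""
    return f"<2{target_code}> {text}"


@lru_cache(maxsize=1)
def load_model() -> dict:
    """Placeholder loader (assumption): the real app would load the
    tokenizer and model here. Nothing runs at import time; the body
    executes on the first call only, and later calls reuse the result."""
    LOAD_CALLS.append(MODEL_ID)
    return {"model_id": MODEL_ID}


if __name__ == "__main__":
    print(pick_device_and_dtype(cuda_available=False))
    print(build_prompt("Hello", "fr"))
    load_model()
    load_model()
    print(len(LOAD_CALLS))
```

Because the cached loader takes no arguments, `lru_cache` acts as a simple "load once on first use" guard, which is what keeps importing the module cheap.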

langmap/ — Package with langid_mapping.py, mapping 418 language tokens to {"name": ..., "region": ...} dicts. Auto-generated by scripts/generate_langmap.py from Table 9 (Section A.1) of the MADLAD-400 paper. Available languages at runtime are the intersection of this mapping and the model's vocabulary.
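The runtime intersection mentioned above might look something like the sketch below. It assumes the mapping keys are literal tokens such as `<2fr>` and that the vocabulary is available as a set of token strings. Both are assumptions for illustration, as is the helper name `available_languages`.

```python
def available_languages(
    mapping: dict[str, dict[str, str]],
    vocab: set[str],
) -> dict[str, dict[str, str]]:
    """Keep only mapping entries whose token also exists in the model vocabulary."""
    return {token: info for token, info in mapping.items() if token in vocab}


if __name__ == "__main__":
    # Hypothetical sample data; "<2xx>" stands for a token missing from the vocabulary.
    sample_mapping = {
        "<2fr>": {"name": "French", "region": "Europe"},
        "<2xx>": {"name": "Example", "region": "Nowhere"},
    }
    sample_vocab = {"<2fr>", "<2en>", "Hello"}
    print(available_languages(sample_mapping, sample_vocab))
```

Filtering the mapping rather than the vocabulary keeps the name and region metadata attached to each surviving token.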

scripts/ — generate_langmap.py parses the MADLAD-400 paper PDF (Table 9, pages 16-22) using pdfplumber and generates the static language mapping with region assignments. Dev-only tool; requires requirements-dev.txt dependencies.

tests/ — 48 tests (38 fast, 10 slow). test_langmap.py has 10 fast tests for mapping validation (dict shape, regions, spot-checks). test_app.py has 28 fast tests (signatures, device fallback, UI layout with symmetric dropdowns, swap button, textbox config, handler wiring, no HTML elements, locale codes, no title) and 10 slow tests (translation with various parameters, language mapping). Slow tests require CUDA and model download; auto-skipped without CUDA.
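One common way to auto-skip marker-tagged tests is a collection hook in `conftest.py`. The sketch below shows that pattern under stated assumptions: the repository may implement the skip differently (for example with `skipif` on each test), and the helper names `cuda_available` and `mark_slow_tests` are hypothetical.

```python
import pytest


def cuda_available() -> bool:
    """Report whether a CUDA device is usable; False if torch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def mark_slow_tests(items, has_cuda: bool) -> int:
    """Add a skip marker to every item tagged 'slow' when CUDA is missing.

    Returns the number of items that were marked.
    """
    if has_cuda:
        return 0
    skip_slow = pytest.mark.skip(reason="slow tests require CUDA and a model download")
    skipped = 0
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
            skipped += 1
    return skipped


def pytest_collection_modifyitems(config, items):
    """pytest hook: runs after collection, before any test executes."""
    mark_slow_tests(items, cuda_available())
```

With a hook like this, `pytest -m slow` on a machine without CUDA reports the slow tests as skipped instead of failing them.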

Tooling

uv — Python package manager. Used for venv creation and dependency installation from requirements.txt. No pyproject.toml; requirements.txt remains the single source of truth (required by HF Spaces).
Ruff — linter and formatter (ruff.toml). Rules: E, F, I, W. Line length: 120.
ty — type checker (ty.toml). Python 3.12 target.
pytest — test runner (pytest.ini). Custom slow marker for CUDA-dependent tests.