Spaces: darylalim/madlad-400-translate (Running on Zero)
Daryl Lim and Claude Opus 4.8 (1M context) committed 2 days ago

Commit 2ba774e (1 parent: f8a5dc6)

perf: engage ZeroGPU pack/stream, tune GPU duration, fix requirements

ZeroGPU eager-load: _maybe_eager_load() places the model at module scope
when SPACES_ZERO_GPU=1 so the spaces hijack packs weights at startup and
streams them into workers (fast cold starts). Off-ZeroGPU it stays lazy,
so importing the app never downloads the model. The previous lazy
@lru_cache load inside @spaces.GPU forfeited that path and paid a full
from_pretrained on every cold worker (verified via the ZeroGPU reference).

GPU duration: @spaces.GPU(duration=_estimate_duration) scales the
reservation with max_new_tokens * num_beams (capped 120s) instead of the
implicit 60s, improving queue priority and quota checks. Added a
perf_counter log in translate() to calibrate the estimate from live logs.

requirements.txt: drop gradio and spaces (installed by the Spaces base
image on every tier; listing them drifts the runtime) and pin torch to a
ZeroGPU-supported version (2.11.0) instead of >=2,<3 which can trip
CONFIG_ERROR. gradio/spaces moved to requirements-dev.txt for local runs.

README: add python_version: "3.12" (Spaces default is 3.10).

Tests: +3 fast (eager-load gating skip/run, duration estimator) -> 46
fast / 56 total. ruff + ty clean. Docs updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
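
The duration estimate described in the message can be checked by hand. The sketch below restates the formula from the `app.py` diff as a standalone function and prints the reservation for three settings. The name `estimate_duration` and the sample inputs are chosen here for illustration; the real callable, `_estimate_duration`, also accepts the other `translate()` arguments and ignores them.

```python
def estimate_duration(max_new_tokens: int = 512, num_beams: int = 1) -> int:
    """Illustrative restatement of the formula used by _estimate_duration."""
    # 30s base, plus one second per 8 token-beam units, capped at 120s.
    return min(120, 30 + (max_new_tokens * num_beams) // 8)


for tokens, beams in [(10, 1), (512, 1), (512, 8)]:
    seconds = estimate_duration(tokens, beams)
    print(f"max_new_tokens={tokens} num_beams={beams} -> {seconds}s")

# max_new_tokens=10 num_beams=1 -> 31s
# max_new_tokens=512 num_beams=1 -> 94s
# max_new_tokens=512 num_beams=8 -> 120s
```

With the default settings the reservation is 94s, so the formula only reserves less than the implicit 60s when `max_new_tokens * num_beams` is below 240.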

Files changed (6)

CLAUDE.md +5 -5
README.md +4 -3
app.py +35 -2
requirements-dev.txt +6 -0
requirements.txt +7 -3
tests/test_app.py +34 -0

CLAUDE.md

@@ -25,8 +25,8 @@ uv run ruff format .
 uv run ty check
 # Test
-uv run pytest                     # all 53 tests (slow require CUDA + model download)
-uv run pytest -m "not slow"       # 43 fast tests only
+uv run pytest                     # all 56 tests (slow require CUDA + model download)
+uv run pytest -m "not slow"       # 46 fast tests only
 uv run pytest -m slow             # 10 model tests only (CUDA only)
 # Generate language mapping (dev only)
@@ -35,19 +35,19 @@ uv run scripts/generate_langmap.py <path-to-paper.pdf>
 ## Architecture
-**`app.py`** — Single-file application with a Google Translate-style layout: top row has two symmetric, filterable, region-sorted language dropdowns (source defaults to "English (en)", target defaults to "French (fr)") with a swap button ("⇄") between them; below that, input textbox and output textbox with copy button side by side. The Translate button spans full width below both textboxes (shows "Translating..." during processing). Ctrl+Enter submits from the input. The model auto-detects source language; the source dropdown is for user reference and the swap button only. Uses `@lru_cache` for lazy loading of the `google/madlad400-3b-mt` tokenizer and model (no download on import). Uses `float16` on CUDA, `float32` on CPU. MPS is not supported (produces garbage output with T5 models). Translation prepends a target language token with a space to the input text (e.g., `<2fr> Hello`) before tokenization and generation. The `@spaces.GPU` decorator allocates GPU on HF Spaces infrastructure.
+**`app.py`** — Single-file application with a Google Translate-style layout: top row has two symmetric, filterable, region-sorted language dropdowns (source defaults to "English (en)", target defaults to "French (fr)") with a swap button ("⇄") between them; below that, input textbox (autofocused) and output textbox with copy button side by side. The Translate button spans full width below both textboxes (shows "Translating..." during processing). Ctrl+Enter submits from the input. The model auto-detects source language; the source dropdown is for user reference and the swap button only. Uses `@lru_cache` for lazy loading of the `google/madlad400-3b-mt` tokenizer and model. On ZeroGPU (`SPACES_ZERO_GPU=1`), `_maybe_eager_load()` places the model at module scope so the `spaces` hijack can pack weights and stream them into workers for fast cold starts; off-ZeroGPU (local, tests, cpu-basic) it stays lazy, so importing the app never downloads the model. Uses `float16` on CUDA, `float32` on CPU. MPS is not supported (produces garbage output with T5 models). Translation prepends a target language token with a space to the input text (e.g., `<2fr> Hello`) before tokenization and generation. The `@spaces.GPU` decorator allocates GPU on HF Spaces infrastructure; its `duration` is a callable (`_estimate_duration`) that scales the GPU reservation with `max_new_tokens × num_beams` (capped at 120s). The submit handler exposes a stable `/translate` API endpoint; the swap and Translate-button handlers are `api_visibility="private"`. Only `/translate` is public.
 **`langmap/`** — Package with `langid_mapping.py`, mapping 418 language tokens to `{"name": ..., "region": ...}` dicts. Auto-generated by `scripts/generate_langmap.py` from Table 9 (Section A.1) of the MADLAD-400 paper. Available languages at runtime are the intersection of this mapping and the model's vocabulary.
 **`scripts/`** — `generate_langmap.py` parses the MADLAD-400 paper PDF (Table 9, pages 16-22) using pdfplumber and generates the static language mapping with region assignments. Dev-only tool; requires `requirements-dev.txt` dependencies.
-**`tests/`** — 53 tests (43 fast, 10 slow). `test_langmap.py` has 10 fast tests for mapping validation (dict shape, regions, spot-checks). `test_app.py` has 33 fast tests (signatures, device fallback, UI layout with symmetric dropdowns, swap button, textbox config including toolbar buttons and input autofocus, handler wiring, stable `translate` API endpoint with UI-only handlers kept private, no HTML elements, locale codes, no title) and 10 slow tests (translation with various parameters, language mapping). Slow tests require CUDA and model download; auto-skipped without CUDA.
+**`tests/`** — 56 tests (46 fast, 10 slow). `test_langmap.py` has 10 fast tests for mapping validation (dict shape, regions, spot-checks). `test_app.py` has 36 fast tests (signatures, device fallback, ZeroGPU eager-load gating, GPU duration estimator, UI layout with symmetric dropdowns, swap button, textbox config including toolbar buttons and input autofocus, handler wiring, stable `translate` API endpoint with UI-only handlers kept private, no HTML elements, locale codes, no title) and 10 slow tests (translation with various parameters, language mapping). Slow tests require CUDA and model download; auto-skipped without CUDA.
 ## Tooling
 When working with Python, invoke the relevant `/astral:<skill>` for uv, ty, and ruff to ensure best practices are followed.
-- **uv** — Python package manager. Used for venv creation and dependency installation from `requirements.txt`. No `pyproject.toml`; `requirements.txt` remains the single source of truth (required by HF Spaces).
+- **uv** — Python package manager. Used for venv creation and dependency installation. No `pyproject.toml` (HF Spaces requires `requirements.txt`). `requirements.txt` is the Spaces build manifest and omits `gradio`/`spaces` (provided by the Spaces runtime on every tier) and pins `torch` to a ZeroGPU-supported version; `requirements-dev.txt` adds `gradio`/`spaces` for local runs plus the dev tooling, so local setup installs both files.
 - **Ruff** — linter and formatter (`ruff.toml`). Rules: `E`, `F`, `I`, `UP`, `W`. Line length: 120.
 - **ty** — type checker (`ty.toml`). Python 3.12 target.
 - **pytest** — test runner (`pytest.ini`). Custom `slow` marker for CUDA-dependent tests.

README.md

@@ -6,6 +6,7 @@ colorTo: green
 sdk: gradio
 sdk_version: 6.17.3
 app_file: app.py
+python_version: "3.12"
 pinned: false
 license: apache-2.0
 short_description: Translate between 418 languages.
@@ -26,7 +27,7 @@ Translate between 418 languages from Table 9 (Section A.1) of Google's [MADLAD-4
 ```bash
 uv venv --python 3.12
 uv pip install -r requirements.txt
-uv pip install -r requirements-dev.txt   # dev tools
+uv pip install -r requirements-dev.txt   # local runtime (gradio, spaces) + dev tools
 uv run app.py
 ```
@@ -38,6 +39,6 @@ The Gradio interface launches at `http://localhost:7860`.
 uv run ruff check .             # lint
 uv run ruff format .            # format
 uv run ty check                 # type check
-uv run pytest -m "not slow"     # 40 fast tests
-uv run pytest                   # all 50 tests (slow require CUDA + model download)
+uv run pytest -m "not slow"     # 46 fast tests
+uv run pytest                   # all 56 tests (slow require CUDA + model download)
 ```

app.py

@@ -3,6 +3,8 @@ Translation interface using the MADLAD-400 3B model.
 Translates between 418 languages from the MADLAD-400 paper.
 """
+import os
+import time
 import warnings
 from collections.abc import Generator
 from functools import lru_cache
@@ -57,7 +59,33 @@ def _build_language_mappings() -> tuple[dict[str, str], list[str]]:
     return name_to_code, sorted_names
-@spaces.GPU
+def _maybe_eager_load() -> None:
+    """On ZeroGPU, place the model at module scope so the ``spaces`` hijack can pack
+    weights to disk at startup and stream them into each worker's VRAM (fast cold
+    starts). Off-ZeroGPU (local, tests, cpu-basic) this is a no-op, so importing the
+    app never downloads the model. ``SPACES_ZERO_GPU`` is set only on ZeroGPU."""
+    if os.environ.get("SPACES_ZERO_GPU") == "1":
+        _load_tokenizer()
+        _load_model()
+def _estimate_duration(
+    text: str,
+    target_language_name: str,
+    max_new_tokens: int = 512,
+    num_beams: int = 1,
+    temperature: float = 1.0,
+) -> int:
+    """Reserve GPU time scaled to the worst case: generation cost grows with the
+    number of tokens generated and the beam width. Mirrors translate()'s signature
+    (ZeroGPU calls the duration callable with the decorated function's args).
+    Conservative and capped at 120s; calibrate from the perf_counter log in
+    translate() (zerogpu.md 'Sizing duration')."""
+    del text, target_language_name, temperature  # only token/beam counts drive runtime
+    return min(120, 30 + (max_new_tokens * num_beams) // 8)
+@spaces.GPU(duration=_estimate_duration)
 def translate(
     text: str,
     target_language_name: str,
@@ -84,8 +112,12 @@ def translate(
         generate_kwargs["do_sample"] = True
         generate_kwargs["temperature"] = temperature
+    start = time.perf_counter()
     outputs = model.generate(**generate_kwargs)
-    return tokenizer.decode(outputs[0], skip_special_tokens=True)
+    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    elapsed = time.perf_counter() - start
+    print(f"[translate] max_new_tokens={max_new_tokens} num_beams={num_beams} took {elapsed:.1f}s")
+    return result
 def _translate_with_loading(
@@ -164,6 +196,7 @@ def _build_demo() -> gr.Blocks:
 demo = _build_demo()
+_maybe_eager_load()
 def main() -> None:
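
The `[translate]` log line added above is meant for calibrating the estimate. One way to use it, sketched below under assumptions, is to parse those lines from the Space's logs and compare the measured time with the reservation. The helper `headroom`, the regular expression, and the sample lines are all hypothetical; the timings are made-up placeholders, not measurements.

```python
import re

# Matches the format printed by translate():
# "[translate] max_new_tokens=512 num_beams=1 took 4.2s"
LOG_PATTERN = re.compile(
    r"\[translate\] max_new_tokens=(\d+) num_beams=(\d+) took (\d+(?:\.\d+)?)s"
)


def estimate_duration(max_new_tokens: int, num_beams: int) -> int:
    """Same formula as _estimate_duration in app.py."""
    return min(120, 30 + (max_new_tokens * num_beams) // 8)


def headroom(log_lines: list[str]) -> list[tuple[int, int, float, float]]:
    """Return (max_new_tokens, num_beams, elapsed, reserved - elapsed) per log line."""
    rows = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # skip unrelated log output
        tokens, beams, elapsed = int(match[1]), int(match[2]), float(match[3])
        rows.append((tokens, beams, elapsed, estimate_duration(tokens, beams) - elapsed))
    return rows


# Placeholder log lines for illustration only.
sample = [
    "[translate] max_new_tokens=512 num_beams=1 took 4.2s",
    "some unrelated startup message",
    "[translate] max_new_tokens=512 num_beams=4 took 11.0s",
]
for tokens, beams, elapsed, spare in headroom(sample):
    print(f"tokens={tokens} beams={beams} elapsed={elapsed:.1f}s spare={spare:.1f}s")
```

Consistently large spare time would suggest lowering the base or the per-token factor. Negative spare time means the call ran longer than its reservation and the estimate is too low for that input.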

requirements-dev.txt

@@ -1,3 +1,9 @@
+# Local-dev only (not read by HF Spaces). gradio and spaces are provided by the
+# Spaces runtime on every tier, but must be installed to run the app locally.
+gradio>=6,<7
+spaces>=0.34,<1
+# Dev tooling.
 pdfplumber>=0.11,<1
 pytest>=9,<10
 ruff>=0.11,<1

requirements.txt

@@ -1,7 +1,11 @@
+# HF Spaces build manifest.
+# gradio and spaces are intentionally NOT listed: the Gradio SDK base image installs
+# both on every hardware tier (gradio is locked by `sdk_version` in README.md; `spaces`
+# is platform-pinned by the ZeroGPU runtime). Listing either causes resolution failures
+# or silently drifts the runtime. They live in requirements-dev.txt for local runs.
+# torch is pinned to a ZeroGPU-supported version (accepted: 2.8.0/2.9.1/2.10.0/2.11.0).
 accelerate>=1,<2
-gradio>=6,<7
 sentencepiece>=0.2,<1
-spaces>=0.34,<1
 tokenizers>=0.21,<1
-torch>=2,<3
+torch==2.11.0
 transformers>=4,<5
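
The manifest rules above (an exact `torch` pin from the accepted list, no `gradio` or `spaces`) could be guarded by a small check. This is only a sketch: `check_manifest` is a hypothetical helper that is not part of this commit, and the accepted versions are copied from the comment in `requirements.txt` rather than from any live source.

```python
# Accepted versions as listed in the requirements.txt comment (assumed current).
ACCEPTED_TORCH = {"2.8.0", "2.9.1", "2.10.0", "2.11.0"}
PLATFORM_PROVIDED = ("gradio", "spaces")


def check_manifest(requirements_text: str) -> list[str]:
    """Return a list of problems found in a requirements.txt body (empty if none)."""
    problems = []
    torch_version = None
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("torch=="):
            torch_version = line.removeprefix("torch==")
        for name in PLATFORM_PROVIDED:
            # Match the bare name or the name followed by a version specifier.
            if line == name or line.startswith((f"{name}=", f"{name}>", f"{name}<")):
                problems.append(f"{name} should not be listed")
    if torch_version is None:
        problems.append("torch is not pinned with ==")
    elif torch_version not in ACCEPTED_TORCH:
        problems.append(f"torch {torch_version} is not in the accepted list")
    return problems


new_manifest = "accelerate>=1,<2\ntorch==2.11.0\ntransformers>=4,<5\n"
old_manifest = "gradio>=6,<7\nspaces>=0.34,<1\ntorch>=2,<3\n"
print(check_manifest(new_manifest))  # []
print(check_manifest(old_manifest))
```

Run against the previous manifest, the check reports all three issues this commit addresses.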

tests/test_app.py

@@ -66,6 +66,40 @@ def test_get_device_warns_on_cpu():
             app._get_device()
+def test_maybe_eager_load_skipped_off_zerogpu(monkeypatch):
+    """Off ZeroGPU, _maybe_eager_load() must not load the model (no download on import)."""
+    import app
+    monkeypatch.delenv("SPACES_ZERO_GPU", raising=False)
+    with patch.object(app, "_load_model") as load_model, patch.object(app, "_load_tokenizer") as load_tokenizer:
+        app._maybe_eager_load()
+    load_model.assert_not_called()
+    load_tokenizer.assert_not_called()
+def test_maybe_eager_load_runs_on_zerogpu(monkeypatch):
+    """On ZeroGPU (SPACES_ZERO_GPU=1), _maybe_eager_load() eagerly loads model + tokenizer."""
+    import app
+    monkeypatch.setenv("SPACES_ZERO_GPU", "1")
+    with patch.object(app, "_load_model") as load_model, patch.object(app, "_load_tokenizer") as load_tokenizer:
+        app._maybe_eager_load()
+    load_model.assert_called_once()
+    load_tokenizer.assert_called_once()
+def test_estimate_duration_is_input_aware_and_capped():
+    """Duration should scale with tokens*beams, give small inputs a smaller reservation, and cap at 120s."""
+    import app
+    small = app._estimate_duration("hi", "French (fr)", max_new_tokens=10, num_beams=1)
+    default = app._estimate_duration("hi", "French (fr)", max_new_tokens=512, num_beams=1)
+    heavy = app._estimate_duration("hi", "French (fr)", max_new_tokens=512, num_beams=8)
+    assert small < default <= 120
+    assert heavy == 120  # capped
+    assert all(isinstance(d, int) for d in (small, default, heavy))
 # --- UI component tests ---