perf: engage ZeroGPU pack/stream, tune GPU duration, fix requirements
ZeroGPU eager-load: _maybe_eager_load() places the model at module scope
when SPACES_ZERO_GPU=1 so the spaces hijack packs weights at startup and
streams them into workers (fast cold starts). Off-ZeroGPU it stays lazy,
so importing the app never downloads the model. The previous lazy
@lru_cache load inside @spaces.GPU forfeited that path and paid a full
from_pretrained on every cold worker (verified via the ZeroGPU reference).
GPU duration: @spaces.GPU(duration=_estimate_duration) scales the
reservation with max_new_tokens * num_beams (capped 120s) instead of the
implicit 60s, improving queue priority and quota checks. Added a
perf_counter log in translate() to calibrate the estimate from live logs.
requirements.txt: drop gradio and spaces (installed by the Spaces base
image on every tier; listing them drifts the runtime) and pin torch to a
ZeroGPU-supported version (2.11.0) instead of >=2,<3 which can trip
CONFIG_ERROR. gradio/spaces moved to requirements-dev.txt for local runs.
README: add python_version: "3.12" (Spaces default is 3.10).
Tests: +3 fast (eager-load gating skip/run, duration estimator) -> 46
fast / 56 total. ruff + ty clean. Docs updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- CLAUDE.md +5 -5
- README.md +4 -3
- app.py +35 -2
- requirements-dev.txt +6 -0
- requirements.txt +7 -3
- tests/test_app.py +34 -0
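The eager-load gating described in the commit message can be illustrated with a minimal, self-contained sketch. It is not the code from `app.py`: `load_model`, `maybe_eager_load`, and `LOAD_CALLS` are hypothetical stand-ins, and the environment is passed in as a mapping only so the behaviour can be checked without touching the real process environment. The one detail taken from the commit is the check on `SPACES_ZERO_GPU == "1"`.

```python
import os
from collections.abc import Mapping
from functools import lru_cache
from typing import Optional

# Records loader invocations so the gating is observable (illustration only).
LOAD_CALLS: list[str] = []


@lru_cache(maxsize=1)
def load_model() -> str:
    """Hypothetical stand-in for an expensive from_pretrained() call."""
    LOAD_CALLS.append("model")
    return "model-object"


def maybe_eager_load(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Load eagerly only when the ZeroGPU flag is set; otherwise stay lazy."""
    env = os.environ if environ is None else environ
    if env.get("SPACES_ZERO_GPU") == "1":
        load_model()  # cached, so repeated calls do not reload
        return True
    return False
```

Called with an empty mapping, the sketch loads nothing; called with `{"SPACES_ZERO_GPU": "1"}`, it loads once and later calls hit the cache.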
|
@@ -25,8 +25,8 @@ uv run ruff format .
|
|
| 25 |
uv run ty check
|
| 26 |
|
| 27 |
# Test
|
| 28 |
-
uv run pytest # all
|
| 29 |
-
uv run pytest -m "not slow" #
|
| 30 |
uv run pytest -m slow # 10 model tests only (CUDA only)
|
| 31 |
|
| 32 |
# Generate language mapping (dev only)
|
|
@@ -35,19 +35,19 @@ uv run scripts/generate_langmap.py <path-to-paper.pdf>
|
|
| 35 |
|
| 36 |
## Architecture
|
| 37 |
|
| 38 |
-
**`app.py`** — Single-file application with a Google Translate-style layout: top row has two symmetric, filterable, region-sorted language dropdowns (source defaults to "English (en)", target defaults to "French (fr)") with a swap button ("⇄") between them; below that, input textbox and output textbox with copy button side by side. The Translate button spans full width below both textboxes (shows "Translating..." during processing). Ctrl+Enter submits from the input. The model auto-detects source language; the source dropdown is for user reference and the swap button only. Uses `@lru_cache` for lazy loading of the `google/madlad400-3b-mt` tokenizer and model (
|
| 39 |
|
| 40 |
**`langmap/`** — Package with `langid_mapping.py`, mapping 418 language tokens to `{"name": ..., "region": ...}` dicts. Auto-generated by `scripts/generate_langmap.py` from Table 9 (Section A.1) of the MADLAD-400 paper. Available languages at runtime are the intersection of this mapping and the model's vocabulary.
|
| 41 |
|
| 42 |
**`scripts/`** — `generate_langmap.py` parses the MADLAD-400 paper PDF (Table 9, pages 16-22) using pdfplumber and generates the static language mapping with region assignments. Dev-only tool; requires `requirements-dev.txt` dependencies.
|
| 43 |
|
| 44 |
-
**`tests/`** —
|
| 45 |
|
| 46 |
## Tooling
|
| 47 |
|
| 48 |
When working with Python, invoke the relevant `/astral:<skill>` for uv, ty, and ruff to ensure best practices are followed.
|
| 49 |
|
| 50 |
-
- **uv** — Python package manager. Used for venv creation and dependency installation
|
| 51 |
- **Ruff** — linter and formatter (`ruff.toml`). Rules: `E`, `F`, `I`, `UP`, `W`. Line length: 120.
|
| 52 |
- **ty** — type checker (`ty.toml`). Python 3.12 target.
|
| 53 |
- **pytest** — test runner (`pytest.ini`). Custom `slow` marker for CUDA-dependent tests.
|
|
|
|
| 25 |
uv run ty check
|
| 26 |
|
| 27 |
# Test
|
| 28 |
+
uv run pytest # all 56 tests (slow require CUDA + model download)
|
| 29 |
+
uv run pytest -m "not slow" # 46 fast tests only
|
| 30 |
uv run pytest -m slow # 10 model tests only (CUDA only)
|
| 31 |
|
| 32 |
# Generate language mapping (dev only)
|
|
|
|
| 35 |
|
| 36 |
## Architecture
|
| 37 |
|
| 38 |
+
**`app.py`** — Single-file application with a Google Translate-style layout: top row has two symmetric, filterable, region-sorted language dropdowns (source defaults to "English (en)", target defaults to "French (fr)") with a swap button ("⇄") between them; below that, input textbox (autofocused) and output textbox with copy button side by side. The Translate button spans full width below both textboxes (shows "Translating..." during processing). Ctrl+Enter submits from the input. The model auto-detects source language; the source dropdown is for user reference and the swap button only. Uses `@lru_cache` for lazy loading of the `google/madlad400-3b-mt` tokenizer and model. On ZeroGPU (`SPACES_ZERO_GPU=1`), `_maybe_eager_load()` places the model at module scope so the `spaces` hijack can pack weights and stream them into workers for fast cold starts; off-ZeroGPU (local, tests, cpu-basic) it stays lazy, so importing the app never downloads the model. Uses `float16` on CUDA, `float32` on CPU. MPS is not supported (produces garbage output with T5 models). Translation prepends a target language token with a space to the input text (e.g., `<2fr> Hello`) before tokenization and generation. The `@spaces.GPU` decorator allocates GPU on HF Spaces infrastructure; its `duration` is a callable (`_estimate_duration`) that scales the GPU reservation with `max_new_tokens × num_beams` (capped at 120s). The submit handler exposes a stable `/translate` API endpoint; the swap and Translate-button handlers are `api_visibility="private"`. Only `/translate` is public.
|
| 39 |
|
| 40 |
**`langmap/`** — Package with `langid_mapping.py`, mapping 418 language tokens to `{"name": ..., "region": ...}` dicts. Auto-generated by `scripts/generate_langmap.py` from Table 9 (Section A.1) of the MADLAD-400 paper. Available languages at runtime are the intersection of this mapping and the model's vocabulary.
|
| 41 |
|
| 42 |
**`scripts/`** — `generate_langmap.py` parses the MADLAD-400 paper PDF (Table 9, pages 16-22) using pdfplumber and generates the static language mapping with region assignments. Dev-only tool; requires `requirements-dev.txt` dependencies.
|
| 43 |
|
| 44 |
+
**`tests/`** — 56 tests (46 fast, 10 slow). `test_langmap.py` has 10 fast tests for mapping validation (dict shape, regions, spot-checks). `test_app.py` has 36 fast tests (signatures, device fallback, ZeroGPU eager-load gating, GPU duration estimator, UI layout with symmetric dropdowns, swap button, textbox config including toolbar buttons and input autofocus, handler wiring, stable `translate` API endpoint with UI-only handlers kept private, no HTML elements, locale codes, no title) and 10 slow tests (translation with various parameters, language mapping). Slow tests require CUDA and model download; auto-skipped without CUDA.
|
| 45 |
|
| 46 |
## Tooling
|
| 47 |
|
| 48 |
When working with Python, invoke the relevant `/astral:<skill>` for uv, ty, and ruff to ensure best practices are followed.
|
| 49 |
|
| 50 |
+
- **uv** — Python package manager. Used for venv creation and dependency installation. No `pyproject.toml` (HF Spaces requires `requirements.txt`). `requirements.txt` is the Spaces build manifest and omits `gradio`/`spaces` (provided by the Spaces runtime on every tier) and pins `torch` to a ZeroGPU-supported version; `requirements-dev.txt` adds `gradio`/`spaces` for local runs plus the dev tooling, so local setup installs both files.
|
| 51 |
- **Ruff** — linter and formatter (`ruff.toml`). Rules: `E`, `F`, `I`, `UP`, `W`. Line length: 120.
|
| 52 |
- **ty** — type checker (`ty.toml`). Python 3.12 target.
|
| 53 |
- **pytest** — test runner (`pytest.ini`). Custom `slow` marker for CUDA-dependent tests.
|
|
@@ -6,6 +6,7 @@ colorTo: green
|
|
| 6 |
sdk: gradio
|
| 7 |
sdk_version: 6.17.3
|
| 8 |
app_file: app.py
|
|
|
|
| 9 |
pinned: false
|
| 10 |
license: apache-2.0
|
| 11 |
short_description: Translate between 418 languages.
|
|
@@ -26,7 +27,7 @@ Translate between 418 languages from Table 9 (Section A.1) of Google's [MADLAD-4
|
|
| 26 |
```bash
|
| 27 |
uv venv --python 3.12
|
| 28 |
uv pip install -r requirements.txt
|
| 29 |
-
uv pip install -r requirements-dev.txt # dev tools
|
| 30 |
uv run app.py
|
| 31 |
```
|
| 32 |
|
|
@@ -38,6 +39,6 @@ The Gradio interface launches at `http://localhost:7860`.
|
|
| 38 |
uv run ruff check . # lint
|
| 39 |
uv run ruff format . # format
|
| 40 |
uv run ty check # type check
|
| 41 |
-
uv run pytest -m "not slow" #
|
| 42 |
-
uv run pytest # all
|
| 43 |
```
|
|
|
|
| 6 |
sdk: gradio
|
| 7 |
sdk_version: 6.17.3
|
| 8 |
app_file: app.py
|
| 9 |
+
python_version: "3.12"
|
| 10 |
pinned: false
|
| 11 |
license: apache-2.0
|
| 12 |
short_description: Translate between 418 languages.
|
|
|
|
| 27 |
```bash
|
| 28 |
uv venv --python 3.12
|
| 29 |
uv pip install -r requirements.txt
|
| 30 |
+
uv pip install -r requirements-dev.txt # local runtime (gradio, spaces) + dev tools
|
| 31 |
uv run app.py
|
| 32 |
```
|
| 33 |
|
|
|
|
| 39 |
uv run ruff check . # lint
|
| 40 |
uv run ruff format . # format
|
| 41 |
uv run ty check # type check
|
| 42 |
+
uv run pytest -m "not slow" # 46 fast tests
|
| 43 |
+
uv run pytest # all 56 tests (slow require CUDA + model download)
|
| 44 |
```
|
|
@@ -3,6 +3,8 @@ Translation interface using the MADLAD-400 3B model.
|
|
| 3 |
Translates between 418 languages from the MADLAD-400 paper.
|
| 4 |
"""
|
| 5 |
|
| 6 |
import warnings
|
| 7 |
from collections.abc import Generator
|
| 8 |
from functools import lru_cache
|
|
@@ -57,7 +59,33 @@ def _build_language_mappings() -> tuple[dict[str, str], list[str]]:
|
|
| 57 |
return name_to_code, sorted_names
|
| 58 |
|
| 59 |
|
| 60 |
-
|
| 61 |
def translate(
|
| 62 |
text: str,
|
| 63 |
target_language_name: str,
|
|
@@ -84,8 +112,12 @@ def translate(
|
|
| 84 |
generate_kwargs["do_sample"] = True
|
| 85 |
generate_kwargs["temperature"] = temperature
|
| 86 |
|
|
|
|
| 87 |
outputs = model.generate(**generate_kwargs)
|
| 88 |
-
|
| 89 |
|
| 90 |
|
| 91 |
def _translate_with_loading(
|
|
@@ -164,6 +196,7 @@ def _build_demo() -> gr.Blocks:
|
|
| 164 |
|
| 165 |
|
| 166 |
demo = _build_demo()
|
|
|
|
| 167 |
|
| 168 |
|
| 169 |
def main() -> None:
|
|
|
|
| 3 |
Translates between 418 languages from the MADLAD-400 paper.
|
| 4 |
"""
|
| 5 |
|
| 6 |
+
import os
|
| 7 |
+
import time
|
| 8 |
import warnings
|
| 9 |
from collections.abc import Generator
|
| 10 |
from functools import lru_cache
|
|
|
|
| 59 |
return name_to_code, sorted_names
|
| 60 |
|
| 61 |
|
| 62 |
+
def _maybe_eager_load() -> None:
|
| 63 |
+
"""On ZeroGPU, place the model at module scope so the ``spaces`` hijack can pack
|
| 64 |
+
weights to disk at startup and stream them into each worker's VRAM (fast cold
|
| 65 |
+
starts). Off-ZeroGPU (local, tests, cpu-basic) this is a no-op, so importing the
|
| 66 |
+
app never downloads the model. ``SPACES_ZERO_GPU`` is set only on ZeroGPU."""
|
| 67 |
+
if os.environ.get("SPACES_ZERO_GPU") == "1":
|
| 68 |
+
_load_tokenizer()
|
| 69 |
+
_load_model()
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
def _estimate_duration(
|
| 73 |
+
text: str,
|
| 74 |
+
target_language_name: str,
|
| 75 |
+
max_new_tokens: int = 512,
|
| 76 |
+
num_beams: int = 1,
|
| 77 |
+
temperature: float = 1.0,
|
| 78 |
+
) -> int:
|
| 79 |
+
"""Reserve GPU time scaled to the worst case: generation cost grows with the
|
| 80 |
+
number of tokens generated and the beam width. Mirrors translate()'s signature
|
| 81 |
+
(ZeroGPU calls the duration callable with the decorated function's args).
|
| 82 |
+
Conservative and capped at 120s; calibrate from the perf_counter log in
|
| 83 |
+
translate() (zerogpu.md 'Sizing duration')."""
|
| 84 |
+
del text, target_language_name, temperature # only token/beam counts drive runtime
|
| 85 |
+
return min(120, 30 + (max_new_tokens * num_beams) // 8)
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
@spaces.GPU(duration=_estimate_duration)
|
| 89 |
def translate(
|
| 90 |
text: str,
|
| 91 |
target_language_name: str,
|
|
|
|
| 112 |
generate_kwargs["do_sample"] = True
|
| 113 |
generate_kwargs["temperature"] = temperature
|
| 114 |
|
| 115 |
+
start = time.perf_counter()
|
| 116 |
outputs = model.generate(**generate_kwargs)
|
| 117 |
+
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
| 118 |
+
elapsed = time.perf_counter() - start
|
| 119 |
+
print(f"[translate] max_new_tokens={max_new_tokens} num_beams={num_beams} took {elapsed:.1f}s")
|
| 120 |
+
return result
|
| 121 |
|
| 122 |
|
| 123 |
def _translate_with_loading(
|
|
|
|
| 196 |
|
| 197 |
|
| 198 |
demo = _build_demo()
|
| 199 |
+
_maybe_eager_load()
|
| 200 |
|
| 201 |
|
| 202 |
def main() -> None:
|
|
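As a worked example of the duration formula in the hunk above, the sketch below reuses the expression `min(120, 30 + (max_new_tokens * num_beams) // 8)` in a standalone function. The name `estimate_duration` and the reduced two-argument signature are simplifications for illustration; the function in the diff mirrors the full `translate()` signature.

```python
def estimate_duration(max_new_tokens: int = 512, num_beams: int = 1) -> int:
    """Seconds to reserve: 30s base plus one second per 8 token-beams, capped at 120s."""
    return min(120, 30 + (max_new_tokens * num_beams) // 8)


# The three cases exercised by the duration test in this commit.
for tokens, beams in [(10, 1), (512, 1), (512, 8)]:
    print(f"max_new_tokens={tokens} num_beams={beams} -> {estimate_duration(tokens, beams)}s")
```

With these inputs the reservations are 31s, 94s, and 120s. The cap takes effect once `max_new_tokens * num_beams` reaches 720, since `720 // 8 = 90` and `30 + 90 = 120`.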
@@ -1,3 +1,9 @@
|
|
| 1 |
pdfplumber>=0.11,<1
|
| 2 |
pytest>=9,<10
|
| 3 |
ruff>=0.11,<1
|
|
|
|
| 1 |
+
# Local-dev only (not read by HF Spaces). gradio and spaces are provided by the
|
| 2 |
+
# Spaces runtime on every tier, but must be installed to run the app locally.
|
| 3 |
+
gradio>=6,<7
|
| 4 |
+
spaces>=0.34,<1
|
| 5 |
+
|
| 6 |
+
# Dev tooling.
|
| 7 |
pdfplumber>=0.11,<1
|
| 8 |
pytest>=9,<10
|
| 9 |
ruff>=0.11,<1
|
|
@@ -1,7 +1,11 @@
|
|
| 1 |
accelerate>=1,<2
|
| 2 |
-
gradio>=6,<7
|
| 3 |
sentencepiece>=0.2,<1
|
| 4 |
-
spaces>=0.34,<1
|
| 5 |
tokenizers>=0.21,<1
|
| 6 |
-
torch
|
| 7 |
transformers>=4,<5
|
|
|
|
| 1 |
+
# HF Spaces build manifest.
|
| 2 |
+
# gradio and spaces are intentionally NOT listed: the Gradio SDK base image installs
|
| 3 |
+
# both on every hardware tier (gradio is locked by `sdk_version` in README.md; `spaces`
|
| 4 |
+
# is platform-pinned by the ZeroGPU runtime). Listing either causes resolution failures
|
| 5 |
+
# or silently drifts the runtime. They live in requirements-dev.txt for local runs.
|
| 6 |
+
# torch is pinned to a ZeroGPU-supported version (accepted: 2.8.0/2.9.1/2.10.0/2.11.0).
|
| 7 |
accelerate>=1,<2
|
|
|
|
| 8 |
sentencepiece>=0.2,<1
|
|
|
|
| 9 |
tokenizers>=0.21,<1
|
| 10 |
+
torch==2.11.0
|
| 11 |
transformers>=4,<5
|
|
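One way to guard the torch pin against drift is a small check that reads the requirements text and compares the pinned version with the accepted list quoted in the comment above (2.8.0/2.9.1/2.10.0/2.11.0). This is a hedged sketch, not part of the commit: `pinned_version` and `torch_pin_is_supported` are hypothetical helpers, and the accepted set is copied from that comment, so it may go stale as the ZeroGPU runtime changes.

```python
from typing import Optional

# Copied from the comment in requirements.txt; assumed, not queried from the platform.
ACCEPTED_TORCH = {"2.8.0", "2.9.1", "2.10.0", "2.11.0"}


def pinned_version(requirement_line: str, package: str) -> Optional[str]:
    """Return the exact `==` pin for `package` on this line, or None if absent."""
    line = requirement_line.split("#", 1)[0].strip()  # drop trailing comments
    name, sep, version = line.partition("==")
    if sep and name.strip().lower() == package:
        return version.strip()
    return None


def torch_pin_is_supported(requirements_text: str) -> bool:
    """True only if torch is pinned with `==` to a version in ACCEPTED_TORCH."""
    for line in requirements_text.splitlines():
        version = pinned_version(line, "torch")
        if version is not None:
            return version in ACCEPTED_TORCH
    return False  # unpinned or range-only specifiers are treated as unsupported
```

A range specifier such as `torch>=2,<3` or a bare `torch` line is rejected by this check, which matches the commit's intent of requiring an exact pin.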
@@ -66,6 +66,40 @@ def test_get_device_warns_on_cpu():
|
|
| 66 |
app._get_device()
|
| 67 |
|
| 68 |
|
| 69 |
# --- UI component tests ---
|
| 70 |
|
| 71 |
|
|
|
|
| 66 |
app._get_device()
|
| 67 |
|
| 68 |
|
| 69 |
+
def test_maybe_eager_load_skipped_off_zerogpu(monkeypatch):
|
| 70 |
+
"""Off ZeroGPU, _maybe_eager_load() must not load the model (no download on import)."""
|
| 71 |
+
import app
|
| 72 |
+
|
| 73 |
+
monkeypatch.delenv("SPACES_ZERO_GPU", raising=False)
|
| 74 |
+
with patch.object(app, "_load_model") as load_model, patch.object(app, "_load_tokenizer") as load_tokenizer:
|
| 75 |
+
app._maybe_eager_load()
|
| 76 |
+
load_model.assert_not_called()
|
| 77 |
+
load_tokenizer.assert_not_called()
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
def test_maybe_eager_load_runs_on_zerogpu(monkeypatch):
|
| 81 |
+
"""On ZeroGPU (SPACES_ZERO_GPU=1), _maybe_eager_load() eagerly loads model + tokenizer."""
|
| 82 |
+
import app
|
| 83 |
+
|
| 84 |
+
monkeypatch.setenv("SPACES_ZERO_GPU", "1")
|
| 85 |
+
with patch.object(app, "_load_model") as load_model, patch.object(app, "_load_tokenizer") as load_tokenizer:
|
| 86 |
+
app._maybe_eager_load()
|
| 87 |
+
load_model.assert_called_once()
|
| 88 |
+
load_tokenizer.assert_called_once()
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
def test_estimate_duration_is_input_aware_and_capped():
|
| 92 |
+
"""Duration should scale with tokens*beams, give small inputs a smaller reservation, and cap at 120s."""
|
| 93 |
+
import app
|
| 94 |
+
|
| 95 |
+
small = app._estimate_duration("hi", "French (fr)", max_new_tokens=10, num_beams=1)
|
| 96 |
+
default = app._estimate_duration("hi", "French (fr)", max_new_tokens=512, num_beams=1)
|
| 97 |
+
heavy = app._estimate_duration("hi", "French (fr)", max_new_tokens=512, num_beams=8)
|
| 98 |
+
assert small < default <= 120
|
| 99 |
+
assert heavy == 120 # capped
|
| 100 |
+
assert all(isinstance(d, int) for d in (small, default, heavy))
|
| 101 |
+
|
| 102 |
+
|
| 103 |
# --- UI component tests ---
|
| 104 |
|
| 105 |
|