Daryl Lim and Claude Opus 4.8 (1M context) committed
Commit 2ba774e · 1 Parent(s): f8a5dc6

perf: engage ZeroGPU pack/stream, tune GPU duration, fix requirements

ZeroGPU eager-load: _maybe_eager_load() places the model at module scope
when SPACES_ZERO_GPU=1 so the spaces hijack packs weights at startup and
streams them into workers (fast cold starts). Off-ZeroGPU it stays lazy,
so importing the app never downloads the model. The previous lazy
@lru_cache load inside @spaces.GPU forfeited that path and paid a full
from_pretrained on every cold worker (verified via the ZeroGPU reference).

GPU duration: @spaces.GPU(duration=_estimate_duration) scales the
reservation with max_new_tokens * num_beams (capped 120s) instead of the
implicit 60s, improving queue priority and quota checks. Added a
perf_counter log in translate() to calibrate the estimate from live logs.
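
As a worked example of the scaling, the snippet below evaluates the formula from `app.py`, `min(120, 30 + (max_new_tokens * num_beams) // 8)`, for a few inputs. The function name `estimate_duration` and the reduced two-argument signature are simplifications for this sketch; the real `_estimate_duration` mirrors the full `translate()` signature.

```python
def estimate_duration(max_new_tokens: int = 512, num_beams: int = 1) -> int:
    # 30s base plus one second per 8 token-beam units, capped at 120s.
    return min(120, 30 + (max_new_tokens * num_beams) // 8)


for tokens, beams in [(10, 1), (512, 1), (512, 2), (512, 8)]:
    seconds = estimate_duration(tokens, beams)
    print(f"max_new_tokens={tokens} num_beams={beams} -> {seconds}s")
```

With these inputs the reservations are 31s, 94s, 120s, and 120s. The cap is reached once `max_new_tokens * num_beams` is at least 720, so the defaults (512 tokens, 1 beam) stay under it while any beam search at 512 tokens hits it.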

requirements.txt: drop gradio and spaces (installed by the Spaces base
image on every tier; listing them drifts the runtime) and pin torch to a
ZeroGPU-supported version (2.11.0) instead of >=2,<3 which can trip
CONFIG_ERROR. gradio/spaces moved to requirements-dev.txt for local runs.
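
The manifest rules above could be guarded with a small check like the one below. This is a hypothetical helper, not part of the commit: `check_manifest`, `PLATFORM_PROVIDED`, and `ACCEPTED_TORCH` are names invented for this sketch, and the accepted torch versions are copied from the comment added to `requirements.txt` rather than queried from the platform.

```python
import re

# Packages the commit says the Spaces runtime installs itself.
PLATFORM_PROVIDED = {"gradio", "spaces"}
# Versions listed in the new requirements.txt comment (assumption: still current).
ACCEPTED_TORCH = {"2.8.0", "2.9.1", "2.10.0", "2.11.0"}


def check_manifest(text: str) -> list[str]:
    """Return a list of human-readable problems found in a requirements file."""
    problems: list[str] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = re.match(r"([A-Za-z0-9_.-]+)\s*(.*)", line)
        if match is None:
            continue
        name, spec = match.group(1).lower(), match.group(2).strip()
        if name in PLATFORM_PROVIDED:
            problems.append(f"{name}: provided by the Spaces runtime, should not be listed")
        if name == "torch":
            pin = re.fullmatch(r"==\s*([0-9.]+)", spec)
            if pin is None or pin.group(1) not in ACCEPTED_TORCH:
                problems.append(f"torch: expected an exact pin to an accepted version, got {spec!r}")
    return problems


old_manifest = "accelerate>=1,<2\ngradio>=6,<7\nspaces>=0.34,<1\ntorch>=2,<3\n"
new_manifest = "# HF Spaces build manifest.\naccelerate>=1,<2\ntorch==2.11.0\n"
print(check_manifest(old_manifest))
print(check_manifest(new_manifest))
```

Run against the pre-commit manifest this reports three problems (gradio, spaces, and the torch range); the post-commit manifest reports none.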

README: add python_version: "3.12" (Spaces default is 3.10).

Tests: +3 fast (eager-load gating skip/run, duration estimator) -> 46
fast / 56 total. ruff + ty clean. Docs updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (6)
  1. CLAUDE.md +5 -5
  2. README.md +4 -3
  3. app.py +35 -2
  4. requirements-dev.txt +6 -0
  5. requirements.txt +7 -3
  6. tests/test_app.py +34 -0
CLAUDE.md CHANGED
@@ -25,8 +25,8 @@ uv run ruff format .
 uv run ty check
 
 # Test
-uv run pytest # all 53 tests (slow require CUDA + model download)
-uv run pytest -m "not slow" # 43 fast tests only
+uv run pytest # all 56 tests (slow require CUDA + model download)
+uv run pytest -m "not slow" # 46 fast tests only
 uv run pytest -m slow # 10 model tests only (CUDA only)
 
 # Generate language mapping (dev only)
@@ -35,19 +35,19 @@ uv run scripts/generate_langmap.py <path-to-paper.pdf>
 
 ## Architecture
 
-**`app.py`** — Single-file application with a Google Translate-style layout: top row has two symmetric, filterable, region-sorted language dropdowns (source defaults to "English (en)", target defaults to "French (fr)") with a swap button ("⇄") between them; below that, input textbox and output textbox with copy button side by side. The Translate button spans full width below both textboxes (shows "Translating..." during processing). Ctrl+Enter submits from the input. The model auto-detects source language; the source dropdown is for user reference and the swap button only. Uses `@lru_cache` for lazy loading of the `google/madlad400-3b-mt` tokenizer and model (no download on import). Uses `float16` on CUDA, `float32` on CPU. MPS is not supported (produces garbage output with T5 models). Translation prepends a target language token with a space to the input text (e.g., `<2fr> Hello`) before tokenization and generation. The `@spaces.GPU` decorator allocates GPU on HF Spaces infrastructure.
+**`app.py`** — Single-file application with a Google Translate-style layout: top row has two symmetric, filterable, region-sorted language dropdowns (source defaults to "English (en)", target defaults to "French (fr)") with a swap button ("⇄") between them; below that, input textbox (autofocused) and output textbox with copy button side by side. The Translate button spans full width below both textboxes (shows "Translating..." during processing). Ctrl+Enter submits from the input. The model auto-detects source language; the source dropdown is for user reference and the swap button only. Uses `@lru_cache` for lazy loading of the `google/madlad400-3b-mt` tokenizer and model. On ZeroGPU (`SPACES_ZERO_GPU=1`), `_maybe_eager_load()` places the model at module scope so the `spaces` hijack can pack weights and stream them into workers for fast cold starts; off-ZeroGPU (local, tests, cpu-basic) it stays lazy, so importing the app never downloads the model. Uses `float16` on CUDA, `float32` on CPU. MPS is not supported (produces garbage output with T5 models). Translation prepends a target language token with a space to the input text (e.g., `<2fr> Hello`) before tokenization and generation. The `@spaces.GPU` decorator allocates GPU on HF Spaces infrastructure; its `duration` is a callable (`_estimate_duration`) that scales the GPU reservation with `max_new_tokens × num_beams` (capped at 120s). The submit handler exposes a stable `/translate` API endpoint; the swap and Translate-button handlers are `api_visibility="private"`. Only `/translate` is public.
 
 **`langmap/`** — Package with `langid_mapping.py`, mapping 418 language tokens to `{"name": ..., "region": ...}` dicts. Auto-generated by `scripts/generate_langmap.py` from Table 9 (Section A.1) of the MADLAD-400 paper. Available languages at runtime are the intersection of this mapping and the model's vocabulary.
 
 **`scripts/`** — `generate_langmap.py` parses the MADLAD-400 paper PDF (Table 9, pages 16-22) using pdfplumber and generates the static language mapping with region assignments. Dev-only tool; requires `requirements-dev.txt` dependencies.
 
-**`tests/`** — 53 tests (43 fast, 10 slow). `test_langmap.py` has 10 fast tests for mapping validation (dict shape, regions, spot-checks). `test_app.py` has 33 fast tests (signatures, device fallback, UI layout with symmetric dropdowns, swap button, textbox config including toolbar buttons and input autofocus, handler wiring, stable `translate` API endpoint with UI-only handlers kept private, no HTML elements, locale codes, no title) and 10 slow tests (translation with various parameters, language mapping). Slow tests require CUDA and model download; auto-skipped without CUDA.
+**`tests/`** — 56 tests (46 fast, 10 slow). `test_langmap.py` has 10 fast tests for mapping validation (dict shape, regions, spot-checks). `test_app.py` has 36 fast tests (signatures, device fallback, ZeroGPU eager-load gating, GPU duration estimator, UI layout with symmetric dropdowns, swap button, textbox config including toolbar buttons and input autofocus, handler wiring, stable `translate` API endpoint with UI-only handlers kept private, no HTML elements, locale codes, no title) and 10 slow tests (translation with various parameters, language mapping). Slow tests require CUDA and model download; auto-skipped without CUDA.
 
 ## Tooling
 
 When working with Python, invoke the relevant `/astral:<skill>` for uv, ty, and ruff to ensure best practices are followed.
 
-- **uv** — Python package manager. Used for venv creation and dependency installation from `requirements.txt`. No `pyproject.toml`; `requirements.txt` remains the single source of truth (required by HF Spaces).
+- **uv** — Python package manager. Used for venv creation and dependency installation. No `pyproject.toml` (HF Spaces requires `requirements.txt`). `requirements.txt` is the Spaces build manifest and omits `gradio`/`spaces` (provided by the Spaces runtime on every tier) and pins `torch` to a ZeroGPU-supported version; `requirements-dev.txt` adds `gradio`/`spaces` for local runs plus the dev tooling, so local setup installs both files.
 - **Ruff** — linter and formatter (`ruff.toml`). Rules: `E`, `F`, `I`, `UP`, `W`. Line length: 120.
 - **ty** — type checker (`ty.toml`). Python 3.12 target.
 - **pytest** — test runner (`pytest.ini`). Custom `slow` marker for CUDA-dependent tests.
README.md CHANGED
@@ -6,6 +6,7 @@ colorTo: green
 sdk: gradio
 sdk_version: 6.17.3
 app_file: app.py
+python_version: "3.12"
 pinned: false
 license: apache-2.0
 short_description: Translate between 418 languages.
@@ -26,7 +27,7 @@ Translate between 418 languages from Table 9 (Section A.1) of Google's [MADLAD-4
 ```bash
 uv venv --python 3.12
 uv pip install -r requirements.txt
-uv pip install -r requirements-dev.txt # dev tools
+uv pip install -r requirements-dev.txt # local runtime (gradio, spaces) + dev tools
 uv run app.py
 ```
 
@@ -38,6 +39,6 @@ The Gradio interface launches at `http://localhost:7860`.
 uv run ruff check . # lint
 uv run ruff format . # format
 uv run ty check # type check
-uv run pytest -m "not slow" # 40 fast tests
-uv run pytest # all 50 tests (slow require CUDA + model download)
+uv run pytest -m "not slow" # 46 fast tests
+uv run pytest # all 56 tests (slow require CUDA + model download)
 ```
app.py CHANGED
@@ -3,6 +3,8 @@ Translation interface using the MADLAD-400 3B model.
 Translates between 418 languages from the MADLAD-400 paper.
 """
 
+import os
+import time
 import warnings
 from collections.abc import Generator
 from functools import lru_cache
@@ -57,7 +59,33 @@ def _build_language_mappings() -> tuple[dict[str, str], list[str]]:
     return name_to_code, sorted_names
 
 
-@spaces.GPU
+def _maybe_eager_load() -> None:
+    """On ZeroGPU, place the model at module scope so the ``spaces`` hijack can pack
+    weights to disk at startup and stream them into each worker's VRAM (fast cold
+    starts). Off-ZeroGPU (local, tests, cpu-basic) this is a no-op, so importing the
+    app never downloads the model. ``SPACES_ZERO_GPU`` is set only on ZeroGPU."""
+    if os.environ.get("SPACES_ZERO_GPU") == "1":
+        _load_tokenizer()
+        _load_model()
+
+
+def _estimate_duration(
+    text: str,
+    target_language_name: str,
+    max_new_tokens: int = 512,
+    num_beams: int = 1,
+    temperature: float = 1.0,
+) -> int:
+    """Reserve GPU time scaled to the worst case: generation cost grows with the
+    number of tokens generated and the beam width. Mirrors translate()'s signature
+    (ZeroGPU calls the duration callable with the decorated function's args).
+    Conservative and capped at 120s; calibrate from the perf_counter log in
+    translate() (zerogpu.md 'Sizing duration')."""
+    del text, target_language_name, temperature  # only token/beam counts drive runtime
+    return min(120, 30 + (max_new_tokens * num_beams) // 8)
+
+
+@spaces.GPU(duration=_estimate_duration)
 def translate(
     text: str,
     target_language_name: str,
@@ -84,8 +112,12 @@ def translate(
         generate_kwargs["do_sample"] = True
         generate_kwargs["temperature"] = temperature
 
+    start = time.perf_counter()
     outputs = model.generate(**generate_kwargs)
-    return tokenizer.decode(outputs[0], skip_special_tokens=True)
+    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    elapsed = time.perf_counter() - start
+    print(f"[translate] max_new_tokens={max_new_tokens} num_beams={num_beams} took {elapsed:.1f}s")
+    return result
 
 
 def _translate_with_loading(
@@ -164,6 +196,7 @@ def _build_demo() -> gr.Blocks:
 
 
 demo = _build_demo()
+_maybe_eager_load()
 
 
 def main() -> None:
requirements-dev.txt CHANGED
@@ -1,3 +1,9 @@
+# Local-dev only (not read by HF Spaces). gradio and spaces are provided by the
+# Spaces runtime on every tier, but must be installed to run the app locally.
+gradio>=6,<7
+spaces>=0.34,<1
+
+# Dev tooling.
 pdfplumber>=0.11,<1
 pytest>=9,<10
 ruff>=0.11,<1
requirements.txt CHANGED
@@ -1,7 +1,11 @@
+# HF Spaces build manifest.
+# gradio and spaces are intentionally NOT listed: the Gradio SDK base image installs
+# both on every hardware tier (gradio is locked by `sdk_version` in README.md; `spaces`
+# is platform-pinned by the ZeroGPU runtime). Listing either causes resolution failures
+# or silently drifts the runtime. They live in requirements-dev.txt for local runs.
+# torch is pinned to a ZeroGPU-supported version (accepted: 2.8.0/2.9.1/2.10.0/2.11.0).
 accelerate>=1,<2
-gradio>=6,<7
 sentencepiece>=0.2,<1
-spaces>=0.34,<1
 tokenizers>=0.21,<1
-torch>=2,<3
+torch==2.11.0
 transformers>=4,<5
tests/test_app.py CHANGED
@@ -66,6 +66,40 @@ def test_get_device_warns_on_cpu():
         app._get_device()
 
 
+def test_maybe_eager_load_skipped_off_zerogpu(monkeypatch):
+    """Off ZeroGPU, _maybe_eager_load() must not load the model (no download on import)."""
+    import app
+
+    monkeypatch.delenv("SPACES_ZERO_GPU", raising=False)
+    with patch.object(app, "_load_model") as load_model, patch.object(app, "_load_tokenizer") as load_tokenizer:
+        app._maybe_eager_load()
+    load_model.assert_not_called()
+    load_tokenizer.assert_not_called()
+
+
+def test_maybe_eager_load_runs_on_zerogpu(monkeypatch):
+    """On ZeroGPU (SPACES_ZERO_GPU=1), _maybe_eager_load() eagerly loads model + tokenizer."""
+    import app
+
+    monkeypatch.setenv("SPACES_ZERO_GPU", "1")
+    with patch.object(app, "_load_model") as load_model, patch.object(app, "_load_tokenizer") as load_tokenizer:
+        app._maybe_eager_load()
+    load_model.assert_called_once()
+    load_tokenizer.assert_called_once()
+
+
+def test_estimate_duration_is_input_aware_and_capped():
+    """Duration should scale with tokens*beams, give small inputs a smaller reservation, and cap at 120s."""
+    import app
+
+    small = app._estimate_duration("hi", "French (fr)", max_new_tokens=10, num_beams=1)
+    default = app._estimate_duration("hi", "French (fr)", max_new_tokens=512, num_beams=1)
+    heavy = app._estimate_duration("hi", "French (fr)", max_new_tokens=512, num_beams=8)
+    assert small < default <= 120
+    assert heavy == 120  # capped
+    assert all(isinstance(d, int) for d in (small, default, heavy))
+
+
 # --- UI component tests ---
 
 