Spaces: darylalim/madlad-400-translate (Running on Zero)
Daryl Lim and Claude Opus 4.8 (1M context) committed 3 days ago

Commit ca97bb4 (1 parent: 2ba774e)

docs: record ZeroGPU pickle and T5 attention constraints in app.py

Add two in-code guard-rails surfaced by the huggingface-zerogpu skill:

- translate(): note that @spaces.GPU return values cross the worker->main
boundary via pickle, so it must return the decoded str, never a CUDA
tensor (unpickling one triggers blocked torch.cuda._lazy_init() and
hangs the call).

- _load_model(): record that MADLAD-400 is a T5 model whose
relative-position-bias attention is eager-only (transformers rejects
sdpa / flash_attention_2 / flash_attention_3, verified empirically), so
the usual ZeroGPU attention speedups and AoTI do not apply.

Doc-only; no behavior change. 46 fast tests pass, ruff + ty clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
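
The eager-only constraint described in the second bullet could also be made explicit at load time. The following is a minimal sketch, not code from this commit: the helper `attention_kwargs` and the set `EAGER_ONLY_MODEL_TYPES` are hypothetical names introduced for illustration, and treating "t5" as eager-only rests on the commit message's own claim. The only library feature it relies on is the existing `attn_implementation` keyword of `from_pretrained` in transformers.

```python
# Hypothetical helper, not part of app.py: pick an attention backend by model type.
# Assumption (taken from the commit message): T5 models only support eager attention.
EAGER_ONLY_MODEL_TYPES = {"t5"}


def attention_kwargs(model_type: str, requested: str = "sdpa") -> dict[str, str]:
    """Return from_pretrained keyword arguments selecting the attention backend."""
    if model_type in EAGER_ONLY_MODEL_TYPES:
        # Fall back to eager rather than let the load fail on an unsupported backend.
        return {"attn_implementation": "eager"}
    return {"attn_implementation": requested}


# Intended use (not executed here, because it would download the model):
# AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, dtype=dtype, **attention_kwargs("t5"))
```

Keeping the decision in one small function would let the comment in `_load_model()` point at a single place if other model types were ever added.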

Files changed (1)

app.py +8 -0


@@ -38,6 +38,10 @@ def _load_tokenizer() -> T5TokenizerFast:
 def _load_model() -> T5ForConditionalGeneration:
     device = _get_device()
     dtype = torch.float16 if device.type == "cuda" else torch.float32
+    # MADLAD-400 is a T5 model: its relative-position-bias attention is eager-only
+    # (transformers rejects sdpa / flash_attention_2 / flash_attention_3), so the usual
+    # ZeroGPU attention speedups don't apply. AoTI targets fixed-shape forward passes
+    # (e.g. diffusion), not T5's autoregressive .generate(), so it isn't wired up either.
     return AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, dtype=dtype).to(device)
@@ -117,6 +121,10 @@ def translate(
     result = tokenizer.decode(outputs[0], skip_special_tokens=True)
     elapsed = time.perf_counter() - start
     print(f"[translate] max_new_tokens={max_new_tokens} num_beams={num_beams} took {elapsed:.1f}s")
+    # @spaces.GPU return values cross the worker -> main process boundary via pickle.
+    # Return the decoded str, never the raw `outputs` CUDA tensor: unpickling a CUDA
+    # tensor in the main process triggers torch.cuda._lazy_init(), which ZeroGPU blocks
+    # and hangs the call. CPU tensors, numpy arrays, and plain str/objects are safe.
     return result
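
The pickle constraint on `translate()` can be illustrated without a GPU. This is a rough sketch under stated assumptions, not code from app.py: `cross_process_boundary` and `ensure_str_result` are hypothetical helpers, and the pickle round trip only approximates the worker -> main hop the comment describes. It shows the safe case, a decoded `str`, and a guard that rejects anything else before it leaves the function.

```python
import pickle


def cross_process_boundary(value):
    """Approximate the worker -> main hop by round-tripping a value through pickle."""
    return pickle.loads(pickle.dumps(value))


def ensure_str_result(value: object) -> str:
    """Hypothetical guard: only let a decoded string leave the GPU function."""
    if not isinstance(value, str):
        raise TypeError(f"expected decoded str, got {type(value).__name__}")
    return value


# A plain str survives the round trip unchanged and needs no CUDA initialisation.
decoded = ensure_str_result("Bonjour le monde")
received = cross_process_boundary(decoded)
```

A guard like this fails fast with a `TypeError` in the worker instead of hanging in the main process, at the cost of also rejecting the CPU tensors and numpy arrays that the comment lists as safe.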