Spaces:
Running on Zero
Running on Zero
Daryl Lim Claude Opus 4.8 (1M context) commited on
Commit ·
ca97bb4
1
Parent(s): 2ba774e
docs: record ZeroGPU pickle and T5 attention constraints in app.py
Browse filesAdd two in-code guard-rails surfaced by the huggingface-zerogpu skill:
- translate(): note that @spaces.GPU return values cross the worker->main
boundary via pickle, so it must return the decoded str, never a CUDA
tensor (unpickling one triggers blocked torch.cuda._lazy_init() and
hangs the call).
- _load_model(): record that MADLAD-400 is a T5 model whose
relative-position-bias attention is eager-only (transformers rejects
sdpa / flash_attention_2 / flash_attention_3, verified empirically), so
the usual ZeroGPU attention speedups and AoTI do not apply.
Doc-only; no behavior change. 46 fast tests pass, ruff + ty clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
app.py
CHANGED
|
@@ -38,6 +38,10 @@ def _load_tokenizer() -> T5TokenizerFast:
|
|
| 38 |
def _load_model() -> T5ForConditionalGeneration:
|
| 39 |
device = _get_device()
|
| 40 |
dtype = torch.float16 if device.type == "cuda" else torch.float32
|
|
|
|
|
|
|
|
|
|
|
|
|
| 41 |
return AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, dtype=dtype).to(device)
|
| 42 |
|
| 43 |
|
|
@@ -117,6 +121,10 @@ def translate(
|
|
| 117 |
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
| 118 |
elapsed = time.perf_counter() - start
|
| 119 |
print(f"[translate] max_new_tokens={max_new_tokens} num_beams={num_beams} took {elapsed:.1f}s")
|
|
|
|
|
|
|
|
|
|
|
|
|
| 120 |
return result
|
| 121 |
|
| 122 |
|
|
|
|
| 38 |
def _load_model() -> T5ForConditionalGeneration:
|
| 39 |
device = _get_device()
|
| 40 |
dtype = torch.float16 if device.type == "cuda" else torch.float32
|
| 41 |
+
# MADLAD-400 is a T5 model: its relative-position-bias attention is eager-only
|
| 42 |
+
# (transformers rejects sdpa / flash_attention_2 / flash_attention_3), so the usual
|
| 43 |
+
# ZeroGPU attention speedups don't apply. AoTI targets fixed-shape forward passes
|
| 44 |
+
# (e.g. diffusion), not T5's autoregressive .generate(), so it isn't wired up either.
|
| 45 |
return AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, dtype=dtype).to(device)
|
| 46 |
|
| 47 |
|
|
|
|
| 121 |
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
| 122 |
elapsed = time.perf_counter() - start
|
| 123 |
print(f"[translate] max_new_tokens={max_new_tokens} num_beams={num_beams} took {elapsed:.1f}s")
|
| 124 |
+
# @spaces.GPU return values cross the worker -> main process boundary via pickle.
|
| 125 |
+
# Return the decoded str, never the raw `outputs` CUDA tensor: unpickling a CUDA
|
| 126 |
+
# tensor in the main process triggers torch.cuda._lazy_init(), which ZeroGPU blocks
|
| 127 |
+
# and hangs the call. CPU tensors, numpy arrays, and plain str/objects are safe.
|
| 128 |
return result
|
| 129 |
|
| 130 |
|