Daryl Lim Claude Opus 4.8 (1M context) commited on
Commit
ca97bb4
·
1 Parent(s): 2ba774e

docs: record ZeroGPU pickle and T5 attention constraints in app.py

Browse files

Add two in-code guard-rails surfaced by the huggingface-zerogpu skill:

- translate(): note that @spaces.GPU return values cross the worker->main
boundary via pickle, so it must return the decoded str, never a CUDA
tensor (unpickling one triggers blocked torch.cuda._lazy_init() and
hangs the call).

- _load_model(): record that MADLAD-400 is a T5 model whose
relative-position-bias attention is eager-only (transformers rejects
sdpa / flash_attention_2 / flash_attention_3, verified empirically), so
the usual ZeroGPU attention speedups and AoTI do not apply.

Doc-only; no behavior change. 46 fast tests pass, ruff + ty clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (1) hide show
  1. app.py +8 -0
app.py CHANGED
@@ -38,6 +38,10 @@ def _load_tokenizer() -> T5TokenizerFast:
38
  def _load_model() -> T5ForConditionalGeneration:
39
  device = _get_device()
40
  dtype = torch.float16 if device.type == "cuda" else torch.float32
 
 
 
 
41
  return AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, dtype=dtype).to(device)
42
 
43
 
@@ -117,6 +121,10 @@ def translate(
117
  result = tokenizer.decode(outputs[0], skip_special_tokens=True)
118
  elapsed = time.perf_counter() - start
119
  print(f"[translate] max_new_tokens={max_new_tokens} num_beams={num_beams} took {elapsed:.1f}s")
 
 
 
 
120
  return result
121
 
122
 
 
38
  def _load_model() -> T5ForConditionalGeneration:
39
  device = _get_device()
40
  dtype = torch.float16 if device.type == "cuda" else torch.float32
41
+ # MADLAD-400 is a T5 model: its relative-position-bias attention is eager-only
42
+ # (transformers rejects sdpa / flash_attention_2 / flash_attention_3), so the usual
43
+ # ZeroGPU attention speedups don't apply. AoTI targets fixed-shape forward passes
44
+ # (e.g. diffusion), not T5's autoregressive .generate(), so it isn't wired up either.
45
  return AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, dtype=dtype).to(device)
46
 
47
 
 
121
  result = tokenizer.decode(outputs[0], skip_special_tokens=True)
122
  elapsed = time.perf_counter() - start
123
  print(f"[translate] max_new_tokens={max_new_tokens} num_beams={num_beams} took {elapsed:.1f}s")
124
+ # @spaces.GPU return values cross the worker -> main process boundary via pickle.
125
+ # Return the decoded str, never the raw `outputs` CUDA tensor: unpickling a CUDA
126
+ # tensor in the main process triggers torch.cuda._lazy_init(), which ZeroGPU blocks
127
+ # and hangs the call. CPU tensors, numpy arrays, and plain str/objects are safe.
128
  return result
129
 
130