karlexmarin Claude Opus 4.7 (1M context) committed 27 days ago

Commit 12e81e6 · 1 Parent(s): 22784b8

v0.9.4: Launch-Flag Generator mode + Zenodo record update

Launch-Flag Generator: model + GPU + context → the exact llama.cpp / Ollama
launch command, the question the VRAM calculators don't answer (they say
"fits", not "here's the command").

- js/launch_flags.js: VRAM model. Weights come from bits/param applied to an
exact decoder parameter count (attention + SwiGLU + embeddings, with GQA)
rather than the 12·h² shortcut, which undercounts large-FFN models such as
Qwen2.5-7B. KV cache comes from head geometry; scratch is a coarse estimate.
Computes the -ngl layer offload, the fit verdict, and the TAF horizon check,
which warns when the target context is past d_horizon (KV memory wasted).
launchCommands() emits llama-server and Ollama snippets with -c, -fa,
-ctk/-ctv, and --no-mmap (Blackwell OOM fix).
- index.html: tab + tile + #launch-section (GPU presets, quant, cache, FA) +
help v0.9.4. main.js: import, wiring, autocomplete auto-fetch, render.
- i18n.js: full EN/ES/FR/ZH.
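
The VRAM model described for js/launch_flags.js can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the file's actual contents: the function names, the bits-per-weight table, the 0.5 GiB scratch default and the choice to spread weights and KV evenly across layers are all hypothetical. The Qwen2.5-7B geometry is quoted from its published config as I recall it; norms and biases are ignored.

```javascript
// Sketch only: names and constants are illustrative, not taken from
// js/launch_flags.js.

// Approximate bits per weight for a few GGUF quants (assumed values).
const BITS_PER_WEIGHT = { F16: 16, Q8_0: 8.5, Q4_K_M: 4.85 };

// Bytes per KV-cache element (fp16, half, quarter); block scales are ignored.
const KV_BYTES = { fp16: 2, q8_0: 1, q4_0: 0.5 };

// Decoder parameter count with GQA attention and a SwiGLU FFN.
function decoderParamCount(g) {
  const headDim = g.hidden / g.heads;
  const kvDim = headDim * g.kvHeads;
  const attn = 2 * g.hidden * g.hidden + 2 * g.hidden * kvDim; // Q,O + K,V
  const ffn = 3 * g.hidden * g.intermediate;                   // gate, up, down
  const embed = g.vocab * g.hidden * (g.tiedEmbeddings ? 1 : 2);
  return g.layers * (attn + ffn) + embed;
}

// KV cache size from head geometry: K and V, per layer, per KV head.
function kvCacheBytes(g, ctx, cacheType = "fp16") {
  const headDim = g.hidden / g.heads;
  return 2 * g.layers * g.kvHeads * headDim * ctx * KV_BYTES[cacheType];
}

// Coarse offload estimate: how many layers fit after reserving scratch.
function estimateOffload(g, { quant, ctx, cacheType = "fp16", vramGiB, scratchGiB = 0.5 }) {
  const GiB = 2 ** 30;
  const weights = decoderParamCount(g) * BITS_PER_WEIGHT[quant] / 8;
  const kv = kvCacheBytes(g, ctx, cacheType);
  const perLayer = (weights + kv) / g.layers; // simplification: even split
  const budget = (vramGiB - scratchGiB) * GiB;
  const ngl = Math.max(0, Math.min(g.layers, Math.floor(budget / perLayer)));
  const verdict = ngl === g.layers ? "FITS" : ngl > 0 ? "PARTIAL" : "TOO BIG";
  return { weights, kv, ngl, verdict };
}

// Qwen2.5-7B geometry (assumed from its config.json).
const qwen25_7b = {
  hidden: 3584, heads: 28, kvHeads: 4, intermediate: 18944,
  layers: 28, vocab: 152064, tiedEmbeddings: false,
};

console.log(decoderParamCount(qwen25_7b)); // 7615283200 (~7.6B)
```

For comparison, the 12·h² shortcut gives 28 × 12 × 3584² ≈ 4.3B layer parameters for the same model, against roughly 6.5B from the per-layer terms above, which is the undercount the commit message refers to.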

Also: updated the paper Zenodo link 19826343 → 20314038 across the app
(index.html, i18n.js 4 langs) and tracked docs/README citations.

Test (test_launch.mjs): 21/21 — fetch geometry, FITS/PARTIAL verdicts,
--no-mmap on full offload, -ctk on cache quant, beyond-trained warning, 4
languages. 25 modes total, 0 JS errors.
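
The behaviours the test list names (--no-mmap on full offload, -ctk on cache quant, the beyond-trained warning) could be exercised against something like the sketch below. It is a hedged illustration, not the real launchCommands(): the function names, option names and the plain "greater than" thresholds are assumptions, and only the llama-server side is shown. The flags themselves (-m, -ngl, -c, -fa, -ctk, -ctv, --no-mmap) are existing llama.cpp options; some newer builds expect a value after -fa, so check `llama-server --help` for your build.

```javascript
// Sketch only: hypothetical names, not the contents of js/launch_flags.js.

// Build a llama-server command line from the computed settings.
function llamaServerCommand({ modelPath, ngl, totalLayers, ctx, flashAttn, cacheType }) {
  const args = ["llama-server", "-m", modelPath, "-ngl", String(ngl), "-c", String(ctx)];
  if (flashAttn) args.push("-fa");
  // Only emit cache-type flags when the KV cache is quantized.
  if (cacheType && cacheType !== "fp16") args.push("-ctk", cacheType, "-ctv", cacheType);
  // Full offload: keep weights out of mmap (the commit's Blackwell OOM fix).
  if (ngl >= totalLayers) args.push("--no-mmap");
  return args.join(" ");
}

// Collect warning keys. dHorizon is taken as an input here; computing it is
// TAF's job and is not reproduced in this sketch.
function launchWarnings({ ctx, trainedCtx, dHorizon, ngl, totalLayers }) {
  const warnings = [];
  if (dHorizon != null && ctx > dHorizon) warnings.push("horizon_wasted");
  if (trainedCtx != null && ctx > trainedCtx) warnings.push("beyond_trained");
  if (ngl >= totalLayers) warnings.push("no_mmap");
  else if (ngl > 0) warnings.push("partial");
  else warnings.push("cpu_only");
  return warnings;
}

const cmd = llamaServerCommand({
  modelPath: "model.gguf", ngl: 28, totalLayers: 28,
  ctx: 32768, flashAttn: true, cacheType: "q8_0",
});
console.log(cmd);
```

The warning keys mirror the launch.warn.* entries added to js/i18n.js, so a renderer could look up the translated text by key.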

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (10)

README.md +2 -2
docs/hf-post-v053-fix.md +1 -1
hf-post-announcement.md +1 -1
hf-space-readme.md +1 -1
index.html +69 -1
js/i18n.js +168 -4
js/launch_flags.js +170 -0
js/main.js +126 -1
registry-bootstrap/README.md +1 -1
test_launch.mjs +77 -0

README.md CHANGED

@@ -46,7 +46,7 @@ language:
 **🌐 Live**: https://karlesmarin.github.io/tafagent  ·  HF Space: https://huggingface.co/spaces/karlexmarin/taf-agent
 **📦 Source**: https://github.com/karlesmarin/tafagent  ·  Lean repo: https://github.com/karlesmarin/lean-taf
-**📄 Paper**: [Predicting How Transformers Attend — Marin 2026](https://zenodo.org/records/19826343)
 **🗂️ Dataset**: [taf-attention-decay (58 measurements, 32 models)](https://huggingface.co/datasets/karlexmarin/taf-attention-decay)
 ---
@@ -413,7 +413,7 @@ If this tool helps you — paper or code:
 Analytic Power-Law Theory, Phase Transitions, and Practical Compression
 Tools},
   year    = {2026},
-  url     = {https://zenodo.org/records/19826343},
 }
 @misc{marin2026tafagent,

 **🌐 Live**: https://karlesmarin.github.io/tafagent  ·  HF Space: https://huggingface.co/spaces/karlexmarin/taf-agent
 **📦 Source**: https://github.com/karlesmarin/tafagent  ·  Lean repo: https://github.com/karlesmarin/lean-taf
+**📄 Paper**: [Predicting How Transformers Attend — Marin 2026](https://zenodo.org/records/20314038)
 **🗂️ Dataset**: [taf-attention-decay (58 measurements, 32 models)](https://huggingface.co/datasets/karlexmarin/taf-attention-decay)
 ---
 Analytic Power-Law Theory, Phase Transitions, and Practical Compression
 Tools},
   year    = {2026},
+  url     = {https://zenodo.org/records/20314038},
 }
 @misc{marin2026tafagent,

docs/hf-post-v053-fix.md CHANGED

@@ -156,5 +156,5 @@ If you spot anything else wrong — please open an issue.
 **Links**:
 - Live: https://huggingface.co/spaces/karlexmarin/taf-agent
 - Source: https://github.com/karlesmarin/tafagent
-- Paper: https://zenodo.org/records/19826343
 - Dataset: https://huggingface.co/datasets/karlexmarin/taf-attention-decay

 **Links**:
 - Live: https://huggingface.co/spaces/karlexmarin/taf-agent
 - Source: https://github.com/karlesmarin/tafagent
+- Paper: https://zenodo.org/records/20314038
 - Dataset: https://huggingface.co/datasets/karlexmarin/taf-attention-decay

hf-post-announcement.md CHANGED

@@ -5,7 +5,7 @@ No server, no auth, no cost. Runs entirely in your browser.
 🌐 **Try it**: https://huggingface.co/spaces/karlexmarin/taf-agent
 📦 **Source**: https://github.com/karlesmarin/tafagent
-📄 **Paper**: [Predicting How Transformers Attend](https://zenodo.org/records/19826343)
 ## What it answers

 🌐 **Try it**: https://huggingface.co/spaces/karlexmarin/taf-agent
 📦 **Source**: https://github.com/karlesmarin/tafagent
+📄 **Paper**: [Predicting How Transformers Attend](https://zenodo.org/records/20314038)
 ## What it answers

hf-space-readme.md CHANGED

@@ -66,7 +66,7 @@ Predicts practical viability of any transformer LLM from its config alone:
 ## Underlying paper
-[Marin 2026 — Predicting How Transformers Attend](https://zenodo.org/records/19826343)
 ## Source

 ## Underlying paper
+[Marin 2026 — Predicting How Transformers Attend](https://zenodo.org/records/20314038)
 ## Source

index.html CHANGED

@@ -249,6 +249,9 @@
       <p><strong data-i18n="help.v091.gguf.title">🧊 GGUF Validity Bridge</strong></p>
       <p data-i18n="help.v091.gguf.body">The dozen GGUF/VRAM calculators (NyxKrage, oobabooga, …) read a <code>.gguf</code> header to tell you if a quant <em>fits in your GPU</em>. This reads the same header — via HTTP Range, so no multi-GB download — and answers the question they skip: <em>does it fit AND still work?</em> Paste a GGUF repo, pick a quant file; the bridge pulls <code>rope_theta</code>, <code>context_length</code>, the quant scheme (from <code>general.file_type</code> or the filename), and head geometry, then runs TAF's γ_Padé / d_horizon plus the architecture-aware quant-regime γ-shift. Output: effective attention horizon at the trained context, how far the quant erodes γ (and ΔPPL) for <em>this</em> model, and a verdict — HEALTHY / USABLE-WITH-CARE / DEGRADES. <em>Use case</em>: 'unsloth/Qwen3.5-9B-GGUF Q4_K_M fits 8GB — but is it brain-dead past 30K?' → see the horizon and the Q4 γ-penalty before you download 6 GB.</p>
       <h3 data-i18n="help.audit.title">The audit chain</h3>
       <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
       output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
@@ -282,7 +285,7 @@
       <h3 data-i18n="help.source.title">Source &amp; paper</h3>
       <p data-i18n="help.source.body">Source code: <a href="https://github.com/karlesmarin/tafagent" target="_blank">github.com/karlesmarin/tafagent</a><br>
-      Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href="https://zenodo.org/records/19826343" target="_blank">Zenodo</a>; arXiv forthcoming)<br>
       Dataset: <a href="https://huggingface.co/datasets/karlexmarin/taf-attention-decay" target="_blank">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)</p>
     </div>
   </div>
@@ -412,6 +415,7 @@
             <button data-mode-link="quant" data-i18n="modes.quant">⚖️ Quant</button>
             <button data-mode-link="yarn" data-i18n="modes.yarn">🧵 YaRN Planner</button>
             <button data-mode-link="gguf" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
             <button data-mode-link="inspector" data-i18n="modes.inspector">🔍 Inspect config</button>
           </div>
         </div>
@@ -508,6 +512,7 @@
         <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
         <button class="mode-btn" data-mode="yarn" role="tab" aria-selected="false" data-i18n="modes.yarn">🧵 YaRN Planner</button>
         <button class="mode-btn" data-mode="gguf" role="tab" aria-selected="false" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
       </div>
       <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
         <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
@@ -1333,6 +1338,69 @@
       <div id="gguf-output" style="display:none; margin-top:1em;"></div>
     </section>
     <!-- Recipe selector (mode=recipe) -->
     <section id="recipe-section" style="display:none;">
       <h2 data-i18n="recipe.title">📋 Recipe</h2>

       <p><strong data-i18n="help.v091.gguf.title">🧊 GGUF Validity Bridge</strong></p>
       <p data-i18n="help.v091.gguf.body">The dozen GGUF/VRAM calculators (NyxKrage, oobabooga, …) read a <code>.gguf</code> header to tell you if a quant <em>fits in your GPU</em>. This reads the same header — via HTTP Range, so no multi-GB download — and answers the question they skip: <em>does it fit AND still work?</em> Paste a GGUF repo, pick a quant file; the bridge pulls <code>rope_theta</code>, <code>context_length</code>, the quant scheme (from <code>general.file_type</code> or the filename), and head geometry, then runs TAF's γ_Padé / d_horizon plus the architecture-aware quant-regime γ-shift. Output: effective attention horizon at the trained context, how far the quant erodes γ (and ΔPPL) for <em>this</em> model, and a verdict — HEALTHY / USABLE-WITH-CARE / DEGRADES. <em>Use case</em>: 'unsloth/Qwen3.5-9B-GGUF Q4_K_M fits 8GB — but is it brain-dead past 30K?' → see the horizon and the Q4 γ-penalty before you download 6 GB.</p>
+      <p><strong data-i18n="help.v094.launch.title">🚀 Launch-Flag Generator</strong></p>
+      <p data-i18n="help.v094.launch.body">The VRAM calculators tell you <em>whether</em> a model fits; they don't hand you the command. This does. Pick a model (fetches geometry from HF <code>config.json</code>), a quant, a GPU and a target context — it computes the VRAM breakdown (weights + KV cache + scratch), how many layers to offload (<code>-ngl</code>), and emits the copy-paste <code>llama-server</code> and Ollama commands with <code>-c</code> context, <code>-fa</code> flash-attention, KV-cache type, and <code>--no-mmap</code> (the Blackwell OOM fix: force all weights into physical VRAM). Plus the TAF reality check no calculator gives: if you're allocating KV for a context past the model's d_horizon, it warns you that memory is wasted — the attention won't reach there. <em>Use case</em>: 'What <code>-ngl</code> for Llama-70B-Q4 on my 4090?' → 39 of 80 layers, exact command, and a note if your context is past the usable horizon.</p>
       <h3 data-i18n="help.audit.title">The audit chain</h3>
       <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
       output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
       <h3 data-i18n="help.source.title">Source &amp; paper</h3>
       <p data-i18n="help.source.body">Source code: <a href="https://github.com/karlesmarin/tafagent" target="_blank">github.com/karlesmarin/tafagent</a><br>
+      Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href="https://zenodo.org/records/20314038" target="_blank">Zenodo</a>; arXiv forthcoming)<br>
       Dataset: <a href="https://huggingface.co/datasets/karlexmarin/taf-attention-decay" target="_blank">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)</p>
     </div>
   </div>
             <button data-mode-link="quant" data-i18n="modes.quant">⚖️ Quant</button>
             <button data-mode-link="yarn" data-i18n="modes.yarn">🧵 YaRN Planner</button>
             <button data-mode-link="gguf" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
+            <button data-mode-link="launch" data-i18n="modes.launch">🚀 Launch Flags</button>
             <button data-mode-link="inspector" data-i18n="modes.inspector">🔍 Inspect config</button>
           </div>
         </div>
         <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
         <button class="mode-btn" data-mode="yarn" role="tab" aria-selected="false" data-i18n="modes.yarn">🧵 YaRN Planner</button>
         <button class="mode-btn" data-mode="gguf" role="tab" aria-selected="false" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
+        <button class="mode-btn" data-mode="launch" role="tab" aria-selected="false" data-i18n="modes.launch">🚀 Launch Flags</button>
       </div>
       <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
         <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
       <div id="gguf-output" style="display:none; margin-top:1em;"></div>
     </section>
+    <!-- Launch-flag generator (mode=launch) -->
+    <section id="launch-section" style="display:none;">
+      <h2><span data-i18n="launch.title">🚀 Launch-Flag Generator</span>
+        <span class="info"><span class="tooltip" data-i18n="launch.tip">
+          <strong>Exact flags + why, not just "fits"</strong>. The VRAM calculators tell you whether a
+          model fits. This gives you the copy-paste <code>llama.cpp</code> / <code>Ollama</code> command —
+          <code>-ngl</code> layers to offload, <code>-c</code> context, <code>--no-mmap</code>,
+          KV-cache type — AND the TAF reality check: if you allocate KV for 128K but the model's
+          attention horizon is 32K, that VRAM is wasted.
+        </span></span>
+      </h2>
+      <p class="recipe-desc" data-i18n="launch.desc">
+        Pick a model, GPU and target context → get the exact launch command, a VRAM breakdown
+        (weights + KV cache + overhead), and how many layers to offload. Solves the recurring
+        "what <code>-ngl</code> do I use?" / Blackwell OOM guesswork.
+      </p>
+      <div class="form-row">
+        <label for="launch-model" data-i18n="launch.model_label">HF model id:</label>
+        <input type="text" id="launch-model" placeholder="Qwen/Qwen2.5-7B-Instruct">
+        <button id="launch-fetch-btn" class="secondary" data-i18n="launch.fetch_btn">📥 Fetch geometry</button>
+      </div>
+      <span id="launch-status" class="subtle"></span>
+      <div class="form-row">
+        <label for="launch-quant" data-i18n="launch.quant_label">Quant:</label>
+        <select id="launch-quant">
+          <option value="Q4_K_M">Q4_K_M (4-bit, sweet spot)</option>
+          <option value="Q8_0">Q8_0 (8-bit)</option>
+          <option value="Q6_K">Q6_K</option>
+          <option value="Q5_K_M">Q5_K_M</option>
+          <option value="Q4_0">Q4_0</option>
+          <option value="Q3_K_M">Q3_K_M</option>
+          <option value="Q2_K">Q2_K (extreme)</option>
+          <option value="F16">F16 (full)</option>
+        </select>
+      </div>
+      <div class="form-row">
+        <label for="launch-gpu" data-i18n="launch.gpu_label">GPU:</label>
+        <select id="launch-gpu"></select>
+        <input type="number" id="launch-vram" placeholder="or custom VRAM (GB)" min="1" style="width:11em;">
+      </div>
+      <div class="form-row">
+        <label for="launch-ctx" data-i18n="launch.ctx_label">Target context L:</label>
+        <input type="number" id="launch-ctx" placeholder="32768" min="256">
+      </div>
+      <div class="form-row">
+        <label data-i18n="launch.adv_label">Advanced:</label>
+        <span>
+          <label data-i18n="launch.cache_label">KV cache:</label>
+          <select id="launch-cache">
+            <option value="fp16">fp16</option>
+            <option value="q8_0">q8_0 (½ KV)</option>
+            <option value="q4_0">q4_0 (¼ KV)</option>
+          </select>
+          &nbsp;
+          <label><input type="checkbox" id="launch-fa" checked> <span data-i18n="launch.fa_label">Flash attention (-fa)</span></label>
+        </span>
+      </div>
+      <button id="launch-gen-btn" data-i18n="launch.gen_btn">🚀 Generate flags</button>
+      <div id="launch-output" style="display:none; margin-top:1em;"></div>
+    </section>
     <!-- Recipe selector (mode=recipe) -->
     <section id="recipe-section" style="display:none;">
       <h2 data-i18n="recipe.title">📋 Recipe</h2>

js/i18n.js CHANGED

@@ -429,6 +429,47 @@ export const TRANSLATIONS = {
     "mode_desc.yarn":              "Generate the exact rope_scaling config to extend a model past its trained context — plus a TAF verdict on whether attention quality actually holds at the target length.",
     "modes.gguf":                  "🧊 GGUF Bridge",
     "mode_desc.gguf":              "Read a GGUF file's metadata header (rope_theta, context_length, quant) in your browser and get a TAF quality verdict — the question the VRAM calculators skip: fits AND works?",
     "gguf.title":                  "🧊 GGUF Validity Bridge",
     "gguf.tip":                    "<strong>Fits in VRAM ≠ works</strong>. The GGUF/VRAM calculators read a model's metadata to tell you if a quant <em>fits in your GPU</em>. This reads the SAME metadata (rope_theta, context_length, quant scheme, head geometry) straight from the <code>.gguf</code> header via HTTP Range — no multi-GB download — and answers the question they don't: does attention quality actually hold, and how much does the quant erode it (γ-shift, ΔPPL)?",
     "gguf.desc":                   "Paste a GGUF repo (e.g. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), pick a quant file, and get a TAF quality verdict: the model's effective attention horizon, plus how much the chosen quantization shifts γ for <em>this specific architecture</em>. Reads only the file header in your browser.",
@@ -1059,7 +1100,7 @@ export const TRANSLATIONS = {
     "help.privacy.title":       "Privacy",
     "help.privacy.body":        "Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.",
     "help.source.title":        "Source & paper",
-    "help.source.body":         "Source code: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/19826343\" target=\"_blank\">Zenodo</a>; arXiv forthcoming)<br>Dataset: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)",
     "footer.text":             "© 2026 Carles Marin · Apache-2.0 · independent research · the tool that closes the loop of the paper.",
@@ -1778,6 +1819,47 @@ export const TRANSLATIONS = {
     "mode_desc.yarn":              "Genera la configuración rope_scaling exacta para extender un modelo más allá de su contexto entrenado — más un veredicto TAF sobre si la calidad de atención aguanta realmente a la longitud objetivo.",
     "modes.gguf":                  "🧊 Puente GGUF",
     "mode_desc.gguf":              "Lee la cabecera de metadata de un archivo GGUF (rope_theta, context_length, quant) en tu navegador y obtén un veredicto de calidad TAF — la pregunta que los calculadores de VRAM ignoran: ¿cabe Y funciona?",
     "gguf.title":                  "🧊 Puente de validez GGUF",
     "gguf.tip":                    "<strong>Caber en VRAM ≠ funcionar</strong>. Los calculadores GGUF/VRAM leen la metadata de un modelo para decirte si un quant <em>cabe en tu GPU</em>. Esto lee la MISMA metadata (rope_theta, context_length, esquema de quant, geometría de cabezas) directamente de la cabecera <code>.gguf</code> vía HTTP Range — sin descargar GB — y responde lo que ellos no: ¿aguanta de verdad la calidad de atención, y cuánto la erosiona el quant (γ-shift, ΔPPL)?",
     "gguf.desc":                   "Pega un repo GGUF (p.ej. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), elige un archivo de quant, y obtén un veredicto de calidad TAF: el horizonte de atención efectivo del modelo, más cuánto desplaza γ la cuantización elegida para <em>esta arquitectura concreta</em>. Solo lee la cabecera del archivo en tu navegador.",
@@ -2408,7 +2490,7 @@ export const TRANSLATIONS = {
     "help.privacy.title":       "Privacidad",
     "help.privacy.body":        "Todo corre en tu navegador. Sin telemetría, sin analytics, sin datos enviados a ningún sitio. Incluso el modelo LLM corre localmente vía WebGPU/WebAssembly. Tus model_ids y preguntas nunca abandonan esta página.",
     "help.source.title":        "Código fuente y paper",
-    "help.source.body":         "Código: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/19826343\" target=\"_blank\">Zenodo</a>; arXiv próximamente)<br>Dataset: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 mediciones γ sobre 32 modelos (CC-BY-4.0)",
     "footer.text":             "© 2026 Carles Marin · Apache-2.0 · investigación independiente · la herramienta que cierra el círculo del paper.",
   },
@@ -2981,6 +3063,47 @@ export const TRANSLATIONS = {
     "mode_desc.yarn":              "Génère la configuration rope_scaling exacte pour étendre un modèle au-delà de son contexte d'entraînement — plus un verdict TAF sur la tenue réelle de la qualité d'attention à la longueur cible.",
     "modes.gguf":                  "🧊 Pont GGUF",
     "mode_desc.gguf":              "Lit l'en-tête de métadonnées d'un fichier GGUF (rope_theta, context_length, quant) dans votre navigateur et donne un verdict de qualité TAF — la question que les calculateurs de VRAM ignorent : tient ET fonctionne ?",
     "gguf.title":                  "🧊 Pont de validité GGUF",
     "gguf.tip":                    "<strong>Tenir dans la VRAM ≠ fonctionner</strong>. Les calculateurs GGUF/VRAM lisent les métadonnées d'un modèle pour dire si un quant <em>tient dans le GPU</em>. Ceci lit les MÊMES métadonnées (rope_theta, context_length, schéma de quant, géométrie des têtes) directement depuis l'en-tête <code>.gguf</code> via HTTP Range — sans télécharger des Go — et répond à ce qu'ils n'abordent pas : la qualité d'attention tient-elle vraiment, et de combien le quant l'érode-t-il (γ-shift, ΔPPL) ?",
     "gguf.desc":                   "Collez un dépôt GGUF (ex. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), choisissez un fichier de quant, et obtenez un verdict de qualité TAF : l'horizon d'attention effectif du modèle, plus de combien la quantification choisie décale γ pour <em>cette architecture précise</em>. Ne lit que l'en-tête du fichier dans votre navigateur.",
@@ -3611,7 +3734,7 @@ export const TRANSLATIONS = {
     "help.privacy.title":       "Confidentialité",
     "help.privacy.body":        "Tout s'exécute dans votre navigateur. Pas de télémétrie, pas d'analytique, pas de données envoyées ailleurs. Même le modèle LLM s'exécute localement via WebGPU/WebAssembly. Vos model_ids et questions ne quittent jamais cette page.",
     "help.source.title":        "Code source et paper",
-    "help.source.body":         "Code : <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper : <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/19826343\" target=\"_blank\">Zenodo</a> ; arXiv à venir)<br>Dataset : <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 mesures γ sur 32 modèles (CC-BY-4.0)",
     "footer.text":             "© 2026 Carles Marin · Apache-2.0 · recherche indépendante · l'outil qui ferme la boucle du paper.",
   },
@@ -4184,6 +4307,47 @@ export const TRANSLATIONS = {
     "mode_desc.yarn":              "生成精确的 rope_scaling 配置以将模型扩展到训练上下文之外 —— 外加 TAF 裁决：在目标长度下注意力质量是否真的撑得住。",
     "modes.gguf":                  "🧊 GGUF 桥",
     "mode_desc.gguf":              "在浏览器内读取 GGUF 文件的元数据头（rope_theta、context_length、量化），给出 TAF 质量裁决 —— 显存计算器跳过的那个问题：塞得进且还能用吗？",
     "gguf.title":                  "🧊 GGUF 有效性桥",
     "gguf.tip":                    "<strong>塞进显存 ≠ 能用</strong>。GGUF/显存计算器读取模型元数据来告诉你某量化<em>是否塞得进 GPU</em>。本工具通过 HTTP Range 直接从 <code>.gguf</code> 头读取同样的元数据（rope_theta、context_length、量化方案、注意力头几何）—— 无需下载数 GB —— 并回答它们不答的：注意力质量是否真的撑得住，量化又侵蚀了多少（γ-shift、ΔPPL）？",
     "gguf.desc":                   "粘贴一个 GGUF 仓库（如 <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>），选择一个量化文件，获得 TAF 质量裁决：模型的有效注意力视界，以及所选量化对<em>这个具体架构</em>的 γ 位移有多大。只在浏览器内读取文件头。",
@@ -4814,7 +4978,7 @@ export const TRANSLATIONS = {
     "help.privacy.title":       "隐私",
     "help.privacy.body":        "一切都在您的浏览器中运行。无遥测,无分析,无数据发送到任何地方。即使是 LLM 模型也通过 WebGPU/WebAssembly 在本地运行。您的 model_ids 和问题永不离开此页面。",
     "help.source.title":        "源代码和论文",
-    "help.source.body":         "源代码: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>论文: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/19826343\" target=\"_blank\">Zenodo</a>; arXiv 即将)<br>数据集: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 32个模型上的58次γ测量 (CC-BY-4.0)",
     "footer.text":             "© 2026 Carles Marin · Apache-2.0 · 独立研究 · 闭合论文回路的工具。",
   },

     "mode_desc.yarn":              "Generate the exact rope_scaling config to extend a model past its trained context — plus a TAF verdict on whether attention quality actually holds at the target length.",
     "modes.gguf":                  "🧊 GGUF Bridge",
     "mode_desc.gguf":              "Read a GGUF file's metadata header (rope_theta, context_length, quant) in your browser and get a TAF quality verdict — the question the VRAM calculators skip: fits AND works?",
+    "modes.launch":                "🚀 Launch Flags",
+    "mode_desc.launch":            "Model + GPU + context → the exact llama.cpp / Ollama launch command (-ngl, -c, --no-mmap, KV-cache type) with a VRAM breakdown and a TAF warning when your context is past the usable horizon.",
+    "launch.title":                "🚀 Launch-Flag Generator",
+    "launch.tip":                  "<strong>Exact flags + why, not just \"fits\"</strong>. The VRAM calculators tell you whether a model fits. This gives you the copy-paste <code>llama.cpp</code> / <code>Ollama</code> command — <code>-ngl</code> layers to offload, <code>-c</code> context, <code>--no-mmap</code>, KV-cache type — AND the TAF reality check: if you allocate KV for 128K but the model's attention horizon is 32K, that VRAM is wasted.",
+    "launch.desc":                 "Pick a model, GPU and target context → get the exact launch command, a VRAM breakdown (weights + KV cache + overhead), and how many layers to offload. Solves the recurring \"what <code>-ngl</code> do I use?\" / Blackwell OOM guesswork.",
+    "launch.model_label":          "HF model id:",
+    "launch.fetch_btn":            "📥 Fetch geometry",
+    "launch.quant_label":          "Quant:",
+    "launch.gpu_label":            "GPU:",
+    "launch.ctx_label":            "Target context L:",
+    "launch.adv_label":            "Advanced:",
+    "launch.cache_label":          "KV cache:",
+    "launch.fa_label":             "Flash attention (-fa)",
+    "launch.gen_btn":              "🚀 Generate flags",
+    "launch.need_id":              "Enter a model id like 'Qwen/Qwen2.5-7B-Instruct'",
+    "launch.fetching":             "Fetching config.json from HF Hub…",
+    "launch.layers":               "layers",
+    "launch.fetched_hint":         "Pick GPU + context, then Generate flags.",
+    "launch.need_fetch":           "Fetch a model first (📥 Fetch geometry).",
+    "launch.verdict.fits":         "FITS — fully on GPU",
+    "launch.verdict.partial":      "PARTIAL — some layers on CPU (slower)",
+    "launch.verdict.too_big":      "TOO BIG — won't fit any layers on this GPU",
+    "launch.r.weights":            "Weights",
+    "launch.r.kv":                 "KV cache",
+    "launch.r.overhead":           "Overhead / scratch",
+    "launch.r.total":              "Total",
+    "launch.r.ngl":                "Layers to offload (-ngl)",
+    "launch.r.all":                "all",
+    "launch.r.note":               "VRAM is an estimate (weights from bits/param, KV from head geometry, scratch coarse). d_horizon from γ_Padé. Verify the fit with a real load — leave ~1 GB headroom.",
+    "launch.warn.horizon_wasted":  "Target context is well past the model's attention horizon — KV memory for context beyond it is wasted; the model won't attend there. (TAF)",
+    "launch.warn.beyond_trained":  "L exceeds the trained context — you also need RoPE scaling to position-encode that far (see the YaRN Planner).",
+    "launch.warn.no_mmap":         "All layers fit → added --no-mmap to force weights into physical VRAM (avoids the Blackwell illegal-memory / OOM-at-load issue).",
+    "launch.warn.partial":         "Only some layers fit on GPU — the rest run on CPU (much slower). Drop to a smaller quant or shorter context to fit fully.",
+    "launch.warn.cpu_only":        "Won't fit any layers at these settings — CPU only. Use a smaller quant/context or a bigger GPU.",
+    "launch.warn.no_params":       "Couldn't read parameter count — weights size is a rough estimate from geometry.",
+    "launch.err.no_geom":          "Fetch a model first to read its geometry.",
+    "launch.err.no_gpu":           "Pick a GPU or enter a custom VRAM size.",
+    "launch.err.no_ctx":           "Enter a target context length L.",
+    "launch.copy":                 "Copy command",
+    "help.v094.launch.title":      "🚀 Launch-Flag Generator",
+    "help.v094.launch.body":       "The VRAM calculators tell you <em>whether</em> a model fits; they don't hand you the command. This does. Pick a model (fetches geometry from HF <code>config.json</code>), a quant, a GPU and a target context — it computes the VRAM breakdown (weights + KV cache + scratch), how many layers to offload (<code>-ngl</code>), and emits the copy-paste <code>llama-server</code> and Ollama commands with <code>-c</code> context, <code>-fa</code> flash-attention, KV-cache type, and <code>--no-mmap</code> (the Blackwell OOM fix). Plus the TAF reality check no calculator gives: if you're allocating KV for a context past the model's d_horizon, it warns you that memory is wasted. <em>Use case</em>: 'What <code>-ngl</code> for Llama-70B-Q4 on my 4090?' → 39 of 80 layers, exact command, and a note if your context is past the usable horizon.",
     "gguf.title":                  "🧊 GGUF Validity Bridge",
     "gguf.tip":                    "<strong>Fits in VRAM ≠ works</strong>. The GGUF/VRAM calculators read a model's metadata to tell you if a quant <em>fits in your GPU</em>. This reads the SAME metadata (rope_theta, context_length, quant scheme, head geometry) straight from the <code>.gguf</code> header via HTTP Range — no multi-GB download — and answers the question they don't: does attention quality actually hold, and how much does the quant erode it (γ-shift, ΔPPL)?",
     "gguf.desc":                   "Paste a GGUF repo (e.g. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), pick a quant file, and get a TAF quality verdict: the model's effective attention horizon, plus how much the chosen quantization shifts γ for <em>this specific architecture</em>. Reads only the file header in your browser.",
     "help.privacy.title":       "Privacy",
     "help.privacy.body":        "Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.",
     "help.source.title":        "Source & paper",
+    "help.source.body":         "Source code: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a>; arXiv forthcoming)<br>Dataset: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)",
     "footer.text":             "© 2026 Carles Marin · Apache-2.0 · independent research · the tool that closes the loop of the paper.",
     "mode_desc.yarn":              "Genera la configuración rope_scaling exacta para extender un modelo más allá de su contexto entrenado — más un veredicto TAF sobre si la calidad de atención aguanta realmente a la longitud objetivo.",
     "modes.gguf":                  "🧊 Puente GGUF",
     "mode_desc.gguf":              "Lee la cabecera de metadata de un archivo GGUF (rope_theta, context_length, quant) en tu navegador y obtén un veredicto de calidad TAF — la pregunta que los calculadores de VRAM ignoran: ¿cabe Y funciona?",
+    "modes.launch":                "🚀 Flags de arranque",
+    "mode_desc.launch":            "Modelo + GPU + contexto → el comando exacto de arranque llama.cpp / Ollama (-ngl, -c, --no-mmap, tipo de KV-cache) con desglose de VRAM y aviso TAF cuando tu contexto pasa el horizonte usable.",
+    "launch.title":                "🚀 Generador de flags de arranque",
+    "launch.tip":                  "<strong>Flags exactos + por qué, no solo \"cabe\"</strong>. Los calculadores de VRAM te dicen si un modelo cabe. Esto te da el comando <code>llama.cpp</code> / <code>Ollama</code> para pegar — <code>-ngl</code> capas a offload, <code>-c</code> contexto, <code>--no-mmap</code>, tipo de KV-cache — Y el chequeo de realidad TAF: si reservas KV para 128K pero el horizonte de atención del modelo es 32K, esa VRAM se desperdicia.",
+    "launch.desc":                 "Elige modelo, GPU y contexto objetivo → obtén el comando exacto, desglose de VRAM (pesos + KV cache + overhead), y cuántas capas hacer offload. Resuelve el típico \"¿qué <code>-ngl</code> uso?\" / OOM de Blackwell.",
+    "launch.model_label":          "ID del modelo HF:",
+    "launch.fetch_btn":            "📥 Obtener geometría",
+    "launch.quant_label":          "Quant:",
+    "launch.gpu_label":            "GPU:",
+    "launch.ctx_label":            "Contexto objetivo L:",
+    "launch.adv_label":            "Avanzado:",
+    "launch.cache_label":          "KV cache:",
+    "launch.fa_label":             "Flash attention (-fa)",
+    "launch.gen_btn":              "🚀 Generar flags",
+    "launch.need_id":              "Introduce un id de modelo como 'Qwen/Qwen2.5-7B-Instruct'",
+    "launch.fetching":             "Obteniendo config.json de HF Hub…",
+    "launch.layers":               "capas",
+    "launch.fetched_hint":         "Elige GPU + contexto, luego Generar flags.",
+    "launch.need_fetch":           "Obtén un modelo primero (📥 Obtener geometría).",
+    "launch.verdict.fits":         "CABE — todo en GPU",
+    "launch.verdict.partial":      "PARCIAL — algunas capas en CPU (más lento)",
+    "launch.verdict.too_big":      "DEMASIADO GRANDE — no cabe ninguna capa en esta GPU",
+    "launch.r.weights":            "Pesos",
+    "launch.r.kv":                 "KV cache",
+    "launch.r.overhead":           "Overhead / scratch",
+    "launch.r.total":              "Total",
+    "launch.r.ngl":                "Capas a offload (-ngl)",
+    "launch.r.all":                "todas",
+    "launch.r.note":               "La VRAM es una estimación (pesos por bits/param, KV por geometría de cabezas, scratch aproximado). d_horizon desde γ_Padé. Verifica el ajuste con una carga real — deja ~1 GB de margen.",
+    "launch.warn.horizon_wasted":  "El contexto objetivo pasa bastante el horizonte de atención del modelo — la KV para contexto más allá se desperdicia; el modelo no atenderá ahí. (TAF)",
+    "launch.warn.beyond_trained":  "L supera el contexto entrenado — también necesitas RoPE scaling para codificar posiciones tan lejos (ver Planificador YaRN).",
+    "launch.warn.no_mmap":         "Todas las capas caben → añadido --no-mmap para forzar los pesos a VRAM física (evita el problema de illegal-memory / OOM-al-cargar de Blackwell).",
+    "launch.warn.partial":         "Solo caben algunas capas en GPU — el resto corre en CPU (mucho más lento). Baja a un quant menor o contexto más corto para que quepa entero.",
+    "launch.warn.cpu_only":        "No cabe ninguna capa con estos ajustes — solo CPU. Usa un quant/contexto menor o una GPU mayor.",
+    "launch.warn.no_params":       "No se pudo leer el nº de parámetros — el tamaño de pesos es una estimación aproximada por geometría.",
+    "launch.err.no_geom":          "Obtén un modelo primero para leer su geometría.",
+    "launch.err.no_gpu":           "Elige una GPU o introduce un tamaño de VRAM personalizado.",
+    "launch.err.no_ctx":           "Introduce una longitud de contexto objetivo L.",
+    "launch.copy":                 "Copiar comando",
+    "help.v094.launch.title":      "🚀 Generador de flags de arranque",
+    "help.v094.launch.body":       "Los calculadores de VRAM te dicen <em>si</em> un modelo cabe; no te dan el comando. Esto sí. Elige un modelo (obtiene geometría del <code>config.json</code> de HF), un quant, una GPU y un contexto objetivo — calcula el desglose de VRAM (pesos + KV cache + scratch), cuántas capas hacer offload (<code>-ngl</code>), y emite los comandos para pegar de <code>llama-server</code> y Ollama con contexto <code>-c</code>, flash-attention <code>-fa</code>, tipo de KV-cache, y <code>--no-mmap</code> (el fix de OOM de Blackwell). Más el chequeo de realidad TAF que ningún calculador da: si reservas KV para un contexto más allá del d_horizon del modelo, te avisa de que esa memoria se desperdicia. <em>Caso de uso</em>: '¿Qué <code>-ngl</code> para Llama-70B-Q4 en mi 4090?' → 39 de 80 capas, comando exacto, y un aviso si tu contexto pasa el horizonte usable.",
     "gguf.title":                  "🧊 Puente de validez GGUF",
     "gguf.tip":                    "<strong>Caber en VRAM ≠ funcionar</strong>. Los calculadores GGUF/VRAM leen la metadata de un modelo para decirte si un quant <em>cabe en tu GPU</em>. Esto lee la MISMA metadata (rope_theta, context_length, esquema de quant, geometría de cabezas) directamente de la cabecera <code>.gguf</code> vía HTTP Range — sin descargar GB — y responde lo que ellos no: ¿aguanta de verdad la calidad de atención, y cuánto la erosiona el quant (γ-shift, ΔPPL)?",
     "gguf.desc":                   "Pega un repo GGUF (p.ej. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), elige un archivo de quant, y obtén un veredicto de calidad TAF: el horizonte de atención efectivo del modelo, más cuánto desplaza γ la cuantización elegida para <em>esta arquitectura concreta</em>. Solo lee la cabecera del archivo en tu navegador.",
     "help.privacy.title":       "Privacidad",
     "help.privacy.body":        "Todo corre en tu navegador. Sin telemetría, sin analytics, sin datos enviados a ningún sitio. Incluso el modelo LLM corre localmente vía WebGPU/WebAssembly. Tus model_ids y preguntas nunca abandonan esta página.",
     "help.source.title":        "Código fuente y paper",
+    "help.source.body":         "Código: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a>; arXiv próximamente)<br>Dataset: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 mediciones γ sobre 32 modelos (CC-BY-4.0)",
     "footer.text":             "© 2026 Carles Marin · Apache-2.0 · investigación independiente · la herramienta que cierra el círculo del paper.",
   },
     "mode_desc.yarn":              "Génère la configuration rope_scaling exacte pour étendre un modèle au-delà de son contexte d'entraînement — plus un verdict TAF sur la tenue réelle de la qualité d'attention à la longueur cible.",
     "modes.gguf":                  "🧊 Pont GGUF",
     "mode_desc.gguf":              "Lit l'en-tête de métadonnées d'un fichier GGUF (rope_theta, context_length, quant) dans votre navigateur et donne un verdict de qualité TAF — la question que les calculateurs de VRAM ignorent : tient ET fonctionne ?",
+    "modes.launch":                "🚀 Flags de lancement",
+    "mode_desc.launch":            "Modèle + GPU + contexte → la commande exacte llama.cpp / Ollama (-ngl, -c, --no-mmap, type de KV-cache) avec ventilation VRAM et alerte TAF quand le contexte dépasse l'horizon utile.",
+    "launch.title":                "🚀 Générateur de flags de lancement",
+    "launch.tip":                  "<strong>Flags exacts + pourquoi, pas juste \"tient\"</strong>. Les calculateurs de VRAM disent si un modèle tient. Ceci donne la commande <code>llama.cpp</code> / <code>Ollama</code> à coller — <code>-ngl</code> couches à décharger, <code>-c</code> contexte, <code>--no-mmap</code>, type de KV-cache — ET le contrôle de réalité TAF : si vous allouez du KV pour 128K mais que l'horizon d'attention du modèle est 32K, cette VRAM est gâchée.",
+    "launch.desc":                 "Choisissez un modèle, un GPU et un contexte cible → obtenez la commande exacte, une ventilation VRAM (poids + KV cache + overhead), et combien de couches décharger. Résout le \"quel <code>-ngl</code> ?\" / OOM Blackwell récurrent.",
+    "launch.model_label":          "ID du modèle HF :",
+    "launch.fetch_btn":            "📥 Récupérer la géométrie",
+    "launch.quant_label":          "Quant :",
+    "launch.gpu_label":            "GPU :",
+    "launch.ctx_label":            "Contexte cible L :",
+    "launch.adv_label":            "Avancé :",
+    "launch.cache_label":          "KV cache :",
+    "launch.fa_label":             "Flash attention (-fa)",
+    "launch.gen_btn":              "🚀 Générer les flags",
+    "launch.need_id":              "Saisissez un id de modèle comme 'Qwen/Qwen2.5-7B-Instruct'",
+    "launch.fetching":             "Récupération de config.json depuis HF Hub…",
+    "launch.layers":               "couches",
+    "launch.fetched_hint":         "Choisissez GPU + contexte, puis Générer les flags.",
+    "launch.need_fetch":           "Récupérez d'abord un modèle (📥 Récupérer la géométrie).",
+    "launch.verdict.fits":         "TIENT — entièrement sur GPU",
+    "launch.verdict.partial":      "PARTIEL — certaines couches sur CPU (plus lent)",
+    "launch.verdict.too_big":      "TROP GROS — aucune couche ne tient sur ce GPU",
+    "launch.r.weights":            "Poids",
+    "launch.r.kv":                 "KV cache",
+    "launch.r.overhead":           "Overhead / scratch",
+    "launch.r.total":              "Total",
+    "launch.r.ngl":                "Couches à décharger (-ngl)",
+    "launch.r.all":                "toutes",
+    "launch.r.note":               "La VRAM est une estimation (poids par bits/param, KV par géométrie des têtes, scratch grossier). d_horizon depuis γ_Padé. Vérifiez avec un chargement réel — laissez ~1 Go de marge.",
+    "launch.warn.horizon_wasted":  "Le contexte cible dépasse largement l'horizon d'attention du modèle — le KV au-delà est gâché ; le modèle n'y prêtera pas attention. (TAF)",
+    "launch.warn.beyond_trained":  "L dépasse le contexte d'entraînement — il faut aussi un RoPE scaling pour encoder les positions aussi loin (voir le Planificateur YaRN).",
+    "launch.warn.no_mmap":         "Toutes les couches tiennent → ajout de --no-mmap pour forcer les poids en VRAM physique (évite le problème illegal-memory / OOM-au-chargement de Blackwell).",
+    "launch.warn.partial":         "Seules certaines couches tiennent sur GPU — le reste tourne sur CPU (bien plus lent). Passez à un quant plus petit ou un contexte plus court pour tout faire tenir.",
+    "launch.warn.cpu_only":        "Aucune couche ne tient avec ces réglages — CPU seul. Utilisez un quant/contexte plus petit ou un GPU plus grand.",
+    "launch.warn.no_params":       "Impossible de lire le nombre de paramètres — la taille des poids est une estimation grossière par géométrie.",
+    "launch.err.no_geom":          "Récupérez d'abord un modèle pour lire sa géométrie.",
+    "launch.err.no_gpu":           "Choisissez un GPU ou saisissez une taille de VRAM personnalisée.",
+    "launch.err.no_ctx":           "Saisissez une longueur de contexte cible L.",
+    "launch.copy":                 "Copier la commande",
+    "help.v094.launch.title":      "🚀 Générateur de flags de lancement",
+    "help.v094.launch.body":       "Les calculateurs de VRAM disent <em>si</em> un modèle tient ; ils ne donnent pas la commande. Ceci si. Choisissez un modèle (récupère la géométrie du <code>config.json</code> HF), un quant, un GPU et un contexte cible — il calcule la ventilation VRAM (poids + KV cache + scratch), combien de couches décharger (<code>-ngl</code>), et émet les commandes à coller <code>llama-server</code> et Ollama avec contexte <code>-c</code>, flash-attention <code>-fa</code>, type de KV-cache, et <code>--no-mmap</code> (le fix OOM Blackwell). Plus le contrôle de réalité TAF qu'aucun calculateur ne donne : si vous allouez du KV pour un contexte au-delà du d_horizon du modèle, il vous avertit que cette mémoire est gâchée. <em>Cas d'usage</em> : 'Quel <code>-ngl</code> pour Llama-70B-Q4 sur mon 4090 ?' → 39 couches sur 80, commande exacte, et une note si le contexte dépasse l'horizon utile.",
     "gguf.title":                  "🧊 Pont de validité GGUF",
     "gguf.tip":                    "<strong>Tenir dans la VRAM ≠ fonctionner</strong>. Les calculateurs GGUF/VRAM lisent les métadonnées d'un modèle pour dire si un quant <em>tient dans le GPU</em>. Ceci lit les MÊMES métadonnées (rope_theta, context_length, schéma de quant, géométrie des têtes) directement depuis l'en-tête <code>.gguf</code> via HTTP Range — sans télécharger des Go — et répond à ce qu'ils n'abordent pas : la qualité d'attention tient-elle vraiment, et de combien le quant l'érode-t-il (γ-shift, ΔPPL) ?",
     "gguf.desc":                   "Collez un dépôt GGUF (ex. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), choisissez un fichier de quant, et obtenez un verdict de qualité TAF : l'horizon d'attention effectif du modèle, plus de combien la quantification choisie décale γ pour <em>cette architecture précise</em>. Ne lit que l'en-tête du fichier dans votre navigateur.",
     "help.privacy.title":       "Confidentialité",
     "help.privacy.body":        "Tout s'exécute dans votre navigateur. Pas de télémétrie, pas d'analytique, pas de données envoyées ailleurs. Même le modèle LLM s'exécute localement via WebGPU/WebAssembly. Vos model_ids et questions ne quittent jamais cette page.",
     "help.source.title":        "Code source et paper",
+    "help.source.body":         "Code : <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper : <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a> ; arXiv à venir)<br>Dataset : <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 mesures γ sur 32 modèles (CC-BY-4.0)",
     "footer.text":             "© 2026 Carles Marin · Apache-2.0 · recherche indépendante · l'outil qui ferme la boucle du paper.",
   },
     "mode_desc.yarn":              "生成精确的 rope_scaling 配置以将模型扩展到训练上下文之外 —— 外加 TAF 裁决：在目标长度下注意力质量是否真的撑得住。",
     "modes.gguf":                  "🧊 GGUF 桥",
     "mode_desc.gguf":              "在浏览器内读取 GGUF 文件的元数据头（rope_theta、context_length、量化），给出 TAF 质量裁决 —— 显存计算器跳过的那个问题：塞得进且还能用吗？",
+    "modes.launch":                "🚀 启动参数",
+    "mode_desc.launch":            "模型 + GPU + 上下文 → 精确的 llama.cpp / Ollama 启动命令（-ngl、-c、--no-mmap、KV-cache 类型），附显存明细，以及当上下文超过可用视界时的 TAF 警告。",
+    "launch.title":                "🚀 启动参数生成器",
+    "launch.tip":                  "<strong>精确参数 + 原因，不只是\"塞得进\"</strong>。显存计算器告诉你模型是否塞得进。本工具给你可复制粘贴的 <code>llama.cpp</code> / <code>Ollama</code> 命令 —— <code>-ngl</code> 卸载层数、<code>-c</code> 上下文、<code>--no-mmap</code>、KV-cache 类型 —— 以及 TAF 现实检查：若你为 128K 分配 KV 但模型注意力视界只有 32K，那部分显存就浪费了。",
+    "launch.desc":                 "选择模型、GPU 和目标上下文 → 获得精确启动命令、显存明细（权重 + KV cache + 开销），以及卸载多少层。解决常见的\"该用什么 <code>-ngl</code>？\"/ Blackwell OOM 的猜测。",
+    "launch.model_label":          "HF 模型 id：",
+    "launch.fetch_btn":            "📥 获取几何",
+    "launch.quant_label":          "量化：",
+    "launch.gpu_label":            "GPU：",
+    "launch.ctx_label":            "目标上下文 L：",
+    "launch.adv_label":            "高级：",
+    "launch.cache_label":          "KV cache：",
+    "launch.fa_label":             "Flash attention (-fa)",
+    "launch.gen_btn":              "🚀 生成参数",
+    "launch.need_id":              "输入模型 id，如 'Qwen/Qwen2.5-7B-Instruct'",
+    "launch.fetching":             "正在从 HF Hub 获取 config.json…",
+    "launch.layers":               "层",
+    "launch.fetched_hint":         "选择 GPU + 上下文，然后生成参数。",
+    "launch.need_fetch":           "请先获取模型（📥 获取几何）。",
+    "launch.verdict.fits":         "塞得进 —— 全部在 GPU",
+    "launch.verdict.partial":      "部分 —— 部分层在 CPU（更慢）",
+    "launch.verdict.too_big":      "太大 —— 此 GPU 一层都放不下",
+    "launch.r.weights":            "权重",
+    "launch.r.kv":                 "KV cache",
+    "launch.r.overhead":           "开销 / scratch",
+    "launch.r.total":              "总计",
+    "launch.r.ngl":                "卸载层数 (-ngl)",
+    "launch.r.all":                "全部",
+    "launch.r.note":               "显存为估计值（权重按 bits/参数，KV 按头几何，scratch 粗略）。d_horizon 来自 γ_Padé。请用真实加载核实 —— 留约 1 GB 余量。",
+    "launch.warn.horizon_wasted":  "目标上下文远超模型的注意力视界 —— 超出部分的 KV 内存被浪费；模型不会关注那里。(TAF)",
+    "launch.warn.beyond_trained":  "L 超过训练上下文 —— 还需要 RoPE scaling 才能编码那么远的位置（见 YaRN 规划器）。",
+    "launch.warn.no_mmap":         "所有层都放得下 → 已加 --no-mmap 强制权重进入物理显存（避免 Blackwell 的 illegal-memory / 加载时 OOM 问题）。",
+    "launch.warn.partial":         "只有部分层放进 GPU —— 其余在 CPU 运行（慢得多）。换更小的量化或更短的上下文以完整放入。",
+    "launch.warn.cpu_only":        "这些设置下一层都放不下 —— 仅 CPU。请用更小的量化/上下文或更大的 GPU。",
+    "launch.warn.no_params":       "无法读取参数量 —— 权重大小为按几何的粗略估计。",
+    "launch.err.no_geom":          "请先获取模型以读取其几何。",
+    "launch.err.no_gpu":           "请选择 GPU 或输入自定义显存大小。",
+    "launch.err.no_ctx":           "请输入目标上下文长度 L。",
+    "launch.copy":                 "复制命令",
+    "help.v094.launch.title":      "🚀 启动参数生成器",
+    "help.v094.launch.body":       "显存计算器告诉你模型<em>是否</em>塞得进；它们不给你命令。本工具给。选择一个模型（从 HF <code>config.json</code> 获取几何）、一个量化、一个 GPU 和目标上下文 —— 它计算显存明细（权重 + KV cache + scratch）、卸载多少层（<code>-ngl</code>），并输出可复制粘贴的 <code>llama-server</code> 和 Ollama 命令，带 <code>-c</code> 上下文、<code>-fa</code> flash-attention、KV-cache 类型，以及 <code>--no-mmap</code>（Blackwell OOM 修复）。还有任何计算器都不给的 TAF 现实检查：若你为超过模型 d_horizon 的上下文分配 KV，它会警告你那部分内存被浪费。<em>用例</em>：'我的 4090 上 Llama-70B-Q4 该用什么 <code>-ngl</code>？' → 80 层中的 39 层、精确命令，以及若上下文超过可用视界的提示。",
     "gguf.title":                  "🧊 GGUF 有效性桥",
     "gguf.tip":                    "<strong>塞进显存 ≠ 能用</strong>。GGUF/显存计算器读取模型元数据来告诉你某量化<em>是否塞得进 GPU</em>。本工具通过 HTTP Range 直接从 <code>.gguf</code> 头读取同样的元数据（rope_theta、context_length、量化方案、注意力头几何）—— 无需下载数 GB —— 并回答它们不答的：注意力质量是否真的撑得住，量化又侵蚀了多少（γ-shift、ΔPPL）？",
     "gguf.desc":                   "粘贴一个 GGUF 仓库（如 <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>），选择一个量化文件，获得 TAF 质量裁决：模型的有效注意力视界，以及所选量化对<em>这个具体架构</em>的 γ 位移有多大。只在浏览器内读取文件头。",
     "help.privacy.title":       "隐私",
     "help.privacy.body":        "一切都在您的浏览器中运行。无遥测,无分析,无数据发送到任何地方。即使是 LLM 模型也通过 WebGPU/WebAssembly 在本地运行。您的 model_ids 和问题永不离开此页面。",
     "help.source.title":        "源代码和论文",
+    "help.source.body":         "源代码: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>论文: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a>; arXiv 即将)<br>数据集: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 32个模型上的58次γ测量 (CC-BY-4.0)",
     "footer.text":             "© 2026 Carles Marin · Apache-2.0 · 独立研究 · 闭合论文回路的工具。",
   },

js/launch_flags.js ADDED

	@@ -0,0 +1,170 @@

+// Launch-Flag Generator (v0.9.4 anti-bullshit pack)
+//
+// Input a model + GPU + target context → the exact llama.cpp / Ollama launch
+// flags (-ngl layers to offload, -c context, --no-mmap, cache-type), with a
+// VRAM breakdown AND the TAF angle the pure VRAM calculators miss: "you CAN
+// allocate KV for 128K, but this model's attention horizon is ~32K — context
+// past that is wasted memory." Solves the recurring r/LocalLLaMA pain of
+// guessing -ngl / hitting Blackwell OOM. All browser-only.
+import { gammaPade } from "./gamma_check.js";
+import { dHorizon } from "./yarn_planner.js";
+// Curated GPU VRAM presets (GB). Unified-memory Macs included (shared pool).
+export const GPU_PRESETS = [
+  { id: "rtx3060",  label: "RTX 3060 12GB",     vram: 12 },
+  { id: "rtx4060ti",label: "RTX 4060 Ti 16GB",  vram: 16 },
+  { id: "rtx4070",  label: "RTX 4070 12GB",     vram: 12 },
+  { id: "rtx4080",  label: "RTX 4080 16GB",     vram: 16 },
+  { id: "rtx3090",  label: "RTX 3090 24GB",     vram: 24 },
+  { id: "rtx4090",  label: "RTX 4090 24GB",     vram: 24 },
+  { id: "rtx5090",  label: "RTX 5090 32GB",     vram: 32 },
+  { id: "a100_40",  label: "A100 40GB",         vram: 40 },
+  { id: "a100_80",  label: "A100 80GB",         vram: 80 },
+  { id: "h100",     label: "H100 80GB",         vram: 80 },
+  { id: "h200",     label: "H200 141GB",        vram: 141 },
+  { id: "mac32",    label: "Mac 32GB (unified)",vram: 24 },   // ~75% usable for GPU
+  { id: "mac64",    label: "Mac 64GB (unified)",vram: 48 },
+  { id: "mac128",   label: "Mac 128GB (unified)",vram: 96 },
+];
+// Effective bits-per-weight per GGUF quant (includes K-quant block overhead).
+export const QUANT_BPW = {
+  F16:    16.0,
+  Q8_0:    8.5,
+  Q6_K:    6.56,
+  Q5_K_M:  5.67,
+  Q4_K_M:  4.83,
+  Q4_0:    4.55,
+  Q3_K_M:  3.91,
+  Q2_K:    2.63,
+};
+// KV-cache element bytes per cache dtype.
+const CACHE_BYTES = { fp16: 2, q8_0: 1, q4_0: 0.5 };
+const GB = 1024 ** 3;
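A quick sanity check on the bits-per-weight table: weight size is parameter count times bits per weight, divided by 8. The sketch below restates that arithmetic outside the module. `weightsGiB` and `BPW` are hypothetical names used only here, and the 7.6e9 parameter count is an illustrative round number, not a value read from any model card.

```javascript
// Sketch only: restates the weights-size arithmetic used by the planner.
// BPW copies three entries of the QUANT_BPW table from the diff; weightsGiB is
// a hypothetical helper and is not part of launch_flags.js.
const BPW = { F16: 16.0, Q8_0: 8.5, Q4_K_M: 4.83 };
const GIB = 1024 ** 3;

function weightsGiB(nParams, quant) {
  const bpw = BPW[quant];
  if (bpw === undefined) throw new Error(`unknown quant: ${quant}`);
  return (nParams * bpw) / 8 / GIB; // bits -> bytes -> GiB
}

// Print the estimate for each quant at an assumed 7.6e9 parameters.
for (const quant of Object.keys(BPW)) {
  console.log(quant, weightsGiB(7.6e9, quant).toFixed(2), "GiB");
}
```

Unlike the planner, which falls back to 4.83 for an unknown quant name, this sketch throws, which makes a typo in the quant string visible instead of silently producing a Q4_K_M estimate.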
+// Estimate parameter count from geometry when the model card doesn't state it.
+// Uses the exact decoder layout (attention with GQA + SwiGLU MLP + embeddings)
+// when intermediate_size is known — the 12·h² shortcut undercounts modern
+// large-FFN models (Qwen2.5-7B is really 7.6B, not the ~5.4B the shortcut gives).
+export function estimateNParams({ nParams, hidden, nLayers, vocab, intermediate, nKvHeads, headDim, tieEmbeddings }) {
+  if (Number.isFinite(nParams) && nParams > 0) return nParams;
+  if (!hidden || !nLayers) return null;
+  let perLayer;
+  if (intermediate) {
+    const kvDim = (nKvHeads && headDim) ? nKvHeads * headDim : hidden; // GQA shrinks K,V
+    const attn = 2 * hidden * hidden + 2 * hidden * kvDim;             // q,o + k,v
+    const mlp = 3 * hidden * intermediate;                            // gate,up,down (SwiGLU)
+    perLayer = attn + mlp;
+  } else {
+    perLayer = 12 * hidden * hidden; // fallback heuristic
+  }
+  const embed = vocab ? (tieEmbeddings ? 1 : 2) * vocab * hidden : 0;
+  return perLayer * nLayers + embed;
+}
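The gap between the two counting rules is easy to check by hand. The sketch below is a standalone copy of the per-layer arithmetic above (norm weights and biases ignored), run on a geometry that I believe matches the published Qwen2.5-7B `config.json`; treat those numbers as an assumption and verify them against the config before relying on them. `countDecoderParams` and `shortcutParams` are hypothetical names.

```javascript
// Sketch only: exact decoder layout vs. the 12*h^2 shortcut.
function countDecoderParams({ hidden, nLayers, intermediate, nKvHeads, headDim, vocab, tied }) {
  const kvDim = nKvHeads * headDim;                      // GQA shrinks K and V
  const attn = 2 * hidden * hidden + 2 * hidden * kvDim; // q,o + k,v projections
  const mlp = 3 * hidden * intermediate;                 // gate, up, down (SwiGLU)
  const embed = (tied ? 1 : 2) * vocab * hidden;         // input (+ output) embeddings
  return (attn + mlp) * nLayers + embed;
}

function shortcutParams({ hidden, nLayers, vocab, tied }) {
  return 12 * hidden * hidden * nLayers + (tied ? 1 : 2) * vocab * hidden;
}

// Assumed geometry (Qwen2.5-7B-like); not fetched from the Hub in this sketch.
const qwenLike = {
  hidden: 3584, nLayers: 28, intermediate: 18944,
  nKvHeads: 4, headDim: 128, vocab: 152064, tied: false,
};

console.log(countDecoderParams(qwenLike)); // 7615283200  (~7.6B)
console.log(shortcutParams(qwenLike));     // 5405933568  (~5.4B)
```

The difference comes almost entirely from the MLP term: with `intermediate` about 5.3 times `hidden`, the three SwiGLU matrices hold roughly 204M parameters per layer, while the shortcut budgets only 8·h², about 103M, for the MLP.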
+// KV cache bytes for the whole model at context L.
+function kvCacheBytes(nLayers, nKvHeads, headDim, L, cacheType) {
+  const elem = CACHE_BYTES[cacheType] ?? 2;
+  return 2 /* K+V */ * nLayers * nKvHeads * headDim * L * elem;
+}
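To make the KV formula concrete, here is a minimal standalone sketch that evaluates it for one geometry. The geometry (28 layers, 4 KV heads, head dimension 128) is an assumed Qwen2.5-7B-like example, and `kvCacheGiB` is a hypothetical wrapper name. The per-element byte sizes mirror `CACHE_BYTES` in the diff; real `q8_0` and `q4_0` blocks also store a scale per block, so actual usage is slightly higher than these figures.

```javascript
// Sketch only: KV cache size = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes.
const CACHE_ELEM_BYTES = { fp16: 2, q8_0: 1, q4_0: 0.5 };

function kvCacheGiB({ nLayers, nKvHeads, headDim }, ctx, cacheType = "fp16") {
  const elem = CACHE_ELEM_BYTES[cacheType] ?? 2; // unknown type: assume fp16
  return (2 * nLayers * nKvHeads * headDim * ctx * elem) / 1024 ** 3;
}

// Assumed geometry for illustration.
const geom = { nLayers: 28, nKvHeads: 4, headDim: 128 };

console.log(kvCacheGiB(geom, 32768, "fp16"));  // 1.75
console.log(kvCacheGiB(geom, 32768, "q8_0"));  // 0.875
console.log(kvCacheGiB(geom, 131072, "fp16")); // 7
```

The cache grows linearly with context, so going from 32K to 128K quadruples it. That is the memory the horizon warning refers to when the target context is far past `d_horizon`.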
+export function planLaunch(opts) {
+  const {
+    nParams, nLayers, nKvHeads, headDim, hidden, ropeTheta, ctxTrain,
+    quant = "Q4_K_M", vramGB, targetCtx, cacheType = "fp16", flashAttn = true,
+  } = opts;
+  const out = { ok: false, warnings: [] };
+  if (!nLayers || !nKvHeads || !headDim) { out.verdict = "no_geometry"; return out; }
+  if (!Number.isFinite(vramGB) || vramGB <= 0) { out.verdict = "no_gpu"; return out; }
+  if (!Number.isFinite(targetCtx) || targetCtx <= 0) { out.verdict = "no_ctx"; return out; }
+  const bpw = QUANT_BPW[quant] ?? 4.83;
+  const N = estimateNParams({
+    nParams, hidden, nLayers, vocab: opts.vocab,
+    intermediate: opts.intermediate, nKvHeads, headDim, tieEmbeddings: opts.tieEmbeddings,
+  });
+  const weightsB = N ? (N * bpw / 8) : null;
+  const kvB = kvCacheBytes(nLayers, nKvHeads, headDim, targetCtx, cacheType);
+  // Compute/scratch buffer: roughly scales with context × hidden. Flash-attention
+  // shrinks the attention scratch substantially. Coarse estimate, flagged as such.
+  const scratchB = (flashAttn ? 0.25 : 0.6) * GB + (hidden ? 0.5 * hidden * targetCtx * 2 : 0);
+  const overheadB = 0.4 * GB + scratchB;
+  const weightsGB = weightsB != null ? weightsB / GB : null;
+  const kvGB = kvB / GB;
+  const overheadGB = overheadB / GB;
+  const totalGB = (weightsGB ?? 0) + kvGB + overheadGB;
+  // Layer-offload (-ngl). ~88% of weights live in transformer layers; the rest
+  // (embeddings/output) load with any GPU offload.
+  const layerFrac = 0.88;
+  const layerWeightsGB = weightsGB != null ? weightsGB * layerFrac : null;
+  const nonLayerGB = weightsGB != null ? weightsGB * (1 - layerFrac) : 0;
+  const kvPerLayerGB = kvGB / nLayers;
+  const perLayerGB = (layerWeightsGB != null ? layerWeightsGB / nLayers : 0) + kvPerLayerGB;
+  let ngl, allOnGpu, fits;
+  if (weightsGB == null) {
+    ngl = null; allOnGpu = false; fits = false;
+    out.warnings.push({ code: "no_params" });
+  } else if (totalGB <= vramGB) {
+    ngl = nLayers; allOnGpu = true; fits = true;
+  } else {
+    const avail = vramGB - overheadGB - nonLayerGB;
+    ngl = perLayerGB > 0 ? Math.max(0, Math.floor(avail / perLayerGB)) : 0;
+    ngl = Math.min(ngl, nLayers);
+    allOnGpu = false; fits = false;
+  }
+  // TAF horizon: does the model's attention actually reach the context you're
+  // paying KV memory for? This is the differentiator vs pure VRAM calculators.
+  const theta = Number(ropeTheta) || 10000;
+  const gammaTrain = ctxTrain ? gammaPade(theta, ctxTrain) : null;
+  const dHoriz = gammaTrain != null ? dHorizon(theta, gammaTrain) : null;
+  const horizonWasted = dHoriz != null && targetCtx > dHoriz * 1.25;
+  if (horizonWasted) out.warnings.push({ code: "horizon_wasted", params: { dHoriz, target: targetCtx } });
+  if (ctxTrain && targetCtx > ctxTrain) out.warnings.push({ code: "beyond_trained", params: { ctxTrain, target: targetCtx } });
+  if (allOnGpu) out.warnings.push({ code: "no_mmap_blackwell" });
+  if (!fits && ngl > 0) out.warnings.push({ code: "partial_offload", params: { ngl, nLayers } });
+  if (!fits && ngl === 0) out.warnings.push({ code: "cpu_only", params: {} });
+  out.ok = true;
+  Object.assign(out, {
+    verdict: fits ? "fits" : (ngl > 0 ? "partial" : "too_big"),
+    nParams: N, bpw, quant, cacheType, flashAttn,
+    weightsGB, kvGB, overheadGB, totalGB, vramGB,
+    ngl, allOnGpu, nLayers,
+    theta, dHoriz, gammaTrain, ctxTrain, targetCtx,
+  });
+  return out;
+}
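The layer-offload step in `planLaunch` can be isolated into a few lines. The sketch below assumes the same 88% layer fraction and the same rule (non-layer weights and overhead are charged first, then whole layers are packed into what remains). `layersThatFit` is a hypothetical name, and the input sizes are round illustrative numbers in the neighbourhood of a 70B Q4 model on a 24 GB card, not measurements.

```javascript
// Sketch only: how many transformer layers fit on the GPU.
function layersThatFit({ vramGB, weightsGB, kvGB, overheadGB, nLayers, layerFrac = 0.88 }) {
  const totalGB = weightsGB + kvGB + overheadGB;
  if (totalGB <= vramGB) return nLayers;                       // everything fits

  const nonLayerGB = weightsGB * (1 - layerFrac);              // embeddings / output
  const perLayerGB = (weightsGB * layerFrac + kvGB) / nLayers; // weights + KV per layer
  const availGB = vramGB - overheadGB - nonLayerGB;
  const n = perLayerGB > 0 ? Math.floor(availGB / perLayerGB) : 0;
  return Math.min(nLayers, Math.max(0, n));                    // clamp to [0, nLayers]
}

// Illustrative inputs (assumed, not measured).
const big = { weightsGB: 40, kvGB: 2.5, overheadGB: 0.7, nLayers: 80 };

console.log(layersThatFit({ ...big, vramGB: 24 })); // 39
console.log(layersThatFit({ ...big, vramGB: 48 })); // 80
console.log(layersThatFit({ ...big, vramGB: 4 }));  // 0
```

With 24 GB, about 18.5 GB remains after overhead and non-layer weights, and each layer costs about 0.47 GB, so 39 layers fit. Because the result is floored, the estimate errs toward leaving headroom rather than overcommitting.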
+// Build the copy-paste commands for both engines.
+export function launchCommands(plan, modelRef = "<model.gguf>") {
+  // plan.ngl is null when the parameter count is unknown; fall back to 0 (CPU only).
+  const nglStr = plan.allOnGpu ? "99" : String(plan.ngl ?? 0);
+  const cache = plan.cacheType !== "fp16" ? ` -ctk ${plan.cacheType} -ctv ${plan.cacheType}` : "";
+  const fa = plan.flashAttn ? " -fa" : "";
+  const mmap = plan.allOnGpu ? " --no-mmap" : "";
+  const llamacpp =
+    `llama-server -m ${modelRef} \\\n` +
+    `  -ngl ${nglStr} -c ${plan.targetCtx}${fa}${cache}${mmap}`;
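+  // Ollama: Modelfile params + env. num_gpu = layers on GPU. The two env vars
+  // are read by the Ollama server process, so they only take effect when set
+  // in the environment where `ollama serve` starts.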
+  const olEnv = [
+    plan.flashAttn ? "OLLAMA_FLASH_ATTENTION=1" : null,
+    plan.cacheType !== "fp16" ? `OLLAMA_KV_CACHE_TYPE=${plan.cacheType}` : null,
+  ].filter(Boolean).join(" ");
+  const ollama =
+    (olEnv ? olEnv + " \\\n" : "") +
+    `ollama run <model>\n` +
+    `# Modelfile / params:\n` +
+    `PARAMETER num_ctx ${plan.targetCtx}\n` +
+    `PARAMETER num_gpu ${plan.allOnGpu ? plan.nLayers : nglStr}`;
+  return { llamacpp, ollama };
+}
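For reference, this is roughly what the `llama-server` half of the output looks like for a full-offload plan with a quantized KV cache. The sketch below is a simplified single-line rebuild of the same flag logic, not the module's function. `llamaServerLine` is a hypothetical name, the plan object is hand-written for illustration, and the flag spellings are the ones this commit emits. Check `llama-server --help` for your build, since flag syntax can differ between llama.cpp versions.

```javascript
// Sketch only: single-line variant of the llama-server command builder.
function llamaServerLine(plan, modelRef = "<model.gguf>") {
  const ngl = plan.allOnGpu ? 99 : plan.ngl;       // 99 = "offload everything"
  const parts = ["llama-server", "-m", modelRef, "-ngl", String(ngl), "-c", String(plan.targetCtx)];
  if (plan.flashAttn) parts.push("-fa");
  if (plan.cacheType !== "fp16") parts.push("-ctk", plan.cacheType, "-ctv", plan.cacheType);
  if (plan.allOnGpu) parts.push("--no-mmap");      // only added on full offload
  return parts.join(" ");
}

// Hand-written plans (assumed values, shaped like planLaunch output).
const fullPlan = { allOnGpu: true, ngl: 28, targetCtx: 32768, flashAttn: true, cacheType: "q8_0" };
const partialPlan = { allOnGpu: false, ngl: 39, targetCtx: 8192, flashAttn: false, cacheType: "fp16" };

console.log(llamaServerLine(fullPlan));
// llama-server -m <model.gguf> -ngl 99 -c 32768 -fa -ctk q8_0 -ctv q8_0 --no-mmap
console.log(llamaServerLine(partialPlan));
// llama-server -m <model.gguf> -ngl 39 -c 8192
```

Building the command from an array of tokens and joining once avoids the stray or missing spaces that string concatenation of optional fragments tends to produce.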

js/main.js CHANGED

@@ -40,6 +40,7 @@ import {
 } from "./longscore.js";
 import { planExtension, suggestRopeType } from "./yarn_planner.js";
 import { listGgufFiles, fetchGgufMetadata, ggufToConfig, quantFromFilename, analyzeGguf } from "./gguf_bridge.js";
+import { GPU_PRESETS, QUANT_BPW, planLaunch, launchCommands } from "./launch_flags.js";
 // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
 // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
@@ -235,6 +236,7 @@ document.addEventListener("click", (e) => {
       hub: "hub-section",
       yarn: "yarn-section",
       gguf: "gguf-section",
+      launch: "launch-section",
     }[targetMode];
     if (sectionId) {
       const sec = document.getElementById(sectionId);
@@ -259,7 +261,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
      "diagnose-section", "phase-section", "unmask-section",
      "template-section", "arena-section", "contam-section",
      "quant-section", "drift-section", "niah-section",
-     "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "longscore-section", "hub-section", "yarn-section", "gguf-section"].forEach(id => {
+     "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "longscore-section", "hub-section", "yarn-section", "gguf-section", "launch-section"].forEach(id => {
       const el = $(id);
       if (el) el.style.display = "none";
     });
@@ -280,6 +282,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
       hub: "hub-section",
       yarn: "yarn-section",
       gguf: "gguf-section",
+      launch: "launch-section",
     };
     const sectionId = sectionMap[mode];
     if (sectionId) $(sectionId).style.display = "";
@@ -295,6 +298,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
     if (mode === "hub") initHub();
     if (mode === "yarn") initYarn();
     if (mode === "gguf") initGguf();
+    if (mode === "launch") initLaunch();
   });
 });
@@ -4951,6 +4955,127 @@ function renderGgufComparison(cfg, rows) {
     <p class="subtle" style="font-size:0.88em;">${t("gguf.r.note")}</p>`;
 }
+// ════════════════════════════════════════════════════════════════════
+// 🚀 Launch-Flag Generator (v0.9.4)
+// ════════════════════════════════════════════════════════════════════
+let _launchWired = false;
+let _launchGeom = null; // fetched model geometry
+function initLaunch() {
+  if (_launchWired) return;
+  _launchWired = true;
+  // Populate GPU presets.
+  const gpuSel = $("launch-gpu");
+  if (gpuSel && !gpuSel.options.length) {
+    gpuSel.innerHTML = GPU_PRESETS.map(g => `<option value="${g.vram}">${escapeHtml(g.label)}</option>`).join("");
+    gpuSel.value = "24"; // sensible default: 24 GB (option values are VRAM sizes, so this selects the first 24 GB preset, the RTX 3090)
+  }
+  const fetchBtn = $("launch-fetch-btn");
+  const modelEl = $("launch-model");
+  // Picking from autocomplete auto-fetches geometry (matches the other modes).
+  if (modelEl) attachHfAutocomplete(modelEl, { onSelect: () => fetchBtn?.click() });
+  fetchBtn?.addEventListener("click", async () => {
+    const id = (modelEl?.value || "").trim(); // modelEl may be absent; the autocomplete wiring above guards it too
+    if (!id) { $("launch-status").textContent = "⚠ " + t("launch.need_id"); return; }
+    $("launch-status").textContent = "⏳ " + t("launch.fetching");
+    fetchBtn.disabled = true;
+    state.lastModelId = id;
+    try {
+      const cfg = await fetchHfConfig(id);
+      const nAttn = cfg.num_attention_heads ?? null;
+      const rs = (cfg.rope_scaling && typeof cfg.rope_scaling === "object") ? cfg.rope_scaling : {};
+      _launchGeom = {
+        nLayers: cfg.num_hidden_layers ?? null,
+        nKvHeads: cfg.num_key_value_heads ?? nAttn,
+        headDim: cfg.head_dim ?? (cfg.hidden_size && nAttn ? cfg.hidden_size / nAttn : null),
+        hidden: cfg.hidden_size ?? null,
+        vocab: cfg.vocab_size ?? null,
+        intermediate: cfg.intermediate_size ?? null,
+        tieEmbeddings: cfg.tie_word_embeddings ?? false,
+        nParams: cfg.num_parameters ?? null,
+        ropeTheta: cfg.rope_theta ?? 10000,
+        ctxTrain: rs.original_max_position_embeddings ?? cfg.max_position_embeddings ?? null,
+      };
+      if (!$("launch-ctx").value && _launchGeom.ctxTrain) $("launch-ctx").value = _launchGeom.ctxTrain;
+      const via = cfg.__via_mirror ? ` (via ${escapeHtml(cfg.__via_mirror)})` : "";
+      $("launch-status").innerHTML = `✅ <strong>${escapeHtml(id)}</strong>${via}: ${_launchGeom.nLayers} ${t("launch.layers")}, ` +
+        `GQA ${nAttn}:${_launchGeom.nKvHeads}, θ=${_thetaFmt(_launchGeom.ropeTheta)}, ctx ${_yarnFmtK(_launchGeom.ctxTrain)}. ${t("launch.fetched_hint")}`;
+    } catch (err) {
+      $("launch-status").textContent = `❌ ${err.message}`;
+    } finally {
+      fetchBtn.disabled = false;
+    }
+  });
+  $("launch-gen-btn")?.addEventListener("click", () => {
+    if (!_launchGeom) { $("launch-status").textContent = "⚠ " + t("launch.need_fetch"); return; }
+    // A custom VRAM value wins; otherwise fall back to the selected GPU preset.
+    const vram = parseFloat($("launch-vram").value) || parseFloat(gpuSel?.value);
+    const plan = planLaunch({
+      ..._launchGeom,
+      quant: $("launch-quant").value,
+      vramGB: vram,
+      targetCtx: parseFloat($("launch-ctx").value),
+      cacheType: $("launch-cache").value,
+      flashAttn: $("launch-fa").checked,
+    });
+    renderLaunch(plan);
+  });
+}
+function _launchWarnText(w) {
+  switch (w.code) {
+    case "horizon_wasted":   return `${t("launch.warn.horizon_wasted")} (d_horizon ≈ ${_yarnFmtK(w.params.dHoriz)}, L=${_yarnFmtK(w.params.target)})`;
+    case "beyond_trained":   return `${t("launch.warn.beyond_trained")} (${_yarnFmtK(w.params.ctxTrain)} → ${_yarnFmtK(w.params.target)})`;
+    case "no_mmap_blackwell":return t("launch.warn.no_mmap");
+    case "partial_offload":  return `${t("launch.warn.partial")} (${w.params.ngl}/${w.params.nLayers})`;
+    case "cpu_only":         return t("launch.warn.cpu_only");
+    case "no_params":        return t("launch.warn.no_params");
+    default: return w.code;
+  }
+}
+function renderLaunch(p) {
+  const out = $("launch-output");
+  if (!out) return;
+  out.style.display = "";
+  const errMap = { no_geometry: "launch.err.no_geom", no_gpu: "launch.err.no_gpu", no_ctx: "launch.err.no_ctx" };
+  if (errMap[p.verdict]) { out.innerHTML = `<div class="gc-validity-warning">⚠ ${t(errMap[p.verdict])}</div>`; return; }
+  const meta = ({
+    fits:    { emoji: "✅", cls: "v-yes" },
+    partial: { emoji: "⚠️", cls: "v-deg" },
+    too_big: { emoji: "🚨", cls: "v-no"  },
+  })[p.verdict] || { emoji: "❓", cls: "v-deg" };
+  const cmds = launchCommands(p);
+  const td = "padding:3px 12px 3px 0;";
+  const gb = n => (n == null ? "—" : n.toFixed(1) + " GB");
+  const warnHtml = p.warnings.map(w => `<li>${_launchWarnText(w)}</li>`).join("");
+  out.innerHTML = `
+    <p><span class="verdict-badge ${meta.cls}">${meta.emoji} ${t("launch.verdict." + p.verdict)}</span></p>
+    <table style="border-collapse:collapse;font-size:0.95em;margin:0.5em 0;">
+      <tr><td style="${td}">${t("launch.r.weights")}</td><td>${gb(p.weightsGB)} <span class="subtle">(${p.quant}, ${p.bpw} bpw)</span></td></tr>
+      <tr><td style="${td}">${t("launch.r.kv")}</td><td>${gb(p.kvGB)} <span class="subtle">(${p.cacheType}${p.flashAttn ? ", -fa" : ""})</span></td></tr>
+      <tr><td style="${td}">${t("launch.r.overhead")}</td><td>${gb(p.overheadGB)}</td></tr>
+      <tr style="border-top:1px solid var(--border);"><td style="${td}"><strong>${t("launch.r.total")}</strong></td><td><strong>${gb(p.totalGB)}</strong> / ${gb(p.vramGB)} VRAM</td></tr>
+      <tr><td style="${td}">${t("launch.r.ngl")}</td><td><strong>${p.allOnGpu ? `${p.nLayers} (${t("launch.r.all")})` : `${p.ngl} / ${p.nLayers}`}</strong></td></tr>
+    </table>
+    <h3>llama.cpp</h3>
+    <pre class="diag-cmd-box">${escapeHtml(cmds.llamacpp)}</pre>
+    <button id="launch-copy-llama" class="secondary">📋 ${t("launch.copy")}</button>
+    <h3 style="margin-top:0.8em;">Ollama</h3>
+    <pre class="diag-cmd-box">${escapeHtml(cmds.ollama)}</pre>
+    ${warnHtml ? `<ul style="font-size:0.9em;margin-top:0.8em;opacity:0.9;">${warnHtml}</ul>` : ""}
+    <p class="subtle" style="font-size:0.86em;">${t("launch.r.note")}</p>`;
+  $("launch-copy-llama")?.addEventListener("click", async () => {
+    try {
+      await navigator.clipboard.writeText(cmds.llamacpp);
+      $("launch-copy-llama").textContent = "✓ " + t("yarn.copied");
+    } catch (e) { /* Clipboard API unavailable or permission denied: leave the label unchanged. */ }
+  });
+}
 // ════════════════════════════════════════════════════════════════════
 // Bootstrap
 // ════════════════════════════════════════════════════════════════════
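
For orientation, the KV-cache figure that `renderLaunch` shows as `kvGB` can be approximated from the geometry the fetch handler derives. The sketch below is not the code in `js/launch_flags.js`, which this view does not show. `geometryFromConfig`, `kvCacheGiB` and `BYTES_PER_ELEM` are hypothetical names, the result is in GiB (the app may use a different unit), and the quantized sizes assume ggml's block layouts (34 bytes per 32 values for `q8_0`, 18 bytes per 32 values for `q4_0`). The config values in the example are illustrative.

```javascript
// Hypothetical helper mirroring the fallbacks used in the fetch handler:
// missing KV-head count means plain multi-head attention, and a missing
// head_dim is derived from hidden_size / num_attention_heads.
function geometryFromConfig(cfg) {
  const nAttn = cfg.num_attention_heads ?? null;
  return {
    nLayers: cfg.num_hidden_layers ?? null,
    nKvHeads: cfg.num_key_value_heads ?? nAttn,
    headDim: cfg.head_dim ?? (cfg.hidden_size && nAttn ? cfg.hidden_size / nAttn : null),
  };
}

// Assumed bytes per cached element for each cache type.
const BYTES_PER_ELEM = {
  fp16: 2,
  q8_0: 34 / 32, // 32 int8 values + one fp16 scale per block
  q4_0: 18 / 32, // 32 4-bit values + one fp16 scale per block
};

// Approximate KV-cache size: one K and one V tensor per layer,
// each nKvHeads * headDim wide, for every position in the context.
function kvCacheGiB({ nLayers, nKvHeads, headDim }, targetCtx, cacheType = "fp16") {
  const bytesPerElem = BYTES_PER_ELEM[cacheType];
  if (bytesPerElem == null) throw new Error(`unknown cache type: ${cacheType}`);
  const elems = 2 * nLayers * nKvHeads * headDim * targetCtx; // 2 = K + V
  return (elems * bytesPerElem) / 2 ** 30;
}

// Illustrative config: 28 layers, 28 query heads, 4 KV heads, hidden size 3584.
const exampleCfg = {
  num_hidden_layers: 28,
  num_attention_heads: 28,
  num_key_value_heads: 4,
  hidden_size: 3584,
};
const exampleGeom = geometryFromConfig(exampleCfg);
console.log(exampleGeom.headDim);
console.log(kvCacheGiB(exampleGeom, 32768, "fp16").toFixed(2));
console.log(kvCacheGiB(exampleGeom, 32768, "q8_0").toFixed(2));
```

With grouped-query attention only the KV heads count toward the cache, which is why the fetch handler reads `num_key_value_heads` before falling back to the full head count.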

registry-bootstrap/README.md CHANGED

@@ -157,7 +157,7 @@ unless otherwise noted by the contributor. The TAF Agent code itself is
 - 🔬 [TAF Agent web tool](https://karlesmarin.github.io/tafagent) — the diagnostic itself
 - 📦 [TAF Agent source](https://github.com/karlesmarin/tafagent) — open source
-- 📄 [Underlying paper](https://zenodo.org/records/19826343) — Marin 2026,
+- 📄 [Underlying paper](https://zenodo.org/records/20314038) — Marin 2026,
   *Predicting How Transformers Attend*
 ---

test_launch.mjs ADDED

	@@ -0,0 +1,77 @@

+import { chromium } from "playwright";
+const b = await chromium.launch({ headless: true });
+const p = await (await b.newContext()).newPage();
+const errors=[]; const benign=s=>/40\d/.test(s);
+p.on("console",m=>{if(m.type()==="error"&&!benign(m.text()))errors.push("[c]"+m.text());});
+p.on("pageerror",e=>errors.push("[pe]"+e.message));
+const log=s=>process.stdout.write(s+"\n"); let pass=0,fail=0;
+const check=(n,c,x="")=>{log(`${c?"  OK  ":"  FAIL"} ${n} ${x}`);c?pass++:fail++;};
+await p.goto("http://127.0.0.1:8000/index.html",{waitUntil:"domcontentloaded",timeout:90000});
+await p.waitForTimeout(2500);
+await p.click(`.lang-btn[data-lang="en"]`); await p.waitForTimeout(200);
+check("module loads, 0 errors", errors.length===0, `(${errors.length})`);
+await p.click('[data-mode-link="launch"]',{timeout:5000}); await p.waitForTimeout(400);
+check("section visible", await p.evaluate(()=>{const s=document.querySelector("#launch-section");return s&&getComputedStyle(s).display!=="none";}));
+check("GPU presets populated", await p.evaluate(()=>document.querySelector("#launch-gpu").options.length>5));
+log("\n── Fetch geometry ──");
+await p.fill("#launch-model","Qwen/Qwen2.5-7B-Instruct");
+await p.keyboard.press("Escape");
+await p.click("#launch-fetch-btn"); await p.waitForTimeout(3500);
+const st=await p.evaluate(()=>document.querySelector("#launch-status").innerText);
+check("geometry fetched (layers/GQA shown)", /layers|GQA|θ=/.test(st), st.slice(0,70));
+check("ctx auto-filled", await p.evaluate(()=>!!document.querySelector("#launch-ctx").value));
+async function gen({quant,gpu,vram,ctx,cache,fa}){
+  if(quant) await p.selectOption("#launch-quant",quant);
+  if(gpu) await p.selectOption("#launch-gpu",gpu);
+  await p.fill("#launch-vram",vram!=null?String(vram):"");
+  if(ctx!=null) await p.fill("#launch-ctx",String(ctx));
+  if(cache) await p.selectOption("#launch-cache",cache);
+  if(fa!=null){const c=await p.isChecked("#launch-fa"); if(c!==fa) await p.click("#launch-fa");}
+  await p.click("#launch-gen-btn"); await p.waitForTimeout(300);
+  return p.evaluate(()=>{const o=document.querySelector("#launch-output");return{
+    verdict:o.querySelector(".verdict-badge")?.innerText?.trim()||"", text:o.innerText};});
+}
+log("\n── FITS case (7B Q4 on 24GB) ──");
+let r=await gen({quant:"Q4_K_M",gpu:"24",vram:null,ctx:32768,cache:"fp16",fa:true});
+check("verdict FITS", /FITS/.test(r.verdict), r.verdict);
+check("ngl = all layers", /all|28/.test(r.text));
+check("llama-server cmd present", /llama-server/.test(r.text));
+check("ollama cmd present", /ollama|num_ctx/.test(r.text));
+check("--no-mmap added when all-on-GPU", /--no-mmap/.test(r.text));
+check("-fa present", /-fa/.test(r.text));
+check("VRAM breakdown (weights/KV)", /Weights|KV cache/.test(r.text));
+log("\n── PARTIAL case (7B Q4 on tiny 3GB custom) ──");
+r=await gen({quant:"Q4_K_M",vram:3,ctx:8192,fa:true});
+check("verdict PARTIAL or TOO BIG", /PARTIAL|TOO BIG/.test(r.verdict), r.verdict);
+check("partial offload warning or cpu-only", /CPU|layers fit|smaller quant/i.test(r.text));
+log("\n── cache quant changes KV flag ──");
+r=await gen({quant:"Q4_K_M",gpu:"24",vram:null,ctx:32768,cache:"q8_0",fa:true});
+check("KV cache q8_0 → -ctk/-ctv in cmd", /-ctk q8_0/.test(r.text));
+log("\n── beyond-trained warning ──");
+r=await gen({quant:"Q4_K_M",gpu:"80",vram:null,ctx:262144,cache:"fp16",fa:true});
+check("L beyond trained → warning", /trained|RoPE|YaRN/i.test(r.text), "L=256K");
+log("\n── error: generate before fetch (fresh) ──");
+// Skipped: geometry is already fetched at this point and cannot be cleared without
+// reloading the page, so the generate-before-fetch error path is not exercised here.
+log("\n── 4 languages ──");
+for(const lang of ["es","fr","zh","en"]){
+  await p.click(`.lang-btn[data-lang="${lang}"]`); await p.waitForTimeout(250);
+  const lbl=await p.evaluate(()=>document.querySelector('.mode-btn[data-mode="launch"]')?.textContent?.trim());
+  check(`${lang}: tab label`, lbl&&lbl.length>3, lbl);
+}
+check("copy button present", await p.evaluate(()=>!!document.querySelector("#launch-copy-llama")));
+log(`\n=== ${pass} passed, ${fail} failed · JS errors: ${errors.length} ===`);
+errors.slice(0,10).forEach(e=>log(e));
+await b.close();
+process.exit(fail>0?1:0);
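
The `benign` filter in this test ignores any console error whose text contains `40` followed by a digit, so it would also hide unrelated errors that happen to include such digits (a line number, for instance). A narrower variant is sketched here. It assumes Chromium's wording for failed resource loads ("Failed to load resource: the server responded with a status of 404 (Not Found)"); `isBenignConsoleError` is a hypothetical name, and the pattern should be checked against the messages the page actually emits.

```javascript
// Hypothetical stricter filter: ignore only 40x resource-load failures.
// Assumes Chromium's "Failed to load resource: ... status of 40x" wording.
const BENIGN_RE = /Failed to load resource: .*\bstatus of 40\d\b/;
const isBenignConsoleError = (text) => BENIGN_RE.test(text);

// Sample messages: the first is a missing-asset 404, the second a real error
// that the original /40\d/ pattern would also have swallowed.
const samples = [
  "Failed to load resource: the server responded with a status of 404 (Not Found)",
  "TypeError: plan is undefined at main.js:402",
];
for (const s of samples) {
  console.log(isBenignConsoleError(s) ? "ignored " : "reported", s);
}
```

Dropping this in would mean replacing `benign(m.text())` with `isBenignConsoleError(m.text())` in the `console` listener; the rest of the test is unaffected.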