karlexmarin Claude Opus 4.7 (1M context) committed on
Commit 12e81e6 · 1 Parent(s): 22784b8

v0.9.4: Launch-Flag Generator mode + Zenodo record update

Launch-Flag Generator: model + GPU + context → the exact llama.cpp / Ollama
launch command. It answers the question the VRAM calculators skip (they say
"fits", not "here's the command").

- js/launch_flags.js: VRAM model. Weights come from bits/param times an exact
decoder param count (attention + SwiGLU + embeddings, with GQA) rather than the
12·h² shortcut, which undercounts large-FFN models like Qwen2.5-7B; KV cache
from head geometry; coarse scratch estimate. Computes the -ngl layer offload,
the fit verdict, and the TAF horizon check, which warns when the target context
is past d_horizon (KV memory wasted). launchCommands() emits llama-server +
Ollama snippets with -c, -fa, -ctk/-ctv, and --no-mmap (Blackwell OOM fix).
- index.html: tab + tile + #launch-section (GPU presets, quant, cache, FA) +
help v0.9.4. main.js: import, wiring, autocomplete auto-fetch, render.
- i18n.js: full EN/ES/FR/ZH.
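
A minimal sketch of the VRAM model named in the js/launch_flags.js bullet, to make the arithmetic concrete. It is not the code from this commit: the function names (decoderParams, shortcutParams, kvCacheBytes, layersThatFit), the bits-per-weight table, the 1 GiB scratch default, and the offload rule are all illustrative assumptions. The Qwen2.5-7B geometry is what that model's public config.json lists, as far as I can tell.

```javascript
// Illustrative sketch only; none of these names come from js/launch_flags.js.

// Approximate bits per weight for a few GGUF quants (assumed figures).
const BITS_PER_PARAM = { F16: 16, Q8_0: 8.5, Q4_K_M: 4.85, Q4_0: 4.5 };

// Geometry assumed from Qwen/Qwen2.5-7B-Instruct's config.json.
const qwen25_7b = { layers: 28, hidden: 3584, heads: 28, kvHeads: 4, ffn: 18944, vocab: 152064 };

// Decoder parameter count: attention with GQA + SwiGLU MLP + embeddings.
// Norms and biases are ignored; the output head is assumed untied.
function decoderParams(g) {
  const headDim = g.hidden / g.heads;
  const kvDim = g.kvHeads * headDim;
  const attn = 2 * g.hidden * g.hidden + 2 * g.hidden * kvDim; // Q, O full; K, V shrunk by GQA
  const mlp = 3 * g.hidden * g.ffn;                            // gate, up, down projections
  const embed = 2 * g.vocab * g.hidden;                        // input table + output head
  return g.layers * (attn + mlp) + embed;
}

// The 12·h² per-layer shortcut, for comparison.
function shortcutParams(g) {
  return 12 * g.hidden * g.hidden * g.layers;
}

// KV cache: one K and one V tensor per layer, kvHeads * headDim wide, one row per token.
function kvCacheBytes(g, ctx, bytesPerElem = 2) {
  const headDim = g.hidden / g.heads;
  return 2 * g.layers * g.kvHeads * headDim * ctx * bytesPerElem;
}

// Assumed offload rule: reserve KV + scratch, then fill the rest with equal-size layers.
function layersThatFit(g, quant, ctx, vramBytes, scratchBytes = 2 ** 30) {
  const weightBytes = decoderParams(g) * BITS_PER_PARAM[quant] / 8;
  const perLayer = weightBytes / g.layers; // crude: embeddings are spread across layers
  const budget = vramBytes - kvCacheBytes(g, ctx) - scratchBytes;
  return Math.max(0, Math.min(g.layers, Math.floor(budget / perLayer)));
}

console.log(decoderParams(qwen25_7b));  // 7615283200
console.log(shortcutParams(qwen25_7b)); // 4315938816
```

For this geometry the exact count comes to about 7.6 B parameters against about 4.3 B from the shortcut, which is the undercount the bullet refers to. The offload rule here is deliberately crude; a real load should still be verified with headroom.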

Also: updated the paper Zenodo link 19826343 → 20314038 across the app
(index.html, i18n.js 4 langs) and tracked docs/README citations.

Test (test_launch.mjs): 21/21 — fetch geometry, FITS/PARTIAL verdicts,
--no-mmap on full offload, -ctk on cache quant, beyond-trained warning, 4
languages. 25 modes total, 0 JS errors.
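
The flag rules these tests exercise can be sketched as follows. This is a hypothetical stand-in, not the launchCommands() added in this commit: the function names, argument shape, warning keys, and the plain `>` thresholds are assumptions. The flags themselves (-m, -ngl, -c, -fa, -ctk, -ctv, --no-mmap) are llama-server options; some llama.cpp builds expect a value after -fa, so check `llama-server --help` for your build.

```javascript
// Hypothetical stand-in for the flag logic; not the commit's launchCommands().
function llamaServerCommand({ modelPath, ngl, totalLayers, ctx, flashAttn, cacheType }) {
  const args = ["llama-server", "-m", modelPath, "-ngl", String(ngl), "-c", String(ctx)];
  if (flashAttn) args.push("-fa");
  // Only pass cache-type flags when the KV cache is quantized (llama.cpp spells fp16 as f16).
  if (cacheType !== "f16") args.push("-ctk", cacheType, "-ctv", cacheType);
  // Full offload: add --no-mmap, as the commit message describes.
  if (ngl >= totalLayers) args.push("--no-mmap");
  return args.join(" ");
}

// Assumed warning rules; dHorizon is taken as an input, not computed here.
function launchWarnings({ ctx, trainedCtx, dHorizon }) {
  const warnings = [];
  if (ctx > dHorizon) warnings.push("horizon_wasted");
  if (ctx > trainedCtx) warnings.push("beyond_trained");
  return warnings;
}

console.log(llamaServerCommand({
  modelPath: "model.gguf", ngl: 28, totalLayers: 28,
  ctx: 32768, flashAttn: true, cacheType: "q8_0",
}));
// llama-server -m model.gguf -ngl 28 -c 32768 -fa -ctk q8_0 -ctv q8_0 --no-mmap
```

Building the command as an argument array and joining at the end keeps each rule a single conditional, which makes per-flag assertions like the ones listed above straightforward.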

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README.md CHANGED
@@ -46,7 +46,7 @@ language:
 
  **🌐 Live**: https://karlesmarin.github.io/tafagent · HF Space: https://huggingface.co/spaces/karlexmarin/taf-agent
  **📦 Source**: https://github.com/karlesmarin/tafagent · Lean repo: https://github.com/karlesmarin/lean-taf
- **📄 Paper**: [Predicting How Transformers Attend — Marin 2026](https://zenodo.org/records/19826343)
+ **📄 Paper**: [Predicting How Transformers Attend — Marin 2026](https://zenodo.org/records/20314038)
  **🗂️ Dataset**: [taf-attention-decay (58 measurements, 32 models)](https://huggingface.co/datasets/karlexmarin/taf-attention-decay)
 
  ---
@@ -413,7 +413,7 @@ If this tool helps you — paper or code:
  Analytic Power-Law Theory, Phase Transitions, and Practical Compression
  Tools},
  year = {2026},
- url = {https://zenodo.org/records/19826343},
+ url = {https://zenodo.org/records/20314038},
  }
 
  @misc{marin2026tafagent,
docs/hf-post-v053-fix.md CHANGED
@@ -156,5 +156,5 @@ If you spot anything else wrong — please open an issue.
  **Links**:
  - Live: https://huggingface.co/spaces/karlexmarin/taf-agent
  - Source: https://github.com/karlesmarin/tafagent
- - Paper: https://zenodo.org/records/19826343
+ - Paper: https://zenodo.org/records/20314038
  - Dataset: https://huggingface.co/datasets/karlexmarin/taf-attention-decay
hf-post-announcement.md CHANGED
@@ -5,7 +5,7 @@ No server, no auth, no cost. Runs entirely in your browser.
 
  🌐 **Try it**: https://huggingface.co/spaces/karlexmarin/taf-agent
  📦 **Source**: https://github.com/karlesmarin/tafagent
- 📄 **Paper**: [Predicting How Transformers Attend](https://zenodo.org/records/19826343)
+ 📄 **Paper**: [Predicting How Transformers Attend](https://zenodo.org/records/20314038)
 
  ## What it answers
 
hf-space-readme.md CHANGED
@@ -66,7 +66,7 @@ Predicts practical viability of any transformer LLM from its config alone:
 
  ## Underlying paper
 
- [Marin 2026 — Predicting How Transformers Attend](https://zenodo.org/records/19826343)
+ [Marin 2026 — Predicting How Transformers Attend](https://zenodo.org/records/20314038)
 
  ## Source
 
index.html CHANGED
@@ -249,6 +249,9 @@
249
  <p><strong data-i18n="help.v091.gguf.title">🧊 GGUF Validity Bridge</strong></p>
250
  <p data-i18n="help.v091.gguf.body">The dozen GGUF/VRAM calculators (NyxKrage, oobabooga, …) read a <code>.gguf</code> header to tell you if a quant <em>fits in your GPU</em>. This reads the same header — via HTTP Range, so no multi-GB download — and answers the question they skip: <em>does it fit AND still work?</em> Paste a GGUF repo, pick a quant file; the bridge pulls <code>rope_theta</code>, <code>context_length</code>, the quant scheme (from <code>general.file_type</code> or the filename), and head geometry, then runs TAF's γ_Padé / d_horizon plus the architecture-aware quant-regime γ-shift. Output: effective attention horizon at the trained context, how far the quant erodes γ (and ΔPPL) for <em>this</em> model, and a verdict — HEALTHY / USABLE-WITH-CARE / DEGRADES. <em>Use case</em>: 'unsloth/Qwen3.5-9B-GGUF Q4_K_M fits 8GB — but is it brain-dead past 30K?' → see the horizon and the Q4 γ-penalty before you download 6 GB.</p>
251
 
252
  <h3 data-i18n="help.audit.title">The audit chain</h3>
253
  <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
254
  output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
@@ -282,7 +285,7 @@
282
 
283
  <h3 data-i18n="help.source.title">Source &amp; paper</h3>
284
  <p data-i18n="help.source.body">Source code: <a href="https://github.com/karlesmarin/tafagent" target="_blank">github.com/karlesmarin/tafagent</a><br>
285
- Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href="https://zenodo.org/records/19826343" target="_blank">Zenodo</a>; arXiv forthcoming)<br>
286
  Dataset: <a href="https://huggingface.co/datasets/karlexmarin/taf-attention-decay" target="_blank">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)</p>
287
  </div>
288
  </div>
@@ -412,6 +415,7 @@
412
  <button data-mode-link="quant" data-i18n="modes.quant">⚖️ Quant</button>
413
  <button data-mode-link="yarn" data-i18n="modes.yarn">🧵 YaRN Planner</button>
414
  <button data-mode-link="gguf" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
415
  <button data-mode-link="inspector" data-i18n="modes.inspector">🔍 Inspect config</button>
416
  </div>
417
  </div>
@@ -508,6 +512,7 @@
508
  <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
509
  <button class="mode-btn" data-mode="yarn" role="tab" aria-selected="false" data-i18n="modes.yarn">🧵 YaRN Planner</button>
510
  <button class="mode-btn" data-mode="gguf" role="tab" aria-selected="false" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
511
  </div>
512
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
513
  <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
@@ -1333,6 +1338,69 @@
1333
  <div id="gguf-output" style="display:none; margin-top:1em;"></div>
1334
  </section>
1335
 
1336
  <!-- Recipe selector (mode=recipe) -->
1337
  <section id="recipe-section" style="display:none;">
1338
  <h2 data-i18n="recipe.title">📋 Recipe</h2>
 
249
  <p><strong data-i18n="help.v091.gguf.title">🧊 GGUF Validity Bridge</strong></p>
250
  <p data-i18n="help.v091.gguf.body">The dozen GGUF/VRAM calculators (NyxKrage, oobabooga, …) read a <code>.gguf</code> header to tell you if a quant <em>fits in your GPU</em>. This reads the same header — via HTTP Range, so no multi-GB download — and answers the question they skip: <em>does it fit AND still work?</em> Paste a GGUF repo, pick a quant file; the bridge pulls <code>rope_theta</code>, <code>context_length</code>, the quant scheme (from <code>general.file_type</code> or the filename), and head geometry, then runs TAF's γ_Padé / d_horizon plus the architecture-aware quant-regime γ-shift. Output: effective attention horizon at the trained context, how far the quant erodes γ (and ΔPPL) for <em>this</em> model, and a verdict — HEALTHY / USABLE-WITH-CARE / DEGRADES. <em>Use case</em>: 'unsloth/Qwen3.5-9B-GGUF Q4_K_M fits 8GB — but is it brain-dead past 30K?' → see the horizon and the Q4 γ-penalty before you download 6 GB.</p>
251
 
252
+ <p><strong data-i18n="help.v094.launch.title">🚀 Launch-Flag Generator</strong></p>
253
+ <p data-i18n="help.v094.launch.body">The VRAM calculators tell you <em>whether</em> a model fits; they don't hand you the command. This does. Pick a model (fetches geometry from HF <code>config.json</code>), a quant, a GPU and a target context — it computes the VRAM breakdown (weights + KV cache + scratch), how many layers to offload (<code>-ngl</code>), and emits the copy-paste <code>llama-server</code> and Ollama commands with <code>-c</code> context, <code>-fa</code> flash-attention, KV-cache type, and <code>--no-mmap</code> (the Blackwell OOM fix: force all weights into physical VRAM). Plus the TAF reality check no calculator gives: if you're allocating KV for a context past the model's d_horizon, it warns you that memory is wasted — the attention won't reach there. <em>Use case</em>: 'What <code>-ngl</code> for Llama-70B-Q4 on my 4090?' → 39 of 80 layers, exact command, and a note if your context is past the usable horizon.</p>
254
+
255
  <h3 data-i18n="help.audit.title">The audit chain</h3>
256
  <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
257
  output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
 
285
 
286
  <h3 data-i18n="help.source.title">Source &amp; paper</h3>
287
  <p data-i18n="help.source.body">Source code: <a href="https://github.com/karlesmarin/tafagent" target="_blank">github.com/karlesmarin/tafagent</a><br>
288
+ Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href="https://zenodo.org/records/20314038" target="_blank">Zenodo</a>; arXiv forthcoming)<br>
289
  Dataset: <a href="https://huggingface.co/datasets/karlexmarin/taf-attention-decay" target="_blank">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)</p>
290
  </div>
291
  </div>
 
415
  <button data-mode-link="quant" data-i18n="modes.quant">⚖️ Quant</button>
416
  <button data-mode-link="yarn" data-i18n="modes.yarn">🧵 YaRN Planner</button>
417
  <button data-mode-link="gguf" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
418
+ <button data-mode-link="launch" data-i18n="modes.launch">🚀 Launch Flags</button>
419
  <button data-mode-link="inspector" data-i18n="modes.inspector">🔍 Inspect config</button>
420
  </div>
421
  </div>
 
512
  <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
513
  <button class="mode-btn" data-mode="yarn" role="tab" aria-selected="false" data-i18n="modes.yarn">🧵 YaRN Planner</button>
514
  <button class="mode-btn" data-mode="gguf" role="tab" aria-selected="false" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
515
+ <button class="mode-btn" data-mode="launch" role="tab" aria-selected="false" data-i18n="modes.launch">🚀 Launch Flags</button>
516
  </div>
517
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
518
  <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
 
1338
  <div id="gguf-output" style="display:none; margin-top:1em;"></div>
1339
  </section>
1340
 
1341
+ <!-- Launch-flag generator (mode=launch) -->
1342
+ <section id="launch-section" style="display:none;">
1343
+ <h2><span data-i18n="launch.title">🚀 Launch-Flag Generator</span>
1344
+ <span class="info"><span class="tooltip" data-i18n="launch.tip">
1345
+ <strong>Exact flags + why, not just "fits"</strong>. The VRAM calculators tell you whether a
1346
+ model fits. This gives you the copy-paste <code>llama.cpp</code> / <code>Ollama</code> command —
1347
+ <code>-ngl</code> layers to offload, <code>-c</code> context, <code>--no-mmap</code>,
1348
+ KV-cache type — AND the TAF reality check: if you allocate KV for 128K but the model's
1349
+ attention horizon is 32K, that VRAM is wasted.
1350
+ </span></span>
1351
+ </h2>
1352
+ <p class="recipe-desc" data-i18n="launch.desc">
1353
+ Pick a model, GPU and target context → get the exact launch command, a VRAM breakdown
1354
+ (weights + KV cache + overhead), and how many layers to offload. Solves the recurring
1355
+ "what <code>-ngl</code> do I use?" / Blackwell OOM guesswork.
1356
+ </p>
1357
+
1358
+ <div class="form-row">
1359
+ <label for="launch-model" data-i18n="launch.model_label">HF model id:</label>
1360
+ <input type="text" id="launch-model" placeholder="Qwen/Qwen2.5-7B-Instruct">
1361
+ <button id="launch-fetch-btn" class="secondary" data-i18n="launch.fetch_btn">📥 Fetch geometry</button>
1362
+ </div>
1363
+ <span id="launch-status" class="subtle"></span>
1364
+
1365
+ <div class="form-row">
1366
+ <label for="launch-quant" data-i18n="launch.quant_label">Quant:</label>
1367
+ <select id="launch-quant">
1368
+ <option value="Q4_K_M">Q4_K_M (4-bit, sweet spot)</option>
1369
+ <option value="Q8_0">Q8_0 (8-bit)</option>
1370
+ <option value="Q6_K">Q6_K</option>
1371
+ <option value="Q5_K_M">Q5_K_M</option>
1372
+ <option value="Q4_0">Q4_0</option>
1373
+ <option value="Q3_K_M">Q3_K_M</option>
1374
+ <option value="Q2_K">Q2_K (extreme)</option>
1375
+ <option value="F16">F16 (full)</option>
1376
+ </select>
1377
+ </div>
1378
+ <div class="form-row">
1379
+ <label for="launch-gpu" data-i18n="launch.gpu_label">GPU:</label>
1380
+ <select id="launch-gpu"></select>
1381
+ <input type="number" id="launch-vram" placeholder="or custom VRAM (GB)" min="1" style="width:11em;">
1382
+ </div>
1383
+ <div class="form-row">
1384
+ <label for="launch-ctx" data-i18n="launch.ctx_label">Target context L:</label>
1385
+ <input type="number" id="launch-ctx" placeholder="32768" min="256">
1386
+ </div>
1387
+ <div class="form-row">
1388
+ <label data-i18n="launch.adv_label">Advanced:</label>
1389
+ <span>
1390
+ <label data-i18n="launch.cache_label">KV cache:</label>
1391
+ <select id="launch-cache">
1392
+ <option value="fp16">fp16</option>
1393
+ <option value="q8_0">q8_0 (½ KV)</option>
1394
+ <option value="q4_0">q4_0 (¼ KV)</option>
1395
+ </select>
1396
+ &nbsp;
1397
+ <label><input type="checkbox" id="launch-fa" checked> <span data-i18n="launch.fa_label">Flash attention (-fa)</span></label>
1398
+ </span>
1399
+ </div>
1400
+ <button id="launch-gen-btn" data-i18n="launch.gen_btn">🚀 Generate flags</button>
1401
+ <div id="launch-output" style="display:none; margin-top:1em;"></div>
1402
+ </section>
1403
+
1404
  <!-- Recipe selector (mode=recipe) -->
1405
  <section id="recipe-section" style="display:none;">
1406
  <h2 data-i18n="recipe.title">📋 Recipe</h2>
js/i18n.js CHANGED
@@ -429,6 +429,47 @@ export const TRANSLATIONS = {
429
  "mode_desc.yarn": "Generate the exact rope_scaling config to extend a model past its trained context — plus a TAF verdict on whether attention quality actually holds at the target length.",
430
  "modes.gguf": "🧊 GGUF Bridge",
431
  "mode_desc.gguf": "Read a GGUF file's metadata header (rope_theta, context_length, quant) in your browser and get a TAF quality verdict — the question the VRAM calculators skip: fits AND works?",
432
  "gguf.title": "🧊 GGUF Validity Bridge",
433
  "gguf.tip": "<strong>Fits in VRAM ≠ works</strong>. The GGUF/VRAM calculators read a model's metadata to tell you if a quant <em>fits in your GPU</em>. This reads the SAME metadata (rope_theta, context_length, quant scheme, head geometry) straight from the <code>.gguf</code> header via HTTP Range — no multi-GB download — and answers the question they don't: does attention quality actually hold, and how much does the quant erode it (γ-shift, ΔPPL)?",
434
  "gguf.desc": "Paste a GGUF repo (e.g. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), pick a quant file, and get a TAF quality verdict: the model's effective attention horizon, plus how much the chosen quantization shifts γ for <em>this specific architecture</em>. Reads only the file header in your browser.",
@@ -1059,7 +1100,7 @@ export const TRANSLATIONS = {
1059
  "help.privacy.title": "Privacy",
1060
  "help.privacy.body": "Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.",
1061
  "help.source.title": "Source & paper",
1062
- "help.source.body": "Source code: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/19826343\" target=\"_blank\">Zenodo</a>; arXiv forthcoming)<br>Dataset: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)",
1063
 
1064
  "footer.text": "© 2026 Carles Marin · Apache-2.0 · independent research · the tool that closes the loop of the paper.",
1065
 
@@ -1778,6 +1819,47 @@ export const TRANSLATIONS = {
1778
  "mode_desc.yarn": "Genera la configuración rope_scaling exacta para extender un modelo más allá de su contexto entrenado — más un veredicto TAF sobre si la calidad de atención aguanta realmente a la longitud objetivo.",
1779
  "modes.gguf": "🧊 Puente GGUF",
1780
  "mode_desc.gguf": "Lee la cabecera de metadata de un archivo GGUF (rope_theta, context_length, quant) en tu navegador y obtén un veredicto de calidad TAF — la pregunta que los calculadores de VRAM ignoran: ¿cabe Y funciona?",
1781
  "gguf.title": "🧊 Puente de validez GGUF",
1782
  "gguf.tip": "<strong>Caber en VRAM ≠ funcionar</strong>. Los calculadores GGUF/VRAM leen la metadata de un modelo para decirte si un quant <em>cabe en tu GPU</em>. Esto lee la MISMA metadata (rope_theta, context_length, esquema de quant, geometría de cabezas) directamente de la cabecera <code>.gguf</code> vía HTTP Range — sin descargar GB — y responde lo que ellos no: ¿aguanta de verdad la calidad de atención, y cuánto la erosiona el quant (γ-shift, ΔPPL)?",
1783
  "gguf.desc": "Pega un repo GGUF (p.ej. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), elige un archivo de quant, y obtén un veredicto de calidad TAF: el horizonte de atención efectivo del modelo, más cuánto desplaza γ la cuantización elegida para <em>esta arquitectura concreta</em>. Solo lee la cabecera del archivo en tu navegador.",
@@ -2408,7 +2490,7 @@ export const TRANSLATIONS = {
2408
  "help.privacy.title": "Privacidad",
2409
  "help.privacy.body": "Todo corre en tu navegador. Sin telemetría, sin analytics, sin datos enviados a ningún sitio. Incluso el modelo LLM corre localmente vía WebGPU/WebAssembly. Tus model_ids y preguntas nunca abandonan esta página.",
2410
  "help.source.title": "Código fuente y paper",
2411
- "help.source.body": "Código: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/19826343\" target=\"_blank\">Zenodo</a>; arXiv próximamente)<br>Dataset: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 mediciones γ sobre 32 modelos (CC-BY-4.0)",
2412
 
2413
  "footer.text": "© 2026 Carles Marin · Apache-2.0 · investigación independiente · la herramienta que cierra el círculo del paper.",
2414
  },
@@ -2981,6 +3063,47 @@ export const TRANSLATIONS = {
2981
  "mode_desc.yarn": "Génère la configuration rope_scaling exacte pour étendre un modèle au-delà de son contexte d'entraînement — plus un verdict TAF sur la tenue réelle de la qualité d'attention à la longueur cible.",
2982
  "modes.gguf": "🧊 Pont GGUF",
2983
  "mode_desc.gguf": "Lit l'en-tête de métadonnées d'un fichier GGUF (rope_theta, context_length, quant) dans votre navigateur et donne un verdict de qualité TAF — la question que les calculateurs de VRAM ignorent : tient ET fonctionne ?",
2984
  "gguf.title": "🧊 Pont de validité GGUF",
2985
  "gguf.tip": "<strong>Tenir dans la VRAM ≠ fonctionner</strong>. Les calculateurs GGUF/VRAM lisent les métadonnées d'un modèle pour dire si un quant <em>tient dans le GPU</em>. Ceci lit les MÊMES métadonnées (rope_theta, context_length, schéma de quant, géométrie des têtes) directement depuis l'en-tête <code>.gguf</code> via HTTP Range — sans télécharger des Go — et répond à ce qu'ils n'abordent pas : la qualité d'attention tient-elle vraiment, et de combien le quant l'érode-t-il (γ-shift, ΔPPL) ?",
2986
  "gguf.desc": "Collez un dépôt GGUF (ex. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), choisissez un fichier de quant, et obtenez un verdict de qualité TAF : l'horizon d'attention effectif du modèle, plus de combien la quantification choisie décale γ pour <em>cette architecture précise</em>. Ne lit que l'en-tête du fichier dans votre navigateur.",
@@ -3611,7 +3734,7 @@ export const TRANSLATIONS = {
3611
  "help.privacy.title": "Confidentialité",
3612
  "help.privacy.body": "Tout s'exécute dans votre navigateur. Pas de télémétrie, pas d'analytique, pas de données envoyées ailleurs. Même le modèle LLM s'exécute localement via WebGPU/WebAssembly. Vos model_ids et questions ne quittent jamais cette page.",
3613
  "help.source.title": "Code source et paper",
3614
- "help.source.body": "Code : <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper : <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/19826343\" target=\"_blank\">Zenodo</a> ; arXiv à venir)<br>Dataset : <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 mesures γ sur 32 modèles (CC-BY-4.0)",
3615
 
3616
  "footer.text": "© 2026 Carles Marin · Apache-2.0 · recherche indépendante · l'outil qui ferme la boucle du paper.",
3617
  },
@@ -4184,6 +4307,47 @@ export const TRANSLATIONS = {
4184
  "mode_desc.yarn": "生成精确的 rope_scaling 配置以将模型扩展到训练上下文之外 —— 外加 TAF 裁决:在目标长度下注意力质量是否真的撑得住。",
4185
  "modes.gguf": "🧊 GGUF 桥",
4186
  "mode_desc.gguf": "在浏览器内读取 GGUF 文件的元数据头(rope_theta、context_length、量化),给出 TAF 质量裁决 —— 显存计算器跳过的那个问题:塞得进且还能用吗?",
4187
  "gguf.title": "🧊 GGUF 有效性桥",
4188
  "gguf.tip": "<strong>塞进显存 ≠ 能用</strong>。GGUF/显存计算器读取模型元数据来告诉你某量化<em>是否塞得进 GPU</em>。本工具通过 HTTP Range 直接从 <code>.gguf</code> 头读取同样的元数据(rope_theta、context_length、量化方案、注意力头几何)—— 无需下载数 GB —— 并回答它们不答的:注意力质量是否真的撑得住,量化又侵蚀了多少(γ-shift、ΔPPL)?",
4189
  "gguf.desc": "粘贴一个 GGUF 仓库(如 <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>),选择一个量化文件,获得 TAF 质量裁决:模型的有效注意力视界,以及所选量化对<em>这个具体架构</em>的 γ 位移有多大。只在浏览器内读取文件头。",
@@ -4814,7 +4978,7 @@ export const TRANSLATIONS = {
4814
  "help.privacy.title": "隐私",
4815
  "help.privacy.body": "一切都在您的浏览器中运行。无遥测,无分析,无数据发送到任何地方。即使是 LLM 模型也通过 WebGPU/WebAssembly 在本地运行。您的 model_ids 和问题永不离开此页面。",
4816
  "help.source.title": "源代码和论文",
4817
- "help.source.body": "源代码: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>论文: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/19826343\" target=\"_blank\">Zenodo</a>; arXiv 即将)<br>数据集: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 32个模型上的58次γ测量 (CC-BY-4.0)",
4818
 
4819
  "footer.text": "© 2026 Carles Marin · Apache-2.0 · 独立研究 · 闭合论文回路的工具。",
4820
  },
 
429
  "mode_desc.yarn": "Generate the exact rope_scaling config to extend a model past its trained context — plus a TAF verdict on whether attention quality actually holds at the target length.",
430
  "modes.gguf": "🧊 GGUF Bridge",
431
  "mode_desc.gguf": "Read a GGUF file's metadata header (rope_theta, context_length, quant) in your browser and get a TAF quality verdict — the question the VRAM calculators skip: fits AND works?",
432
+ "modes.launch": "🚀 Launch Flags",
433
+ "mode_desc.launch": "Model + GPU + context → the exact llama.cpp / Ollama launch command (-ngl, -c, --no-mmap, KV-cache type) with a VRAM breakdown and a TAF warning when your context is past the usable horizon.",
434
+ "launch.title": "🚀 Launch-Flag Generator",
435
+ "launch.tip": "<strong>Exact flags + why, not just \"fits\"</strong>. The VRAM calculators tell you whether a model fits. This gives you the copy-paste <code>llama.cpp</code> / <code>Ollama</code> command — <code>-ngl</code> layers to offload, <code>-c</code> context, <code>--no-mmap</code>, KV-cache type — AND the TAF reality check: if you allocate KV for 128K but the model's attention horizon is 32K, that VRAM is wasted.",
436
+ "launch.desc": "Pick a model, GPU and target context → get the exact launch command, a VRAM breakdown (weights + KV cache + overhead), and how many layers to offload. Solves the recurring \"what <code>-ngl</code> do I use?\" / Blackwell OOM guesswork.",
437
+ "launch.model_label": "HF model id:",
438
+ "launch.fetch_btn": "📥 Fetch geometry",
439
+ "launch.quant_label": "Quant:",
440
+ "launch.gpu_label": "GPU:",
441
+ "launch.ctx_label": "Target context L:",
442
+ "launch.adv_label": "Advanced:",
443
+ "launch.cache_label": "KV cache:",
444
+ "launch.fa_label": "Flash attention (-fa)",
445
+ "launch.gen_btn": "🚀 Generate flags",
446
+ "launch.need_id": "Enter a model id like 'Qwen/Qwen2.5-7B-Instruct'",
447
+ "launch.fetching": "Fetching config.json from HF Hub…",
448
+ "launch.layers": "layers",
449
+ "launch.fetched_hint": "Pick GPU + context, then Generate flags.",
450
+ "launch.need_fetch": "Fetch a model first (📥 Fetch geometry).",
451
+ "launch.verdict.fits": "FITS — fully on GPU",
452
+ "launch.verdict.partial": "PARTIAL — some layers on CPU (slower)",
453
+ "launch.verdict.too_big": "TOO BIG — won't fit any layers on this GPU",
454
+ "launch.r.weights": "Weights",
455
+ "launch.r.kv": "KV cache",
456
+ "launch.r.overhead": "Overhead / scratch",
457
+ "launch.r.total": "Total",
458
+ "launch.r.ngl": "Layers to offload (-ngl)",
459
+ "launch.r.all": "all",
460
+ "launch.r.note": "VRAM is an estimate (weights from bits/param, KV from head geometry, scratch coarse). d_horizon from γ_Padé. Verify the fit with a real load — leave ~1 GB headroom.",
461
+ "launch.warn.horizon_wasted": "Target context is well past the model's attention horizon — KV memory for context beyond it is wasted; the model won't attend there. (TAF)",
462
+ "launch.warn.beyond_trained": "L exceeds the trained context — you also need RoPE scaling to position-encode that far (see the YaRN Planner).",
463
+ "launch.warn.no_mmap": "All layers fit → added --no-mmap to force weights into physical VRAM (avoids the Blackwell illegal-memory / OOM-at-load issue).",
464
+ "launch.warn.partial": "Only some layers fit on GPU — the rest run on CPU (much slower). Drop to a smaller quant or shorter context to fit fully.",
465
+ "launch.warn.cpu_only": "Won't fit any layers at these settings — CPU only. Use a smaller quant/context or a bigger GPU.",
466
+ "launch.warn.no_params": "Couldn't read parameter count — weights size is a rough estimate from geometry.",
467
+ "launch.err.no_geom": "Fetch a model first to read its geometry.",
468
+ "launch.err.no_gpu": "Pick a GPU or enter a custom VRAM size.",
469
+ "launch.err.no_ctx": "Enter a target context length L.",
470
+ "launch.copy": "Copy command",
471
+ "help.v094.launch.title": "🚀 Launch-Flag Generator",
472
+ "help.v094.launch.body": "The VRAM calculators tell you <em>whether</em> a model fits; they don't hand you the command. This does. Pick a model (fetches geometry from HF <code>config.json</code>), a quant, a GPU and a target context — it computes the VRAM breakdown (weights + KV cache + scratch), how many layers to offload (<code>-ngl</code>), and emits the copy-paste <code>llama-server</code> and Ollama commands with <code>-c</code> context, <code>-fa</code> flash-attention, KV-cache type, and <code>--no-mmap</code> (the Blackwell OOM fix). Plus the TAF reality check no calculator gives: if you're allocating KV for a context past the model's d_horizon, it warns you that memory is wasted. <em>Use case</em>: 'What <code>-ngl</code> for Llama-70B-Q4 on my 4090?' → 39 of 80 layers, exact command, and a note if your context is past the usable horizon.",
473
  "gguf.title": "🧊 GGUF Validity Bridge",
474
  "gguf.tip": "<strong>Fits in VRAM ≠ works</strong>. The GGUF/VRAM calculators read a model's metadata to tell you if a quant <em>fits in your GPU</em>. This reads the SAME metadata (rope_theta, context_length, quant scheme, head geometry) straight from the <code>.gguf</code> header via HTTP Range — no multi-GB download — and answers the question they don't: does attention quality actually hold, and how much does the quant erode it (γ-shift, ΔPPL)?",
475
  "gguf.desc": "Paste a GGUF repo (e.g. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), pick a quant file, and get a TAF quality verdict: the model's effective attention horizon, plus how much the chosen quantization shifts γ for <em>this specific architecture</em>. Reads only the file header in your browser.",
 
1100
  "help.privacy.title": "Privacy",
1101
  "help.privacy.body": "Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.",
1102
  "help.source.title": "Source & paper",
1103
+ "help.source.body": "Source code: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a>; arXiv forthcoming)<br>Dataset: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)",
1104
 
1105
  "footer.text": "© 2026 Carles Marin · Apache-2.0 · independent research · the tool that closes the loop of the paper.",
1106
 
 
1819
  "mode_desc.yarn": "Genera la configuración rope_scaling exacta para extender un modelo más allá de su contexto entrenado — más un veredicto TAF sobre si la calidad de atención aguanta realmente a la longitud objetivo.",
1820
  "modes.gguf": "🧊 Puente GGUF",
1821
  "mode_desc.gguf": "Lee la cabecera de metadata de un archivo GGUF (rope_theta, context_length, quant) en tu navegador y obtén un veredicto de calidad TAF — la pregunta que los calculadores de VRAM ignoran: ¿cabe Y funciona?",
1822
+ "modes.launch": "🚀 Flags de arranque",
1823
+ "mode_desc.launch": "Modelo + GPU + contexto → el comando exacto de arranque llama.cpp / Ollama (-ngl, -c, --no-mmap, tipo de KV-cache) con desglose de VRAM y aviso TAF cuando tu contexto pasa el horizonte usable.",
1824
+ "launch.title": "🚀 Generador de flags de arranque",
1825
+ "launch.tip": "<strong>Flags exactos + por qué, no solo \"cabe\"</strong>. Los calculadores de VRAM te dicen si un modelo cabe. Esto te da el comando <code>llama.cpp</code> / <code>Ollama</code> para pegar — <code>-ngl</code> capas a offload, <code>-c</code> contexto, <code>--no-mmap</code>, tipo de KV-cache — Y el chequeo de realidad TAF: si reservas KV para 128K pero el horizonte de atención del modelo es 32K, esa VRAM se desperdicia.",
1826
+ "launch.desc": "Elige modelo, GPU y contexto objetivo → obtén el comando exacto, desglose de VRAM (pesos + KV cache + overhead), y cuántas capas hacer offload. Resuelve el típico \"¿qué <code>-ngl</code> uso?\" / OOM de Blackwell.",
1827
+ "launch.model_label": "ID del modelo HF:",
1828
+ "launch.fetch_btn": "📥 Obtener geometría",
1829
+ "launch.quant_label": "Quant:",
1830
+ "launch.gpu_label": "GPU:",
1831
+ "launch.ctx_label": "Contexto objetivo L:",
1832
+ "launch.adv_label": "Avanzado:",
1833
+ "launch.cache_label": "KV cache:",
1834
+ "launch.fa_label": "Flash attention (-fa)",
1835
+ "launch.gen_btn": "🚀 Generar flags",
1836
+ "launch.need_id": "Introduce un id de modelo como 'Qwen/Qwen2.5-7B-Instruct'",
1837
+ "launch.fetching": "Obteniendo config.json de HF Hub…",
1838
+ "launch.layers": "capas",
1839
+ "launch.fetched_hint": "Elige GPU + contexto, luego Generar flags.",
1840
+ "launch.need_fetch": "Obtén un modelo primero (📥 Obtener geometría).",
1841
+ "launch.verdict.fits": "CABE — todo en GPU",
1842
+ "launch.verdict.partial": "PARCIAL — algunas capas en CPU (más lento)",
1843
+ "launch.verdict.too_big": "DEMASIADO GRANDE — no cabe ninguna capa en esta GPU",
1844
+ "launch.r.weights": "Pesos",
1845
+ "launch.r.kv": "KV cache",
1846
+ "launch.r.overhead": "Overhead / scratch",
1847
+ "launch.r.total": "Total",
1848
+ "launch.r.ngl": "Capas a offload (-ngl)",
1849
+ "launch.r.all": "todas",
1850
+ "launch.r.note": "La VRAM es una estimación (pesos por bits/param, KV por geometría de cabezas, scratch aproximado). d_horizon desde γ_Padé. Verifica el ajuste con una carga real — deja ~1 GB de margen.",
1851
+ "launch.warn.horizon_wasted": "El contexto objetivo pasa bastante el horizonte de atención del modelo — la KV para contexto más allá se desperdicia; el modelo no atenderá ahí. (TAF)",
1852
+ "launch.warn.beyond_trained": "L supera el contexto entrenado — también necesitas RoPE scaling para codificar posiciones tan lejos (ver Planificador YaRN).",
1853
+ "launch.warn.no_mmap": "Todas las capas caben → añadido --no-mmap para forzar los pesos a VRAM física (evita el problema de illegal-memory / OOM-al-cargar de Blackwell).",
1854
+ "launch.warn.partial": "Solo caben algunas capas en GPU — el resto corre en CPU (mucho más lento). Baja a un quant menor o contexto más corto para que quepa entero.",
1855
+ "launch.warn.cpu_only": "No cabe ninguna capa con estos ajustes — solo CPU. Usa un quant/contexto menor o una GPU mayor.",
1856
+ "launch.warn.no_params": "No se pudo leer el nº de parámetros — el tamaño de pesos es una estimación aproximada por geometría.",
1857
+ "launch.err.no_geom": "Obtén un modelo primero para leer su geometría.",
1858
+ "launch.err.no_gpu": "Elige una GPU o introduce un tamaño de VRAM personalizado.",
1859
+ "launch.err.no_ctx": "Introduce una longitud de contexto objetivo L.",
1860
+ "launch.copy": "Copiar comando",
1861
+ "help.v094.launch.title": "🚀 Generador de flags de arranque",
1862
+ "help.v094.launch.body": "Los calculadores de VRAM te dicen <em>si</em> un modelo cabe; no te dan el comando. Esto sí. Elige un modelo (obtiene geometría del <code>config.json</code> de HF), un quant, una GPU y un contexto objetivo — calcula el desglose de VRAM (pesos + KV cache + scratch), cuántas capas hacer offload (<code>-ngl</code>), y emite los comandos para pegar de <code>llama-server</code> y Ollama con contexto <code>-c</code>, flash-attention <code>-fa</code>, tipo de KV-cache, y <code>--no-mmap</code> (el fix de OOM de Blackwell). Más el chequeo de realidad TAF que ningún calculador da: si reservas KV para un contexto más allá del d_horizon del modelo, te avisa de que esa memoria se desperdicia. <em>Caso de uso</em>: '¿Qué <code>-ngl</code> para Llama-70B-Q4 en mi 4090?' → 39 de 80 capas, comando exacto, y un aviso si tu contexto pasa el horizonte usable.",
1863
  "gguf.title": "🧊 Puente de validez GGUF",
1864
  "gguf.tip": "<strong>Caber en VRAM ≠ funcionar</strong>. Los calculadores GGUF/VRAM leen la metadata de un modelo para decirte si un quant <em>cabe en tu GPU</em>. Esto lee la MISMA metadata (rope_theta, context_length, esquema de quant, geometría de cabezas) directamente de la cabecera <code>.gguf</code> vía HTTP Range — sin descargar GB — y responde lo que ellos no: ¿aguanta de verdad la calidad de atención, y cuánto la erosiona el quant (γ-shift, ΔPPL)?",
1865
  "gguf.desc": "Pega un repo GGUF (p.ej. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), elige un archivo de quant, y obtén un veredicto de calidad TAF: el horizonte de atención efectivo del modelo, más cuánto desplaza γ la cuantización elegida para <em>esta arquitectura concreta</em>. Solo lee la cabecera del archivo en tu navegador.",
 
2490
  "help.privacy.title": "Privacidad",
2491
  "help.privacy.body": "Todo corre en tu navegador. Sin telemetría, sin analytics, sin datos enviados a ningún sitio. Incluso el modelo LLM corre localmente vía WebGPU/WebAssembly. Tus model_ids y preguntas nunca abandonan esta página.",
2492
  "help.source.title": "Código fuente y paper",
2493
+ "help.source.body": "Código: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a>; arXiv próximamente)<br>Dataset: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 mediciones γ sobre 32 modelos (CC-BY-4.0)",
2494
 
2495
  "footer.text": "© 2026 Carles Marin · Apache-2.0 · investigación independiente · la herramienta que cierra el círculo del paper.",
2496
  },
 
3063
  "mode_desc.yarn": "Génère la configuration rope_scaling exacte pour étendre un modèle au-delà de son contexte d'entraînement — plus un verdict TAF sur la tenue réelle de la qualité d'attention à la longueur cible.",
3064
  "modes.gguf": "🧊 Pont GGUF",
3065
  "mode_desc.gguf": "Lit l'en-tête de métadonnées d'un fichier GGUF (rope_theta, context_length, quant) dans votre navigateur et donne un verdict de qualité TAF — la question que les calculateurs de VRAM ignorent : tient ET fonctionne ?",
3066
+ "modes.launch": "🚀 Flags de lancement",
3067
+ "mode_desc.launch": "Modèle + GPU + contexte → la commande exacte llama.cpp / Ollama (-ngl, -c, --no-mmap, type de KV-cache) avec ventilation VRAM et alerte TAF quand le contexte dépasse l'horizon utile.",
3068
+ "launch.title": "🚀 Générateur de flags de lancement",
3069
+ "launch.tip": "<strong>Flags exacts + pourquoi, pas juste \"tient\"</strong>. Les calculateurs de VRAM disent si un modèle tient. Ceci donne la commande <code>llama.cpp</code> / <code>Ollama</code> à coller — <code>-ngl</code> couches à décharger, <code>-c</code> contexte, <code>--no-mmap</code>, type de KV-cache — ET le contrôle de réalité TAF : si vous allouez du KV pour 128K mais que l'horizon d'attention du modèle est 32K, cette VRAM est gâchée.",
3070
+ "launch.desc": "Choisissez un modèle, un GPU et un contexte cible → obtenez la commande exacte, une ventilation VRAM (poids + KV cache + overhead), et combien de couches décharger. Résout le \"quel <code>-ngl</code> ?\" / OOM Blackwell récurrent.",
3071
+ "launch.model_label": "ID du modèle HF :",
3072
+ "launch.fetch_btn": "📥 Récupérer la géométrie",
3073
+ "launch.quant_label": "Quant :",
3074
+ "launch.gpu_label": "GPU :",
3075
+ "launch.ctx_label": "Contexte cible L :",
3076
+ "launch.adv_label": "Avancé :",
3077
+ "launch.cache_label": "KV cache :",
3078
+ "launch.fa_label": "Flash attention (-fa)",
3079
+ "launch.gen_btn": "🚀 Générer les flags",
3080
+ "launch.need_id": "Saisissez un id de modèle comme 'Qwen/Qwen2.5-7B-Instruct'",
3081
+ "launch.fetching": "Récupération de config.json depuis HF Hub…",
3082
+ "launch.layers": "couches",
3083
+ "launch.fetched_hint": "Choisissez GPU + contexte, puis Générer les flags.",
3084
+ "launch.need_fetch": "Récupérez d'abord un modèle (📥 Récupérer la géométrie).",
3085
+ "launch.verdict.fits": "TIENT — entièrement sur GPU",
3086
+ "launch.verdict.partial": "PARTIEL — certaines couches sur CPU (plus lent)",
3087
+ "launch.verdict.too_big": "TROP GROS — aucune couche ne tient sur ce GPU",
3088
+ "launch.r.weights": "Poids",
3089
+ "launch.r.kv": "KV cache",
3090
+ "launch.r.overhead": "Overhead / scratch",
3091
+ "launch.r.total": "Total",
3092
+ "launch.r.ngl": "Couches à décharger (-ngl)",
3093
+ "launch.r.all": "toutes",
3094
+ "launch.r.note": "La VRAM est une estimation (poids par bits/param, KV par géométrie des têtes, scratch grossier). d_horizon depuis γ_Padé. Vérifiez avec un chargement réel — laissez ~1 Go de marge.",
3095
+ "launch.warn.horizon_wasted": "Le contexte cible dépasse largement l'horizon d'attention du modèle — le KV au-delà est gâché ; le modèle n'y prêtera pas attention. (TAF)",
3096
+ "launch.warn.beyond_trained": "L dépasse le contexte d'entraînement — il faut aussi un RoPE scaling pour encoder les positions aussi loin (voir le Planificateur YaRN).",
3097
+ "launch.warn.no_mmap": "Toutes les couches tiennent → ajout de --no-mmap pour forcer les poids en VRAM physique (évite le problème illegal-memory / OOM-au-chargement de Blackwell).",
3098
+ "launch.warn.partial": "Seules certaines couches tiennent sur GPU — le reste tourne sur CPU (bien plus lent). Passez à un quant plus petit ou un contexte plus court pour tout faire tenir.",
3099
+ "launch.warn.cpu_only": "Aucune couche ne tient avec ces réglages — CPU seul. Utilisez un quant/contexte plus petit ou un GPU plus grand.",
3100
+ "launch.warn.no_params": "Impossible de lire le nombre de paramètres — la taille des poids est une estimation grossière par géométrie.",
3101
+ "launch.err.no_geom": "Récupérez d'abord un modèle pour lire sa géométrie.",
3102
+ "launch.err.no_gpu": "Choisissez un GPU ou saisissez une taille de VRAM personnalisée.",
3103
+ "launch.err.no_ctx": "Saisissez une longueur de contexte cible L.",
3104
+ "launch.copy": "Copier la commande",
3105
+ "help.v094.launch.title": "🚀 Générateur de flags de lancement",
3106
+ "help.v094.launch.body": "Les calculateurs de VRAM disent <em>si</em> un modèle tient ; ils ne donnent pas la commande. Ceci si. Choisissez un modèle (récupère la géométrie du <code>config.json</code> HF), un quant, un GPU et un contexte cible — il calcule la ventilation VRAM (poids + KV cache + scratch), combien de couches décharger (<code>-ngl</code>), et émet les commandes à coller <code>llama-server</code> et Ollama avec contexte <code>-c</code>, flash-attention <code>-fa</code>, type de KV-cache, et <code>--no-mmap</code> (le fix OOM Blackwell). Plus le contrôle de réalité TAF qu'aucun calculateur ne donne : si vous allouez du KV pour un contexte au-delà du d_horizon du modèle, il vous avertit que cette mémoire est gâchée. <em>Cas d'usage</em> : 'Quel <code>-ngl</code> pour Llama-70B-Q4 sur mon 4090 ?' → 39 couches sur 80, commande exacte, et une note si le contexte dépasse l'horizon utile.",
3107
  "gguf.title": "🧊 Pont de validité GGUF",
3108
  "gguf.tip": "<strong>Tenir dans la VRAM ≠ fonctionner</strong>. Les calculateurs GGUF/VRAM lisent les métadonnées d'un modèle pour dire si un quant <em>tient dans le GPU</em>. Ceci lit les MÊMES métadonnées (rope_theta, context_length, schéma de quant, géométrie des têtes) directement depuis l'en-tête <code>.gguf</code> via HTTP Range — sans télécharger des Go — et répond à ce qu'ils n'abordent pas : la qualité d'attention tient-elle vraiment, et de combien le quant l'érode-t-il (γ-shift, ΔPPL) ?",
3109
  "gguf.desc": "Collez un dépôt GGUF (ex. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), choisissez un fichier de quant, et obtenez un verdict de qualité TAF : l'horizon d'attention effectif du modèle, plus de combien la quantification choisie décale γ pour <em>cette architecture précise</em>. Ne lit que l'en-tête du fichier dans votre navigateur.",
 
3734
  "help.privacy.title": "Confidentialité",
3735
  "help.privacy.body": "Tout s'exécute dans votre navigateur. Pas de télémétrie, pas d'analytique, pas de données envoyées ailleurs. Même le modèle LLM s'exécute localement via WebGPU/WebAssembly. Vos model_ids et questions ne quittent jamais cette page.",
3736
  "help.source.title": "Code source et paper",
3737
+ "help.source.body": "Code : <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper : <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a> ; arXiv à venir)<br>Dataset : <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 mesures γ sur 32 modèles (CC-BY-4.0)",
3738
 
3739
  "footer.text": "© 2026 Carles Marin · Apache-2.0 · recherche indépendante · l'outil qui ferme la boucle du paper.",
3740
  },
 
4307
  "mode_desc.yarn": "生成精确的 rope_scaling 配置以将模型扩展到训练上下文之外 —— 外加 TAF 裁决:在目标长度下注意力质量是否真的撑得住。",
4308
  "modes.gguf": "🧊 GGUF 桥",
4309
  "mode_desc.gguf": "在浏览器内读取 GGUF 文件的元数据头(rope_theta、context_length、量化),给出 TAF 质量裁决 —— 显存计算器跳过的那个问题:塞得进且还能用吗?",
4310
+ "modes.launch": "🚀 启动参数",
4311
+ "mode_desc.launch": "模型 + GPU + 上下文 → 精确的 llama.cpp / Ollama 启动命令(-ngl、-c、--no-mmap、KV-cache 类型),附显存明细,以及当上下文超过可用视界时的 TAF 警告。",
4312
+ "launch.title": "🚀 启动参数生成器",
4313
+ "launch.tip": "<strong>精确参数 + 原因,不只是\"塞得进\"</strong>。显存计算器告诉你模型是否塞得进。本工具给你可复制粘贴的 <code>llama.cpp</code> / <code>Ollama</code> 命令 —— <code>-ngl</code> 卸载层数、<code>-c</code> 上下文、<code>--no-mmap</code>、KV-cache 类型 —— 以及 TAF 现实检查:若你为 128K 分配 KV 但模型注意力视界只有 32K,那部分显存就浪费了。",
4314
+ "launch.desc": "选择模型、GPU 和目标上下文 → 获得精确启动命令、显存明细(权重 + KV cache + 开销),以及卸载多少层。解决常见的\"该用什么 <code>-ngl</code>?\"/ Blackwell OOM 的猜测。",
4315
+ "launch.model_label": "HF 模型 id:",
4316
+ "launch.fetch_btn": "📥 获取几何",
4317
+ "launch.quant_label": "量化:",
4318
+ "launch.gpu_label": "GPU:",
4319
+ "launch.ctx_label": "目标上下文 L:",
4320
+ "launch.adv_label": "高级:",
4321
+ "launch.cache_label": "KV cache:",
4322
+ "launch.fa_label": "Flash attention (-fa)",
4323
+ "launch.gen_btn": "🚀 生成参数",
4324
+ "launch.need_id": "输入模型 id,如 'Qwen/Qwen2.5-7B-Instruct'",
4325
+ "launch.fetching": "正在从 HF Hub 获取 config.json…",
4326
+ "launch.layers": "层",
4327
+ "launch.fetched_hint": "选择 GPU + 上下文,然后生成参数。",
4328
+ "launch.need_fetch": "请先获取模型(📥 获取几何)。",
4329
+ "launch.verdict.fits": "塞得进 —— 全部在 GPU",
4330
+ "launch.verdict.partial": "部分 —— 部分层在 CPU(更慢)",
4331
+ "launch.verdict.too_big": "太大 —— 此 GPU 一层都放不下",
4332
+ "launch.r.weights": "权重",
4333
+ "launch.r.kv": "KV cache",
4334
+ "launch.r.overhead": "开销 / scratch",
4335
+ "launch.r.total": "总计",
4336
+ "launch.r.ngl": "卸载层数 (-ngl)",
4337
+ "launch.r.all": "全部",
4338
+ "launch.r.note": "显存为估计值(权重按 bits/参数,KV 按头几何,scratch 粗略)。d_horizon 来自 γ_Padé。请用真实加载核实 —— 留约 1 GB 余量。",
4339
+ "launch.warn.horizon_wasted": "目标上下文远超模型的注意力视界 —— 超出部分的 KV 内存被浪费;模型不会关注那里。(TAF)",
4340
+ "launch.warn.beyond_trained": "L 超过训练上下文 —— 还需要 RoPE scaling 才能编码那么远的位置(见 YaRN 规划器)。",
4341
+ "launch.warn.no_mmap": "所有层都放得下 → 已加 --no-mmap 强制权重进入物理显存(避免 Blackwell 的 illegal-memory / 加载时 OOM 问题)。",
4342
+ "launch.warn.partial": "只有部分层放进 GPU —— 其余在 CPU 运行(慢得多)。换更小的量化或更短的上下文以完整放入。",
4343
+ "launch.warn.cpu_only": "这些设置下一层都放不下 —— 仅 CPU。请用更小的量化/上下文或更大的 GPU。",
4344
+ "launch.warn.no_params": "无法读取参数量 —— 权重大小为按几何的粗略估计。",
4345
+ "launch.err.no_geom": "请先获取模型以读取其几何。",
4346
+ "launch.err.no_gpu": "请选择 GPU 或输入自定义显存大小。",
4347
+ "launch.err.no_ctx": "请输入目标上下文长度 L。",
4348
+ "launch.copy": "复制命令",
4349
+ "help.v094.launch.title": "🚀 启动参数生成器",
4350
+ "help.v094.launch.body": "显存计算器告诉你模型<em>是否</em>塞得进;它们不给你命令。本工具给。选择一个模型(从 HF <code>config.json</code> 获取几何)、一个量化、一个 GPU 和目标上下文 —— 它计算显存明细(权重 + KV cache + scratch)、卸载多少层(<code>-ngl</code>),并输出可复制粘贴的 <code>llama-server</code> 和 Ollama 命令,带 <code>-c</code> 上下文、<code>-fa</code> flash-attention、KV-cache 类型,以及 <code>--no-mmap</code>(Blackwell OOM 修复)。还有任何计算器都不给的 TAF 现实检查:若你为超过模型 d_horizon 的上下文分配 KV,它会警告你那部分内存被浪费。<em>用例</em>:'我的 4090 上 Llama-70B-Q4 该用什么 <code>-ngl</code>?' → 80 层中的 39 层、精确命令,以及若上下文超过可用视界的提示。",
4351
  "gguf.title": "🧊 GGUF 有效性桥",
4352
  "gguf.tip": "<strong>塞进显存 ≠ 能用</strong>。GGUF/显存计算器读取模型元数据来告诉你某量化<em>是否塞得进 GPU</em>。本工具通过 HTTP Range 直接从 <code>.gguf</code> 头读取同样的元数据(rope_theta、context_length、量化方案、注意力头几何)—— 无需下载数 GB —— 并回答它们不答的:注意力质量是否真的撑得住,量化又侵蚀了多少(γ-shift、ΔPPL)?",
4353
  "gguf.desc": "粘贴一个 GGUF 仓库(如 <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>),选择一个量化文件,获得 TAF 质量裁决:模型的有效注意力视界,以及所选量化对<em>这个具体架构</em>的 γ 位移有多大。只在浏览器内读取文件头。",
 
4978
  "help.privacy.title": "隐私",
4979
  "help.privacy.body": "一切都在您的浏览器中运行。无遥测,无分析,无数据发送到任何地方。即使是 LLM 模型也通过 WebGPU/WebAssembly 在本地运行。您的 model_ids 和问题永不离开此页面。",
4980
  "help.source.title": "源代码和论文",
4981
+ "help.source.body": "源代码: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>论文: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a>; arXiv 即将)<br>数据集: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 32个模型上的58次γ测量 (CC-BY-4.0)",
4982
 
4983
  "footer.text": "© 2026 Carles Marin · Apache-2.0 · 独立研究 · 闭合论文回路的工具。",
4984
  },
js/launch_flags.js ADDED
@@ -0,0 +1,170 @@
+// Launch-Flag Generator (v0.9.4 anti-bullshit pack)
+//
+// Input a model + GPU + target context → the exact llama.cpp / Ollama launch
+// flags (-ngl layers to offload, -c context, --no-mmap, cache-type), with a
+// VRAM breakdown AND the TAF angle the pure VRAM calculators miss: "you CAN
+// allocate KV for 128K, but this model's attention horizon is ~32K — context
+// past that is wasted memory." Solves the recurring r/LocalLLaMA pain of
+// guessing -ngl / hitting Blackwell OOM. All browser-only.
+
+import { gammaPade } from "./gamma_check.js";
+import { dHorizon } from "./yarn_planner.js";
+
+// Curated GPU VRAM presets (GB). Unified-memory Macs included (shared pool).
+export const GPU_PRESETS = [
+  { id: "rtx3060", label: "RTX 3060 12GB", vram: 12 },
+  { id: "rtx4060ti", label: "RTX 4060 Ti 16GB", vram: 16 },
+  { id: "rtx4070", label: "RTX 4070 12GB", vram: 12 },
+  { id: "rtx4080", label: "RTX 4080 16GB", vram: 16 },
+  { id: "rtx3090", label: "RTX 3090 24GB", vram: 24 },
+  { id: "rtx4090", label: "RTX 4090 24GB", vram: 24 },
+  { id: "rtx5090", label: "RTX 5090 32GB", vram: 32 },
+  { id: "a100_40", label: "A100 40GB", vram: 40 },
+  { id: "a100_80", label: "A100 80GB", vram: 80 },
+  { id: "h100", label: "H100 80GB", vram: 80 },
+  { id: "h200", label: "H200 141GB", vram: 141 },
+  { id: "mac32", label: "Mac 32GB (unified)", vram: 24 }, // ~75% usable for GPU
+  { id: "mac64", label: "Mac 64GB (unified)", vram: 48 },
+  { id: "mac128", label: "Mac 128GB (unified)", vram: 96 },
+];
+
+// Effective bits-per-weight per GGUF quant (includes K-quant block overhead).
+export const QUANT_BPW = {
+  F16: 16.0,
+  Q8_0: 8.5,
+  Q6_K: 6.56,
+  Q5_K_M: 5.67,
+  Q4_K_M: 4.83,
+  Q4_0: 4.55,
+  Q3_K_M: 3.91,
+  Q2_K: 2.63,
+};
+
+// KV-cache element bytes per cache dtype.
+const CACHE_BYTES = { fp16: 2, q8_0: 1, q4_0: 0.5 };
+
+const GB = 1024 ** 3;
+
+// Estimate parameter count from geometry when the model card doesn't state it.
+// Uses the exact decoder layout (attention with GQA + SwiGLU MLP + embeddings)
+// when intermediate_size is known — the 12·h² shortcut undercounts modern
+// large-FFN models (Qwen2.5-7B is really 7.6B, not the ~5.4B the shortcut gives).
+export function estimateNParams({ nParams, hidden, nLayers, vocab, intermediate, nKvHeads, headDim, tieEmbeddings }) {
+  if (Number.isFinite(nParams) && nParams > 0) return nParams;
+  if (!hidden || !nLayers) return null;
+  let perLayer;
+  if (intermediate) {
+    const kvDim = (nKvHeads && headDim) ? nKvHeads * headDim : hidden; // GQA shrinks K,V
+    const attn = 2 * hidden * hidden + 2 * hidden * kvDim;             // q,o + k,v
+    const mlp = 3 * hidden * intermediate;                             // gate,up,down (SwiGLU)
+    perLayer = attn + mlp;
+  } else {
+    perLayer = 12 * hidden * hidden; // fallback heuristic
+  }
+  const embed = vocab ? (tieEmbeddings ? 1 : 2) * vocab * hidden : 0;
+  return perLayer * nLayers + embed;
+}
+
+// KV cache bytes for the whole model at context L.
+function kvCacheBytes(nLayers, nKvHeads, headDim, L, cacheType) {
+  const elem = CACHE_BYTES[cacheType] ?? 2;
+  return 2 /* K+V */ * nLayers * nKvHeads * headDim * L * elem;
+}
+
+export function planLaunch(opts) {
+  const {
+    nParams, nLayers, nKvHeads, headDim, hidden, ropeTheta, ctxTrain,
+    quant = "Q4_K_M", vramGB, targetCtx, cacheType = "fp16", flashAttn = true,
+  } = opts;
+
+  const out = { ok: false, warnings: [] };
+  if (!nLayers || !nKvHeads || !headDim) { out.verdict = "no_geometry"; return out; }
+  if (!Number.isFinite(vramGB) || vramGB <= 0) { out.verdict = "no_gpu"; return out; }
+  if (!Number.isFinite(targetCtx) || targetCtx <= 0) { out.verdict = "no_ctx"; return out; }
+
+  const bpw = QUANT_BPW[quant] ?? 4.83;
+  const N = estimateNParams({
+    nParams, hidden, nLayers, vocab: opts.vocab,
+    intermediate: opts.intermediate, nKvHeads, headDim, tieEmbeddings: opts.tieEmbeddings,
+  });
+
+  const weightsB = N ? (N * bpw / 8) : null;
+  const kvB = kvCacheBytes(nLayers, nKvHeads, headDim, targetCtx, cacheType);
+  // Compute/scratch buffer: roughly scales with context × hidden. Flash-attention
+  // shrinks the attention scratch substantially. Coarse estimate, flagged as such.
+  const scratchB = (flashAttn ? 0.25 : 0.6) * GB + (hidden ? 0.5 * hidden * targetCtx * 2 : 0);
+  const overheadB = 0.4 * GB + scratchB;
+
+  const weightsGB = weightsB != null ? weightsB / GB : null;
+  const kvGB = kvB / GB;
+  const overheadGB = overheadB / GB;
+  const totalGB = (weightsGB ?? 0) + kvGB + overheadGB;
+
+  // Layer-offload (-ngl). ~88% of weights live in transformer layers; the rest
+  // (embeddings/output) load with any GPU offload.
+  const layerFrac = 0.88;
+  const layerWeightsGB = weightsGB != null ? weightsGB * layerFrac : null;
+  const nonLayerGB = weightsGB != null ? weightsGB * (1 - layerFrac) : 0;
+  const kvPerLayerGB = kvGB / nLayers;
+  const perLayerGB = (layerWeightsGB != null ? layerWeightsGB / nLayers : 0) + kvPerLayerGB;
+
+  let ngl, allOnGpu, fits;
+  if (weightsGB == null) {
+    ngl = null; allOnGpu = false; fits = false;
+    out.warnings.push({ code: "no_params" });
+  } else if (totalGB <= vramGB) {
+    ngl = nLayers; allOnGpu = true; fits = true;
+  } else {
+    const avail = vramGB - overheadGB - nonLayerGB;
+    ngl = perLayerGB > 0 ? Math.max(0, Math.floor(avail / perLayerGB)) : 0;
+    ngl = Math.min(ngl, nLayers);
+    allOnGpu = false; fits = false;
+  }
+
+  // TAF horizon: does the model's attention actually reach the context you're
+  // paying KV memory for? This is the differentiator vs pure VRAM calculators.
+  const theta = Number(ropeTheta) || 10000;
+  const gammaTrain = ctxTrain ? gammaPade(theta, ctxTrain) : null;
+  const dHoriz = gammaTrain != null ? dHorizon(theta, gammaTrain) : null;
+  const horizonWasted = dHoriz != null && targetCtx > dHoriz * 1.25;
+  if (horizonWasted) out.warnings.push({ code: "horizon_wasted", params: { dHoriz, target: targetCtx } });
+  if (ctxTrain && targetCtx > ctxTrain) out.warnings.push({ code: "beyond_trained", params: { ctxTrain, target: targetCtx } });
+  if (allOnGpu) out.warnings.push({ code: "no_mmap_blackwell" });
+  if (!fits && ngl > 0) out.warnings.push({ code: "partial_offload", params: { ngl, nLayers } });
+  if (!fits && ngl === 0) out.warnings.push({ code: "cpu_only", params: {} });
+
+  out.ok = true;
+  Object.assign(out, {
+    verdict: fits ? "fits" : (ngl > 0 ? "partial" : "too_big"),
+    nParams: N, bpw, quant, cacheType, flashAttn,
+    weightsGB, kvGB, overheadGB, totalGB, vramGB,
+    ngl, allOnGpu, nLayers,
+    theta, dHoriz, gammaTrain, ctxTrain, targetCtx,
+  });
+  return out;
+}
+
+// Build the copy-paste commands for both engines.
+export function launchCommands(plan, modelRef = "<model.gguf>") {
+  const nglStr = plan.allOnGpu ? "99" : String(plan.ngl);
+  const cache = plan.cacheType !== "fp16" ? ` -ctk ${plan.cacheType} -ctv ${plan.cacheType}` : "";
+  const fa = plan.flashAttn ? " -fa" : "";
+  const mmap = plan.allOnGpu ? " --no-mmap" : "";
+  const llamacpp =
+    `llama-server -m ${modelRef} \\\n` +
+    `  -ngl ${nglStr} -c ${plan.targetCtx}${fa}${cache}${mmap}`;
+
+  // Ollama: Modelfile params + env. num_gpu = layers on GPU.
+  const olEnv = [
+    plan.flashAttn ? "OLLAMA_FLASH_ATTENTION=1" : null,
+    plan.cacheType !== "fp16" ? `OLLAMA_KV_CACHE_TYPE=${plan.cacheType}` : null,
+  ].filter(Boolean).join(" ");
+  const ollama =
+    (olEnv ? olEnv + " \\\n" : "") +
+    `ollama run <model>\n` +
+    `# Modelfile / params:\n` +
+    `PARAMETER num_ctx ${plan.targetCtx}\n` +
+    `PARAMETER num_gpu ${nglStr === "99" ? plan.nLayers : nglStr}`;
+
+  return { llamacpp, ollama };
+}
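
The parameter-count and KV-cache arithmetic in `js/launch_flags.js` can be checked by hand. The sketch below is not part of the commit: it re-implements the two formulas as standalone helpers (`exactParams`, `shortcutParams`, and `kvBytes` are hypothetical names) and plugs in a geometry assumed to match Qwen2.5-7B's public `config.json`. Those values are an assumption and should be re-checked against the real config. Biases and norm weights are ignored, as in the file above.

```javascript
// Assumed Qwen2.5-7B geometry (hypothetical values; verify against config.json).
const geom = {
  hidden: 3584, nLayers: 28, nKvHeads: 4, headDim: 128,
  intermediate: 18944, vocab: 152064, tieEmbeddings: false,
};

// Embedding table, counted twice when input and output embeddings are untied.
function embedParams(g) {
  return (g.tieEmbeddings ? 1 : 2) * g.vocab * g.hidden;
}

// Exact decoder layout: q,o are hidden x hidden; k,v shrink to hidden x kvDim
// under GQA; SwiGLU adds three hidden x intermediate matrices.
function exactParams(g) {
  const kvDim = g.nKvHeads * g.headDim;
  const attn = 2 * g.hidden * g.hidden + 2 * g.hidden * kvDim;
  const mlp = 3 * g.hidden * g.intermediate;
  return (attn + mlp) * g.nLayers + embedParams(g);
}

// The 12*h^2 per-layer shortcut, kept only for comparison.
function shortcutParams(g) {
  return 12 * g.hidden * g.hidden * g.nLayers + embedParams(g);
}

// KV cache: K and V, per layer, per KV head, per head dim, per token.
function kvBytes(g, ctx, elemBytes) {
  return 2 * g.nLayers * g.nKvHeads * g.headDim * ctx * elemBytes;
}

const GiB = 1024 ** 3;
const exact = exactParams(geom);               // 7615283200 (about 7.6B)
const shortcut = shortcutParams(geom);         // 5405933568 (about 5.4B)
const kvGiB = kvBytes(geom, 32768, 2) / GiB;   // 1.75 GiB at 32K context, fp16 cache
const weightsGiB = (exact * 4.83 / 8) / GiB;   // weights at Q4_K_M (4.83 bits/weight)

console.log({ exact, shortcut, kvGiB, weightsGiB });
```

Under these assumed inputs the exact layout gives roughly 7.6B parameters against roughly 5.4B for the shortcut, which is the gap the comment on `estimateNParams` describes. Switching the cache element size from 2 bytes to 1 (`q8_0`) halves the KV figure.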
js/main.js CHANGED
@@ -40,6 +40,7 @@ import {
 } from "./longscore.js";
 import { planExtension, suggestRopeType } from "./yarn_planner.js";
 import { listGgufFiles, fetchGgufMetadata, ggufToConfig, quantFromFilename, analyzeGguf } from "./gguf_bridge.js";
+import { GPU_PRESETS, QUANT_BPW, planLaunch, launchCommands } from "./launch_flags.js";
 
 // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
 // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
@@ -235,6 +236,7 @@ document.addEventListener("click", (e) => {
     hub: "hub-section",
     yarn: "yarn-section",
     gguf: "gguf-section",
+    launch: "launch-section",
   }[targetMode];
   if (sectionId) {
     const sec = document.getElementById(sectionId);
@@ -259,7 +261,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
     "diagnose-section", "phase-section", "unmask-section",
     "template-section", "arena-section", "contam-section",
     "quant-section", "drift-section", "niah-section",
-    "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "longscore-section", "hub-section", "yarn-section", "gguf-section"].forEach(id => {
+    "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "longscore-section", "hub-section", "yarn-section", "gguf-section", "launch-section"].forEach(id => {
     const el = $(id);
     if (el) el.style.display = "none";
   });
@@ -280,6 +282,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
     hub: "hub-section",
     yarn: "yarn-section",
     gguf: "gguf-section",
+    launch: "launch-section",
   };
   const sectionId = sectionMap[mode];
   if (sectionId) $(sectionId).style.display = "";
@@ -295,6 +298,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
     if (mode === "hub") initHub();
     if (mode === "yarn") initYarn();
     if (mode === "gguf") initGguf();
+    if (mode === "launch") initLaunch();
   });
 });
 
@@ -4951,6 +4955,127 @@ function renderGgufComparison(cfg, rows) {
     <p class="subtle" style="font-size:0.88em;">${t("gguf.r.note")}</p>`;
 }
 
4958
+ // ════════════════════════════════════════════════════════════════════
4959
+ // 🚀 Launch-Flag Generator (v0.9.4)
4960
+ // ════════════════════════════════════════════════════════════════════
4961
+ let _launchWired = false;
4962
+ let _launchGeom = null; // fetched model geometry
4963
+ function initLaunch() {
4964
+ if (_launchWired) return;
4965
+ _launchWired = true;
4966
+
4967
+ // Populate GPU presets.
4968
+ const gpuSel = $("launch-gpu");
4969
+ if (gpuSel && !gpuSel.options.length) {
4970
+ gpuSel.innerHTML = GPU_PRESETS.map(g => `<option value="${g.vram}">${escapeHtml(g.label)}</option>`).join("");
4971
+ gpuSel.value = "24"; // sensible default (4090)
4972
+ }
+
+   const fetchBtn = $("launch-fetch-btn");
+   const modelEl = $("launch-model");
+   // Picking from autocomplete auto-fetches geometry (matches the other modes).
+   if (modelEl) attachHfAutocomplete(modelEl, { onSelect: () => fetchBtn?.click() });
+
+   fetchBtn?.addEventListener("click", async () => {
+     const id = (modelEl.value || "").trim();
+     if (!id) { $("launch-status").textContent = "⚠ " + t("launch.need_id"); return; }
+     $("launch-status").textContent = "⏳ " + t("launch.fetching");
+     fetchBtn.disabled = true;
+     state.lastModelId = id;
+     try {
+       const cfg = await fetchHfConfig(id);
+       const nAttn = cfg.num_attention_heads ?? null;
+       const rs = (cfg.rope_scaling && typeof cfg.rope_scaling === "object") ? cfg.rope_scaling : {};
+       _launchGeom = {
+         nLayers: cfg.num_hidden_layers ?? null,
+         nKvHeads: cfg.num_key_value_heads ?? nAttn,
+         headDim: cfg.head_dim ?? (cfg.hidden_size && nAttn ? cfg.hidden_size / nAttn : null),
+         hidden: cfg.hidden_size ?? null,
+         vocab: cfg.vocab_size ?? null,
+         intermediate: cfg.intermediate_size ?? null,
+         tieEmbeddings: cfg.tie_word_embeddings ?? false,
+         nParams: cfg.num_parameters ?? null,
+         ropeTheta: cfg.rope_theta ?? 10000,
+         ctxTrain: rs.original_max_position_embeddings ?? cfg.max_position_embeddings ?? null,
+       };
+       if (!$("launch-ctx").value && _launchGeom.ctxTrain) $("launch-ctx").value = _launchGeom.ctxTrain;
+       const via = cfg.__via_mirror ? ` (via ${escapeHtml(cfg.__via_mirror)})` : "";
+       $("launch-status").innerHTML = `✅ <strong>${escapeHtml(id)}</strong>${via}: ${_launchGeom.nLayers} ${t("launch.layers")}, ` +
+         `GQA ${nAttn}:${_launchGeom.nKvHeads}, θ=${_thetaFmt(_launchGeom.ropeTheta)}, ctx ${_yarnFmtK(_launchGeom.ctxTrain)}. ${t("launch.fetched_hint")}`;
+     } catch (err) {
+       $("launch-status").textContent = `❌ ${err.message}`;
+     } finally {
+       fetchBtn.disabled = false;
+     }
+   });
+
+   $("launch-gen-btn")?.addEventListener("click", () => {
+     if (!_launchGeom) { $("launch-status").textContent = "⚠ " + t("launch.need_fetch"); return; }
+     const vram = parseFloat($("launch-vram").value) || parseFloat(gpuSel.value);
+     const plan = planLaunch({
+       ..._launchGeom,
+       quant: $("launch-quant").value,
+       vramGB: vram,
+       targetCtx: parseFloat($("launch-ctx").value),
+       cacheType: $("launch-cache").value,
+       flashAttn: $("launch-fa").checked,
+     });
+     renderLaunch(plan);
+   });
+ }
+
+ function _launchWarnText(w) {
+   switch (w.code) {
+     case "horizon_wasted": return `${t("launch.warn.horizon_wasted")} (d_horizon ≈ ${_yarnFmtK(w.params.dHoriz)}, L=${_yarnFmtK(w.params.target)})`;
+     case "beyond_trained": return `${t("launch.warn.beyond_trained")} (${_yarnFmtK(w.params.ctxTrain)} → ${_yarnFmtK(w.params.target)})`;
+     case "no_mmap_blackwell": return t("launch.warn.no_mmap");
+     case "partial_offload": return `${t("launch.warn.partial")} (${w.params.ngl}/${w.params.nLayers})`;
+     case "cpu_only": return t("launch.warn.cpu_only");
+     case "no_params": return t("launch.warn.no_params");
+     default: return w.code;
+   }
+ }
+
+ function renderLaunch(p) {
+   const out = $("launch-output");
+   if (!out) return;
+   out.style.display = "";
+   const errMap = { no_geometry: "launch.err.no_geom", no_gpu: "launch.err.no_gpu", no_ctx: "launch.err.no_ctx" };
+   if (errMap[p.verdict]) { out.innerHTML = `<div class="gc-validity-warning">⚠ ${t(errMap[p.verdict])}</div>`; return; }
+
+   const meta = ({
+     fits: { emoji: "✅", cls: "v-yes" },
+     partial: { emoji: "⚠️", cls: "v-deg" },
+     too_big: { emoji: "🚨", cls: "v-no" },
+   })[p.verdict] || { emoji: "❓", cls: "v-deg" };
+
+   const cmds = launchCommands(p);
+   const td = "padding:3px 12px 3px 0;";
+   const gb = n => (n == null ? "—" : n.toFixed(1) + " GB");
+   const warnHtml = p.warnings.map(w => `<li>${_launchWarnText(w)}</li>`).join("");
+
+   out.innerHTML = `
+     <p><span class="verdict-badge ${meta.cls}">${meta.emoji} ${t("launch.verdict." + p.verdict)}</span></p>
+     <table style="border-collapse:collapse;font-size:0.95em;margin:0.5em 0;">
+       <tr><td style="${td}">${t("launch.r.weights")}</td><td>${gb(p.weightsGB)} <span class="subtle">(${p.quant}, ${p.bpw} bpw)</span></td></tr>
+       <tr><td style="${td}">${t("launch.r.kv")}</td><td>${gb(p.kvGB)} <span class="subtle">(${p.cacheType}${p.flashAttn ? ", -fa" : ""})</span></td></tr>
+       <tr><td style="${td}">${t("launch.r.overhead")}</td><td>${gb(p.overheadGB)}</td></tr>
+       <tr style="border-top:1px solid var(--border);"><td style="${td}"><strong>${t("launch.r.total")}</strong></td><td><strong>${gb(p.totalGB)}</strong> / ${gb(p.vramGB)} VRAM</td></tr>
+       <tr><td style="${td}">${t("launch.r.ngl")}</td><td><strong>${p.allOnGpu ? `${p.nLayers} (${t("launch.r.all")})` : `${p.ngl} / ${p.nLayers}`}</strong></td></tr>
+     </table>
+     <h3>llama.cpp</h3>
+     <pre class="diag-cmd-box">${escapeHtml(cmds.llamacpp)}</pre>
+     <button id="launch-copy-llama" class="secondary">📋 ${t("launch.copy")}</button>
+     <h3 style="margin-top:0.8em;">Ollama</h3>
+     <pre class="diag-cmd-box">${escapeHtml(cmds.ollama)}</pre>
+     ${warnHtml ? `<ul style="font-size:0.9em;margin-top:0.8em;opacity:0.9;">${warnHtml}</ul>` : ""}
+     <p class="subtle" style="font-size:0.86em;">${t("launch.r.note")}</p>`;
+
+   $("launch-copy-llama")?.addEventListener("click", async () => {
+     try { await navigator.clipboard.writeText(cmds.llamacpp); $("launch-copy-llama").textContent = "✓ " + t("yarn.copied"); } catch (e) {}
+   });
+ }
+
  // ════════════════════════════════════════════════════════════════════
  // Bootstrap
  // ════════════════════════════════════════════════════════════════════
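Reviewer note: the fetch handler above reduces a Hugging Face `config.json` to head geometry, which `planLaunch()` (in `js/launch_flags.js`, not shown in this hunk) turns into a KV-cache estimate. The sketch below illustrates that arithmetic under stated assumptions: `kvCacheBytes` and `geometryFromConfig` are hypothetical names, the formula is the standard "K and V per layer, per KV head, per token" count rather than a copy of the committed code, and the result is reported in GiB, which may differ from the unit the app displays. The config values are the Qwen2.5-7B geometry used by the test file.

```javascript
// Hypothetical helpers for illustration; not the planLaunch() implementation.

// Mirrors the fallbacks in the fetch handler: KV heads default to the
// attention head count (no GQA), head_dim defaults to hidden_size / heads.
function geometryFromConfig(cfg) {
  const nAttn = cfg.num_attention_heads ?? null;
  return {
    nLayers: cfg.num_hidden_layers ?? null,
    nKvHeads: cfg.num_key_value_heads ?? nAttn,
    headDim: cfg.head_dim ?? (cfg.hidden_size && nAttn ? cfg.hidden_size / nAttn : null),
  };
}

// One K vector and one V vector per layer, per KV head, per cached token.
// bytesPerElem = 2 corresponds to an fp16 cache.
function kvCacheBytes({ nLayers, nKvHeads, headDim, ctx, bytesPerElem = 2 }) {
  return 2 * nLayers * nKvHeads * headDim * ctx * bytesPerElem;
}

const geom = geometryFromConfig({
  num_hidden_layers: 28,
  num_attention_heads: 28,
  num_key_value_heads: 4,
  hidden_size: 3584,
});
const kvGiB = kvCacheBytes({ ...geom, ctx: 32768 }) / 2 ** 30;
console.log(kvGiB.toFixed(2)); // 1.75
```

With GQA the cache scales with the 4 KV heads rather than the 28 attention heads, which is why the `num_key_value_heads ?? nAttn` fallback matters: dropping it would overestimate this model's cache sevenfold.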
registry-bootstrap/README.md CHANGED
@@ -157,7 +157,7 @@ unless otherwise noted by the contributor. The TAF Agent code itself is

  - 🔬 [TAF Agent web tool](https://karlesmarin.github.io/tafagent) — the diagnostic itself
  - 📦 [TAF Agent source](https://github.com/karlesmarin/tafagent) — open source
- - 📄 [Underlying paper](https://zenodo.org/records/19826343) — Marin 2026,
+ - 📄 [Underlying paper](https://zenodo.org/records/20314038) — Marin 2026,
  *Predicting How Transformers Attend*

  ---
test_launch.mjs ADDED
@@ -0,0 +1,77 @@
+ import { chromium } from "playwright";
+ const b = await chromium.launch({ headless: true });
+ const p = await (await b.newContext()).newPage();
+ const errors=[]; const benign=s=>/40\d/.test(s);
+ p.on("console",m=>{if(m.type()==="error"&&!benign(m.text()))errors.push("[c]"+m.text());});
+ p.on("pageerror",e=>errors.push("[pe]"+e.message));
+ const log=s=>process.stdout.write(s+"\n"); let pass=0,fail=0;
+ const check=(n,c,x="")=>{log(`${c?" OK ":" FAIL"} ${n} ${x}`);c?pass++:fail++;};
+
+ await p.goto("http://127.0.0.1:8000/index.html",{waitUntil:"domcontentloaded",timeout:90000});
+ await p.waitForTimeout(2500);
+ await p.click(`.lang-btn[data-lang="en"]`); await p.waitForTimeout(200);
+ check("module loads, 0 errors", errors.length===0, `(${errors.length})`);
+
+ await p.click('[data-mode-link="launch"]',{timeout:5000}); await p.waitForTimeout(400);
+ check("section visible", await p.evaluate(()=>{const s=document.querySelector("#launch-section");return s&&getComputedStyle(s).display!=="none";}));
+ check("GPU presets populated", await p.evaluate(()=>document.querySelector("#launch-gpu").options.length>5));
+
+ log("\n── Fetch geometry ──");
+ await p.fill("#launch-model","Qwen/Qwen2.5-7B-Instruct");
+ await p.keyboard.press("Escape");
+ await p.click("#launch-fetch-btn"); await p.waitForTimeout(3500);
+ const st=await p.evaluate(()=>document.querySelector("#launch-status").innerText);
+ check("geometry fetched (layers/GQA shown)", /layers|GQA|θ=/.test(st), st.slice(0,70));
+ check("ctx auto-filled", await p.evaluate(()=>!!document.querySelector("#launch-ctx").value));
+
+ async function gen({quant,gpu,vram,ctx,cache,fa}){
+   if(quant) await p.selectOption("#launch-quant",quant);
+   if(gpu) await p.selectOption("#launch-gpu",gpu);
+   await p.fill("#launch-vram",vram!=null?String(vram):"");
+   if(ctx!=null) await p.fill("#launch-ctx",String(ctx));
+   if(cache) await p.selectOption("#launch-cache",cache);
+   if(fa!=null){const c=await p.isChecked("#launch-fa"); if(c!==fa) await p.click("#launch-fa");}
+   await p.click("#launch-gen-btn"); await p.waitForTimeout(300);
+   return p.evaluate(()=>{const o=document.querySelector("#launch-output");return{
+     verdict:o.querySelector(".verdict-badge")?.innerText?.trim()||"", text:o.innerText};});
+ }
+
+ log("\n── FITS case (7B Q4 on 24GB) ──");
+ let r=await gen({quant:"Q4_K_M",gpu:"24",vram:null,ctx:32768,cache:"fp16",fa:true});
+ check("verdict FITS", /FITS/.test(r.verdict), r.verdict);
+ check("ngl = all layers", /all|28/.test(r.text));
+ check("llama-server cmd present", /llama-server/.test(r.text));
+ check("ollama cmd present", /ollama|num_ctx/.test(r.text));
+ check("--no-mmap added when all-on-GPU", /--no-mmap/.test(r.text));
+ check("-fa present", /-fa/.test(r.text));
+ check("VRAM breakdown (weights/KV)", /Weights|KV cache/.test(r.text));
+
+ log("\n── PARTIAL case (7B Q4 on tiny 3GB custom) ──");
+ r=await gen({quant:"Q4_K_M",vram:3,ctx:8192,fa:true});
+ check("verdict PARTIAL or TOO BIG", /PARTIAL|TOO BIG/.test(r.verdict), r.verdict);
+ check("partial offload warning or cpu-only", /CPU|layers fit|smaller quant/i.test(r.text));
+
+ log("\n── cache quant changes KV flag ──");
+ r=await gen({quant:"Q4_K_M",gpu:"24",vram:null,ctx:32768,cache:"q8_0",fa:true});
+ check("KV cache q8_0 → -ctk/-ctv in cmd", /-ctk q8_0/.test(r.text));
+
+ log("\n── beyond-trained warning ──");
+ r=await gen({quant:"Q4_K_M",gpu:"80",vram:null,ctx:262144,cache:"fp16",fa:true});
+ check("L beyond trained → warning", /trained|RoPE|YaRN/i.test(r.text), "L=256K");
+
+ log("\n── error: generate before fetch (fresh) ──");
+ // Skipped: the fetched geometry cannot be cleared without reloading the page, so the generate-before-fetch error path is not exercised here.
+
+ log("\n── 4 languages ──");
+ for(const lang of ["es","fr","zh","en"]){
+   await p.click(`.lang-btn[data-lang="${lang}"]`); await p.waitForTimeout(250);
+   const lbl=await p.evaluate(()=>document.querySelector('.mode-btn[data-mode="launch"]')?.textContent?.trim());
+   check(`${lang}: tab label`, lbl&&lbl.length>3, lbl);
+ }
+
+ check("copy button present", await p.evaluate(()=>!!document.querySelector("#launch-copy-llama")));
+
+ log(`\n=== ${pass} passed, ${fail} failed · JS errors: ${errors.length} ===`);
+ errors.slice(0,10).forEach(e=>log(e));
+ await b.close();
+ process.exit(fail>0?1:0);
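Reviewer note: the assertions above pin down which flags the generated llama.cpp command must contain (`-fa`, `-ctk q8_0` for a quantized cache, `--no-mmap` on full offload) without showing how a plan maps to them. A minimal sketch of one mapping that would satisfy those checks follows. `buildLlamaServerArgs` and its `modelPath` parameter are hypothetical and not taken from `launchCommands()` in `js/launch_flags.js`; flag spellings follow the test assertions and should be checked against the llama.cpp build in use, since some builds expect a value after `-fa`.

```javascript
// Hypothetical sketch; the real launchCommands() is not shown in this diff.
function buildLlamaServerArgs({ modelPath, ngl, ctx, flashAttn, cacheType, allOnGpu }) {
  const args = ["llama-server", "-m", modelPath, "-ngl", String(ngl), "-c", String(ctx)];
  // Flash attention is opt-in, matching the #launch-fa checkbox.
  if (flashAttn) args.push("-fa");
  // Only emit cache-type flags when the cache is quantized ("fp16" is the app's default label).
  if (cacheType && cacheType !== "fp16") args.push("-ctk", cacheType, "-ctv", cacheType);
  // Assumption from the test name: --no-mmap is added only when every layer is on the GPU.
  if (allOnGpu) args.push("--no-mmap");
  return args.join(" ");
}

console.log(buildLlamaServerArgs({
  modelPath: "model.gguf", ngl: 28, ctx: 32768,
  flashAttn: true, cacheType: "q8_0", allOnGpu: true,
}));
// llama-server -m model.gguf -ngl 28 -c 32768 -fa -ctk q8_0 -ctv q8_0 --no-mmap
```

Building the command as an array and joining once keeps each flag decision on its own line, so a test like the ones above can be matched to a single branch.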