v0.9.4: Launch-Flag Generator mode + Zenodo record update
Launch-Flag Generator: model + GPU + context → the exact llama.cpp / Ollama
launch command, the question the VRAM calculators don't answer (they say
"fits", not "here's the command").
- js/launch_flags.js: VRAM model (weights from bits/param via exact decoder
param count — attention+SwiGLU+embeddings with GQA, not the 12·h² shortcut
that undercounts large-FFN models like Qwen2.5-7B; KV from head geometry;
coarse scratch). Computes -ngl layer offload, fit verdict, and the TAF
horizon check: warns when target context is past d_horizon (KV memory
wasted). launchCommands() emits llama-server + Ollama snippets with -c, -fa,
-ctk/-ctv, --no-mmap (Blackwell OOM fix).
- index.html: tab + tile + #launch-section (GPU presets, quant, cache, FA) +
help v0.9.4. main.js: import, wiring, autocomplete auto-fetch, render.
- i18n.js: full EN/ES/FR/ZH.
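
The VRAM model described in the js/launch_flags.js bullet can be sketched roughly as follows. This is not the code from the commit: every function name, the geometry field names, and the Qwen2.5-7B-like numbers are illustrative assumptions, and the horizon value is taken as an input because the d_horizon formula is not shown here.

```javascript
// Sketch only: names and formulas are illustrative assumptions,
// not the contents of js/launch_flags.js.

// Parameter count for a Llama-style decoder (GQA attention + SwiGLU MLP).
function decoderParams(g) {
  const qOut = g.nHeads * g.headDim;     // width of the Q and O projections
  const kvOut = g.nKvHeads * g.headDim;  // width of the K and V projections (GQA)
  const attn = 2 * g.hidden * qOut + 2 * g.hidden * kvOut; // Q, O, K, V
  const mlp = 3 * g.hidden * g.ffn;      // gate, up and down projections
  const embed = g.vocab * g.hidden * (g.tiedEmbeddings ? 1 : 2);
  return g.nLayers * (attn + mlp) + embed; // norms and biases are ignored
}

// The 12*h^2-per-layer shortcut, kept only for comparison.
function shortcutParams(g) {
  return 12 * g.hidden * g.hidden * g.nLayers;
}

// Weight size from an average bits-per-parameter figure for the quant.
function weightBytes(params, bitsPerParam) {
  return (params * bitsPerParam) / 8;
}

// KV cache size: K and V, per layer, per KV head, per token.
function kvBytes(g, ctx, bytesPerElem) {
  return 2 * g.nLayers * g.nKvHeads * g.headDim * ctx * bytesPerElem;
}

// KV memory allocated for tokens beyond a given attention horizon.
function wastedKvBytes(g, ctx, dHorizon, bytesPerElem) {
  return kvBytes(g, Math.max(0, ctx - dHorizon), bytesPerElem);
}

// Assumed geometry, close to a Qwen2.5-7B style config (illustrative).
const geom = {
  hidden: 3584, ffn: 18944, nLayers: 28, nHeads: 28, nKvHeads: 4,
  headDim: 128, vocab: 152064, tiedEmbeddings: false,
};

console.log("exact params:   ", decoderParams(geom));
console.log("12*h^2 shortcut:", shortcutParams(geom));
console.log("fp16 KV @ 32K:  ", kvBytes(geom, 32768, 2), "bytes");
```

With those assumed values the explicit count lands near 7.6 billion parameters while the shortcut gives about 4.3 billion, which is the kind of undercount on large-FFN models that the commit message refers to.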
Also: updated the paper Zenodo link 19826343 → 20314038 across the app
(index.html, i18n.js 4 langs) and tracked docs/README citations.
Test (test_launch.mjs): 21/21 — fetch geometry, FITS/PARTIAL verdicts,
--no-mmap on full offload, -ctk on cache quant, beyond-trained warning, 4
languages. 25 modes total, 0 JS errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
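
The behaviours listed in the test summary (FITS/PARTIAL verdicts, --no-mmap only on full offload, -ctk/-ctv only when the cache is quantized) could be implemented along these lines. This is a hedged sketch, not the launchCommands() from the commit: planOffload, buildLlamaServerCommand, the even per-layer split of the weights, and the model path are all assumptions made for illustration. Only the llama-server variant is shown, and the flag spellings follow the commit message (some llama.cpp builds expect a value after -fa).

```javascript
// Sketch only: hypothetical helpers, not the commit's launchCommands().

// Decide how many layers fit on the GPU. All sizes share one unit (e.g. GB).
// Assumes weights are spread evenly across layers, which is a simplification.
function planOffload({ weights, kv, scratch, vram, nLayers }) {
  const budget = vram - kv - scratch;       // what is left for weights
  const perLayer = weights / nLayers;
  const raw = Math.floor(budget / perLayer);
  const ngl = Math.max(0, Math.min(nLayers, raw));
  const verdict = ngl >= nLayers ? "FITS" : ngl > 0 ? "PARTIAL" : "TOO_BIG";
  return { ngl, verdict };
}

// Assemble a llama-server command line from the plan.
function buildLlamaServerCommand({ modelPath, ngl, ctx, flashAttn, cacheType, fullOffload }) {
  const parts = ["llama-server", "-m", modelPath, "-ngl", String(ngl), "-c", String(ctx)];
  if (flashAttn) parts.push("-fa");
  if (cacheType && cacheType !== "fp16") {
    parts.push("-ctk", cacheType, "-ctv", cacheType); // quantized KV cache
  }
  if (fullOffload) parts.push("--no-mmap");  // only when every layer is on GPU
  return parts.join(" ");
}

// Example: a 40 GB model with 80 layers on a 24 GB card (made-up numbers).
const plan = planOffload({ weights: 40, kv: 2, scratch: 1, vram: 24, nLayers: 80 });
console.log(plan.verdict, plan.ngl);
console.log(buildLlamaServerCommand({
  modelPath: "model.gguf", ngl: plan.ngl, ctx: 32768,
  flashAttn: true, cacheType: "fp16", fullOffload: plan.verdict === "FITS",
}));
```

Keeping the fit decision separate from the string assembly makes each rule testable on its own, which matches the shape of the checks listed in the test summary.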
- README.md +2 -2
- docs/hf-post-v053-fix.md +1 -1
- hf-post-announcement.md +1 -1
- hf-space-readme.md +1 -1
- index.html +69 -1
- js/i18n.js +168 -4
- js/launch_flags.js +170 -0
- js/main.js +126 -1
- registry-bootstrap/README.md +1 -1
- test_launch.mjs +77 -0
|
@@ -46,7 +46,7 @@ language:
|
|
| 46 |
|
| 47 |
**🌐 Live**: https://karlesmarin.github.io/tafagent · HF Space: https://huggingface.co/spaces/karlexmarin/taf-agent
|
| 48 |
**📦 Source**: https://github.com/karlesmarin/tafagent · Lean repo: https://github.com/karlesmarin/lean-taf
|
| 49 |
-
**📄 Paper**: [Predicting How Transformers Attend — Marin 2026](https://zenodo.org/records/
|
| 50 |
**🗂️ Dataset**: [taf-attention-decay (58 measurements, 32 models)](https://huggingface.co/datasets/karlexmarin/taf-attention-decay)
|
| 51 |
|
| 52 |
---
|
|
@@ -413,7 +413,7 @@ If this tool helps you — paper or code:
|
|
| 413 |
Analytic Power-Law Theory, Phase Transitions, and Practical Compression
|
| 414 |
Tools},
|
| 415 |
year = {2026},
|
| 416 |
-
url = {https://zenodo.org/records/
|
| 417 |
}
|
| 418 |
|
| 419 |
@misc{marin2026tafagent,
|
|
|
|
| 46 |
|
| 47 |
**🌐 Live**: https://karlesmarin.github.io/tafagent · HF Space: https://huggingface.co/spaces/karlexmarin/taf-agent
|
| 48 |
**📦 Source**: https://github.com/karlesmarin/tafagent · Lean repo: https://github.com/karlesmarin/lean-taf
|
| 49 |
+
**📄 Paper**: [Predicting How Transformers Attend — Marin 2026](https://zenodo.org/records/20314038)
|
| 50 |
**🗂️ Dataset**: [taf-attention-decay (58 measurements, 32 models)](https://huggingface.co/datasets/karlexmarin/taf-attention-decay)
|
| 51 |
|
| 52 |
---
|
|
|
|
| 413 |
Analytic Power-Law Theory, Phase Transitions, and Practical Compression
|
| 414 |
Tools},
|
| 415 |
year = {2026},
|
| 416 |
+
url = {https://zenodo.org/records/20314038},
|
| 417 |
}
|
| 418 |
|
| 419 |
@misc{marin2026tafagent,
|
|
@@ -156,5 +156,5 @@ If you spot anything else wrong — please open an issue.
|
|
| 156 |
**Links**:
|
| 157 |
- Live: https://huggingface.co/spaces/karlexmarin/taf-agent
|
| 158 |
- Source: https://github.com/karlesmarin/tafagent
|
| 159 |
-
- Paper: https://zenodo.org/records/
|
| 160 |
- Dataset: https://huggingface.co/datasets/karlexmarin/taf-attention-decay
|
|
|
|
| 156 |
**Links**:
|
| 157 |
- Live: https://huggingface.co/spaces/karlexmarin/taf-agent
|
| 158 |
- Source: https://github.com/karlesmarin/tafagent
|
| 159 |
+
- Paper: https://zenodo.org/records/20314038
|
| 160 |
- Dataset: https://huggingface.co/datasets/karlexmarin/taf-attention-decay
|
|
@@ -5,7 +5,7 @@ No server, no auth, no cost. Runs entirely in your browser.
|
|
| 5 |
|
| 6 |
🌐 **Try it**: https://huggingface.co/spaces/karlexmarin/taf-agent
|
| 7 |
📦 **Source**: https://github.com/karlesmarin/tafagent
|
| 8 |
-
📄 **Paper**: [Predicting How Transformers Attend](https://zenodo.org/records/
|
| 9 |
|
| 10 |
## What it answers
|
| 11 |
|
|
|
|
| 5 |
|
| 6 |
🌐 **Try it**: https://huggingface.co/spaces/karlexmarin/taf-agent
|
| 7 |
📦 **Source**: https://github.com/karlesmarin/tafagent
|
| 8 |
+
📄 **Paper**: [Predicting How Transformers Attend](https://zenodo.org/records/20314038)
|
| 9 |
|
| 10 |
## What it answers
|
| 11 |
|
|
@@ -66,7 +66,7 @@ Predicts practical viability of any transformer LLM from its config alone:
|
|
| 66 |
|
| 67 |
## Underlying paper
|
| 68 |
|
| 69 |
-
[Marin 2026 — Predicting How Transformers Attend](https://zenodo.org/records/
|
| 70 |
|
| 71 |
## Source
|
| 72 |
|
|
|
|
| 66 |
|
| 67 |
## Underlying paper
|
| 68 |
|
| 69 |
+
[Marin 2026 — Predicting How Transformers Attend](https://zenodo.org/records/20314038)
|
| 70 |
|
| 71 |
## Source
|
| 72 |
|
|
@@ -249,6 +249,9 @@
|
|
| 249 |
<p><strong data-i18n="help.v091.gguf.title">🧊 GGUF Validity Bridge</strong></p>
|
| 250 |
<p data-i18n="help.v091.gguf.body">The dozen GGUF/VRAM calculators (NyxKrage, oobabooga, …) read a <code>.gguf</code> header to tell you if a quant <em>fits in your GPU</em>. This reads the same header — via HTTP Range, so no multi-GB download — and answers the question they skip: <em>does it fit AND still work?</em> Paste a GGUF repo, pick a quant file; the bridge pulls <code>rope_theta</code>, <code>context_length</code>, the quant scheme (from <code>general.file_type</code> or the filename), and head geometry, then runs TAF's γ_Padé / d_horizon plus the architecture-aware quant-regime γ-shift. Output: effective attention horizon at the trained context, how far the quant erodes γ (and ΔPPL) for <em>this</em> model, and a verdict — HEALTHY / USABLE-WITH-CARE / DEGRADES. <em>Use case</em>: 'unsloth/Qwen3.5-9B-GGUF Q4_K_M fits 8GB — but is it brain-dead past 30K?' → see the horizon and the Q4 γ-penalty before you download 6 GB.</p>
|
| 251 |
|
| 252 |
<h3 data-i18n="help.audit.title">The audit chain</h3>
|
| 253 |
<p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
|
| 254 |
output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
|
|
@@ -282,7 +285,7 @@
|
|
| 282 |
|
| 283 |
<h3 data-i18n="help.source.title">Source & paper</h3>
|
| 284 |
<p data-i18n="help.source.body">Source code: <a href="https://github.com/karlesmarin/tafagent" target="_blank">github.com/karlesmarin/tafagent</a><br>
|
| 285 |
-
Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href="https://zenodo.org/records/
|
| 286 |
Dataset: <a href="https://huggingface.co/datasets/karlexmarin/taf-attention-decay" target="_blank">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)</p>
|
| 287 |
</div>
|
| 288 |
</div>
|
|
@@ -412,6 +415,7 @@
|
|
| 412 |
<button data-mode-link="quant" data-i18n="modes.quant">⚖️ Quant</button>
|
| 413 |
<button data-mode-link="yarn" data-i18n="modes.yarn">🧵 YaRN Planner</button>
|
| 414 |
<button data-mode-link="gguf" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
|
| 415 |
<button data-mode-link="inspector" data-i18n="modes.inspector">🔍 Inspect config</button>
|
| 416 |
</div>
|
| 417 |
</div>
|
|
@@ -508,6 +512,7 @@
|
|
| 508 |
<button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
|
| 509 |
<button class="mode-btn" data-mode="yarn" role="tab" aria-selected="false" data-i18n="modes.yarn">🧵 YaRN Planner</button>
|
| 510 |
<button class="mode-btn" data-mode="gguf" role="tab" aria-selected="false" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
|
| 511 |
</div>
|
| 512 |
<p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
|
| 513 |
<strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
|
|
@@ -1333,6 +1338,69 @@
|
|
| 1333 |
<div id="gguf-output" style="display:none; margin-top:1em;"></div>
|
| 1334 |
</section>
|
| 1335 |
|
| 1336 |
<!-- Recipe selector (mode=recipe) -->
|
| 1337 |
<section id="recipe-section" style="display:none;">
|
| 1338 |
<h2 data-i18n="recipe.title">📋 Recipe</h2>
|
|
|
|
| 249 |
<p><strong data-i18n="help.v091.gguf.title">🧊 GGUF Validity Bridge</strong></p>
|
| 250 |
<p data-i18n="help.v091.gguf.body">The dozen GGUF/VRAM calculators (NyxKrage, oobabooga, …) read a <code>.gguf</code> header to tell you if a quant <em>fits in your GPU</em>. This reads the same header — via HTTP Range, so no multi-GB download — and answers the question they skip: <em>does it fit AND still work?</em> Paste a GGUF repo, pick a quant file; the bridge pulls <code>rope_theta</code>, <code>context_length</code>, the quant scheme (from <code>general.file_type</code> or the filename), and head geometry, then runs TAF's γ_Padé / d_horizon plus the architecture-aware quant-regime γ-shift. Output: effective attention horizon at the trained context, how far the quant erodes γ (and ΔPPL) for <em>this</em> model, and a verdict — HEALTHY / USABLE-WITH-CARE / DEGRADES. <em>Use case</em>: 'unsloth/Qwen3.5-9B-GGUF Q4_K_M fits 8GB — but is it brain-dead past 30K?' → see the horizon and the Q4 γ-penalty before you download 6 GB.</p>
|
| 251 |
|
| 252 |
+
<p><strong data-i18n="help.v094.launch.title">🚀 Launch-Flag Generator</strong></p>
|
| 253 |
+
<p data-i18n="help.v094.launch.body">The VRAM calculators tell you <em>whether</em> a model fits; they don't hand you the command. This does. Pick a model (fetches geometry from HF <code>config.json</code>), a quant, a GPU and a target context — it computes the VRAM breakdown (weights + KV cache + scratch), how many layers to offload (<code>-ngl</code>), and emits the copy-paste <code>llama-server</code> and Ollama commands with <code>-c</code> context, <code>-fa</code> flash-attention, KV-cache type, and <code>--no-mmap</code> (the Blackwell OOM fix: force all weights into physical VRAM). Plus the TAF reality check no calculator gives: if you're allocating KV for a context past the model's d_horizon, it warns you that memory is wasted — the attention won't reach there. <em>Use case</em>: 'What <code>-ngl</code> for Llama-70B-Q4 on my 4090?' → 39 of 80 layers, exact command, and a note if your context is past the usable horizon.</p>
|
| 254 |
+
|
| 255 |
<h3 data-i18n="help.audit.title">The audit chain</h3>
|
| 256 |
<p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
|
| 257 |
output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
|
|
|
|
| 285 |
|
| 286 |
<h3 data-i18n="help.source.title">Source & paper</h3>
|
| 287 |
<p data-i18n="help.source.body">Source code: <a href="https://github.com/karlesmarin/tafagent" target="_blank">github.com/karlesmarin/tafagent</a><br>
|
| 288 |
+
Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href="https://zenodo.org/records/20314038" target="_blank">Zenodo</a>; arXiv forthcoming)<br>
|
| 289 |
Dataset: <a href="https://huggingface.co/datasets/karlexmarin/taf-attention-decay" target="_blank">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)</p>
|
| 290 |
</div>
|
| 291 |
</div>
|
|
|
|
| 415 |
<button data-mode-link="quant" data-i18n="modes.quant">⚖️ Quant</button>
|
| 416 |
<button data-mode-link="yarn" data-i18n="modes.yarn">🧵 YaRN Planner</button>
|
| 417 |
<button data-mode-link="gguf" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
|
| 418 |
+
<button data-mode-link="launch" data-i18n="modes.launch">🚀 Launch Flags</button>
|
| 419 |
<button data-mode-link="inspector" data-i18n="modes.inspector">🔍 Inspect config</button>
|
| 420 |
</div>
|
| 421 |
</div>
|
|
|
|
| 512 |
<button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
|
| 513 |
<button class="mode-btn" data-mode="yarn" role="tab" aria-selected="false" data-i18n="modes.yarn">🧵 YaRN Planner</button>
|
| 514 |
<button class="mode-btn" data-mode="gguf" role="tab" aria-selected="false" data-i18n="modes.gguf">🧊 GGUF Bridge</button>
|
| 515 |
+
<button class="mode-btn" data-mode="launch" role="tab" aria-selected="false" data-i18n="modes.launch">🚀 Launch Flags</button>
|
| 516 |
</div>
|
| 517 |
<p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
|
| 518 |
<strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
|
|
|
|
| 1338 |
<div id="gguf-output" style="display:none; margin-top:1em;"></div>
|
| 1339 |
</section>
|
| 1340 |
|
| 1341 |
+
<!-- Launch-flag generator (mode=launch) -->
|
| 1342 |
+
<section id="launch-section" style="display:none;">
|
| 1343 |
+
<h2><span data-i18n="launch.title">🚀 Launch-Flag Generator</span>
|
| 1344 |
+
<span class="info"><span class="tooltip" data-i18n="launch.tip">
|
| 1345 |
+
<strong>Exact flags + why, not just "fits"</strong>. The VRAM calculators tell you whether a
|
| 1346 |
+
model fits. This gives you the copy-paste <code>llama.cpp</code> / <code>Ollama</code> command —
|
| 1347 |
+
<code>-ngl</code> layers to offload, <code>-c</code> context, <code>--no-mmap</code>,
|
| 1348 |
+
KV-cache type — AND the TAF reality check: if you allocate KV for 128K but the model's
|
| 1349 |
+
attention horizon is 32K, that VRAM is wasted.
|
| 1350 |
+
</span></span>
|
| 1351 |
+
</h2>
|
| 1352 |
+
<p class="recipe-desc" data-i18n="launch.desc">
|
| 1353 |
+
Pick a model, GPU and target context → get the exact launch command, a VRAM breakdown
|
| 1354 |
+
(weights + KV cache + overhead), and how many layers to offload. Solves the recurring
|
| 1355 |
+
"what <code>-ngl</code> do I use?" / Blackwell OOM guesswork.
|
| 1356 |
+
</p>
|
| 1357 |
+
|
| 1358 |
+
<div class="form-row">
|
| 1359 |
+
<label for="launch-model" data-i18n="launch.model_label">HF model id:</label>
|
| 1360 |
+
<input type="text" id="launch-model" placeholder="Qwen/Qwen2.5-7B-Instruct">
|
| 1361 |
+
<button id="launch-fetch-btn" class="secondary" data-i18n="launch.fetch_btn">📥 Fetch geometry</button>
|
| 1362 |
+
</div>
|
| 1363 |
+
<span id="launch-status" class="subtle"></span>
|
| 1364 |
+
|
| 1365 |
+
<div class="form-row">
|
| 1366 |
+
<label for="launch-quant" data-i18n="launch.quant_label">Quant:</label>
|
| 1367 |
+
<select id="launch-quant">
|
| 1368 |
+
<option value="Q4_K_M">Q4_K_M (4-bit, sweet spot)</option>
|
| 1369 |
+
<option value="Q8_0">Q8_0 (8-bit)</option>
|
| 1370 |
+
<option value="Q6_K">Q6_K</option>
|
| 1371 |
+
<option value="Q5_K_M">Q5_K_M</option>
|
| 1372 |
+
<option value="Q4_0">Q4_0</option>
|
| 1373 |
+
<option value="Q3_K_M">Q3_K_M</option>
|
| 1374 |
+
<option value="Q2_K">Q2_K (extreme)</option>
|
| 1375 |
+
<option value="F16">F16 (full)</option>
|
| 1376 |
+
</select>
|
| 1377 |
+
</div>
|
| 1378 |
+
<div class="form-row">
|
| 1379 |
+
<label for="launch-gpu" data-i18n="launch.gpu_label">GPU:</label>
|
| 1380 |
+
<select id="launch-gpu"></select>
|
| 1381 |
+
<input type="number" id="launch-vram" placeholder="or custom VRAM (GB)" min="1" style="width:11em;">
|
| 1382 |
+
</div>
|
| 1383 |
+
<div class="form-row">
|
| 1384 |
+
<label for="launch-ctx" data-i18n="launch.ctx_label">Target context L:</label>
|
| 1385 |
+
<input type="number" id="launch-ctx" placeholder="32768" min="256">
|
| 1386 |
+
</div>
|
| 1387 |
+
<div class="form-row">
|
| 1388 |
+
<label data-i18n="launch.adv_label">Advanced:</label>
|
| 1389 |
+
<span>
|
| 1390 |
+
<label data-i18n="launch.cache_label">KV cache:</label>
|
| 1391 |
+
<select id="launch-cache">
|
| 1392 |
+
<option value="fp16">fp16</option>
|
| 1393 |
+
<option value="q8_0">q8_0 (½ KV)</option>
|
| 1394 |
+
<option value="q4_0">q4_0 (¼ KV)</option>
|
| 1395 |
+
</select>
|
| 1396 |
+
|
| 1397 |
+
<label><input type="checkbox" id="launch-fa" checked> <span data-i18n="launch.fa_label">Flash attention (-fa)</span></label>
|
| 1398 |
+
</span>
|
| 1399 |
+
</div>
|
| 1400 |
+
<button id="launch-gen-btn" data-i18n="launch.gen_btn">🚀 Generate flags</button>
|
| 1401 |
+
<div id="launch-output" style="display:none; margin-top:1em;"></div>
|
| 1402 |
+
</section>
|
| 1403 |
+
|
| 1404 |
<!-- Recipe selector (mode=recipe) -->
|
| 1405 |
<section id="recipe-section" style="display:none;">
|
| 1406 |
<h2 data-i18n="recipe.title">📋 Recipe</h2>
|
|
@@ -429,6 +429,47 @@ export const TRANSLATIONS = {
|
|
| 429 |
"mode_desc.yarn": "Generate the exact rope_scaling config to extend a model past its trained context — plus a TAF verdict on whether attention quality actually holds at the target length.",
|
| 430 |
"modes.gguf": "🧊 GGUF Bridge",
|
| 431 |
"mode_desc.gguf": "Read a GGUF file's metadata header (rope_theta, context_length, quant) in your browser and get a TAF quality verdict — the question the VRAM calculators skip: fits AND works?",
|
| 432 |
"gguf.title": "🧊 GGUF Validity Bridge",
|
| 433 |
"gguf.tip": "<strong>Fits in VRAM ≠ works</strong>. The GGUF/VRAM calculators read a model's metadata to tell you if a quant <em>fits in your GPU</em>. This reads the SAME metadata (rope_theta, context_length, quant scheme, head geometry) straight from the <code>.gguf</code> header via HTTP Range — no multi-GB download — and answers the question they don't: does attention quality actually hold, and how much does the quant erode it (γ-shift, ΔPPL)?",
|
| 434 |
"gguf.desc": "Paste a GGUF repo (e.g. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), pick a quant file, and get a TAF quality verdict: the model's effective attention horizon, plus how much the chosen quantization shifts γ for <em>this specific architecture</em>. Reads only the file header in your browser.",
|
|
@@ -1059,7 +1100,7 @@ export const TRANSLATIONS = {
|
|
| 1059 |
"help.privacy.title": "Privacy",
|
| 1060 |
"help.privacy.body": "Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.",
|
| 1061 |
"help.source.title": "Source & paper",
|
| 1062 |
-
"help.source.body": "Source code: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/
|
| 1063 |
|
| 1064 |
"footer.text": "© 2026 Carles Marin · Apache-2.0 · independent research · the tool that closes the loop of the paper.",
|
| 1065 |
|
|
@@ -1778,6 +1819,47 @@ export const TRANSLATIONS = {
|
|
| 1778 |
"mode_desc.yarn": "Genera la configuración rope_scaling exacta para extender un modelo más allá de su contexto entrenado — más un veredicto TAF sobre si la calidad de atención aguanta realmente a la longitud objetivo.",
|
| 1779 |
"modes.gguf": "🧊 Puente GGUF",
|
| 1780 |
"mode_desc.gguf": "Lee la cabecera de metadata de un archivo GGUF (rope_theta, context_length, quant) en tu navegador y obtén un veredicto de calidad TAF — la pregunta que los calculadores de VRAM ignoran: ¿cabe Y funciona?",
|
| 1781 |
"gguf.title": "🧊 Puente de validez GGUF",
|
| 1782 |
"gguf.tip": "<strong>Caber en VRAM ≠ funcionar</strong>. Los calculadores GGUF/VRAM leen la metadata de un modelo para decirte si un quant <em>cabe en tu GPU</em>. Esto lee la MISMA metadata (rope_theta, context_length, esquema de quant, geometría de cabezas) directamente de la cabecera <code>.gguf</code> vía HTTP Range — sin descargar GB — y responde lo que ellos no: ¿aguanta de verdad la calidad de atención, y cuánto la erosiona el quant (γ-shift, ΔPPL)?",
|
| 1783 |
"gguf.desc": "Pega un repo GGUF (p.ej. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), elige un archivo de quant, y obtén un veredicto de calidad TAF: el horizonte de atención efectivo del modelo, más cuánto desplaza γ la cuantización elegida para <em>esta arquitectura concreta</em>. Solo lee la cabecera del archivo en tu navegador.",
|
|
@@ -2408,7 +2490,7 @@ export const TRANSLATIONS = {
|
|
| 2408 |
"help.privacy.title": "Privacidad",
|
| 2409 |
"help.privacy.body": "Todo corre en tu navegador. Sin telemetría, sin analytics, sin datos enviados a ningún sitio. Incluso el modelo LLM corre localmente vía WebGPU/WebAssembly. Tus model_ids y preguntas nunca abandonan esta página.",
|
| 2410 |
"help.source.title": "Código fuente y paper",
|
| 2411 |
-
"help.source.body": "Código: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/
|
| 2412 |
|
| 2413 |
"footer.text": "© 2026 Carles Marin · Apache-2.0 · investigación independiente · la herramienta que cierra el círculo del paper.",
|
| 2414 |
},
|
|
@@ -2981,6 +3063,47 @@ export const TRANSLATIONS = {
|
|
| 2981 |
"mode_desc.yarn": "Génère la configuration rope_scaling exacte pour étendre un modèle au-delà de son contexte d'entraînement — plus un verdict TAF sur la tenue réelle de la qualité d'attention à la longueur cible.",
|
| 2982 |
"modes.gguf": "🧊 Pont GGUF",
|
| 2983 |
"mode_desc.gguf": "Lit l'en-tête de métadonnées d'un fichier GGUF (rope_theta, context_length, quant) dans votre navigateur et donne un verdict de qualité TAF — la question que les calculateurs de VRAM ignorent : tient ET fonctionne ?",
|
| 2984 |
"gguf.title": "🧊 Pont de validité GGUF",
|
| 2985 |
"gguf.tip": "<strong>Tenir dans la VRAM ≠ fonctionner</strong>. Les calculateurs GGUF/VRAM lisent les métadonnées d'un modèle pour dire si un quant <em>tient dans le GPU</em>. Ceci lit les MÊMES métadonnées (rope_theta, context_length, schéma de quant, géométrie des têtes) directement depuis l'en-tête <code>.gguf</code> via HTTP Range — sans télécharger des Go — et répond à ce qu'ils n'abordent pas : la qualité d'attention tient-elle vraiment, et de combien le quant l'érode-t-il (γ-shift, ΔPPL) ?",
|
| 2986 |
"gguf.desc": "Collez un dépôt GGUF (ex. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), choisissez un fichier de quant, et obtenez un verdict de qualité TAF : l'horizon d'attention effectif du modèle, plus de combien la quantification choisie décale γ pour <em>cette architecture précise</em>. Ne lit que l'en-tête du fichier dans votre navigateur.",
|
|
@@ -3611,7 +3734,7 @@ export const TRANSLATIONS = {
|
|
| 3611 |
"help.privacy.title": "Confidentialité",
|
| 3612 |
"help.privacy.body": "Tout s'exécute dans votre navigateur. Pas de télémétrie, pas d'analytique, pas de données envoyées ailleurs. Même le modèle LLM s'exécute localement via WebGPU/WebAssembly. Vos model_ids et questions ne quittent jamais cette page.",
|
| 3613 |
"help.source.title": "Code source et paper",
|
| 3614 |
-
"help.source.body": "Code : <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper : <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/
|
| 3615 |
|
| 3616 |
"footer.text": "© 2026 Carles Marin · Apache-2.0 · recherche indépendante · l'outil qui ferme la boucle du paper.",
|
| 3617 |
},
|
|
@@ -4184,6 +4307,47 @@ export const TRANSLATIONS = {
|
|
| 4184 |
"mode_desc.yarn": "生成精确的 rope_scaling 配置以将模型扩展到训练上下文之外 —— 外加 TAF 裁决:在目标长度下注意力质量是否真的撑得住。",
|
| 4185 |
"modes.gguf": "🧊 GGUF 桥",
|
| 4186 |
"mode_desc.gguf": "在浏览器内读取 GGUF 文件的元数据头(rope_theta、context_length、量化),给出 TAF 质量裁决 —— 显存计算器跳过的那个问题:塞得进且还能用吗?",
|
| 4187 |
"gguf.title": "🧊 GGUF 有效性桥",
|
| 4188 |
"gguf.tip": "<strong>塞进显存 ≠ 能用</strong>。GGUF/显存计算器读取模型元数据来告诉你某量化<em>是否塞得进 GPU</em>。本工具通过 HTTP Range 直接从 <code>.gguf</code> 头读取同样的元数据(rope_theta、context_length、量化方案、注意力头几何)—— 无需下载数 GB —— 并回答它们不答的:注意力质量是否真的撑得住,量化又侵蚀了多少(γ-shift、ΔPPL)?",
|
| 4189 |
"gguf.desc": "粘贴一个 GGUF 仓库(如 <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>),选择一个量化文件,获得 TAF 质量裁决:模型的有效注意力视界,以及所选量化对<em>这个具体架构</em>的 γ 位移有多大。只在浏览器内读取文件头。",
|
|
@@ -4814,7 +4978,7 @@ export const TRANSLATIONS = {
|
|
| 4814 |
"help.privacy.title": "隐私",
|
| 4815 |
"help.privacy.body": "一切都在您的浏览器中运行。无遥测,无分析,无数据发送到任何地方。即使是 LLM 模型也通过 WebGPU/WebAssembly 在本地运行。您的 model_ids 和问题永不离开此页面。",
|
| 4816 |
"help.source.title": "源代码和论文",
|
| 4817 |
-
"help.source.body": "源代码: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>论文: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/
|
| 4818 |
|
| 4819 |
"footer.text": "© 2026 Carles Marin · Apache-2.0 · 独立研究 · 闭合论文回路的工具。",
|
| 4820 |
},
|
|
|
|
| 429 |
"mode_desc.yarn": "Generate the exact rope_scaling config to extend a model past its trained context — plus a TAF verdict on whether attention quality actually holds at the target length.",
|
| 430 |
"modes.gguf": "🧊 GGUF Bridge",
|
| 431 |
"mode_desc.gguf": "Read a GGUF file's metadata header (rope_theta, context_length, quant) in your browser and get a TAF quality verdict — the question the VRAM calculators skip: fits AND works?",
|
| 432 |
+
"modes.launch": "🚀 Launch Flags",
|
| 433 |
+
"mode_desc.launch": "Model + GPU + context → the exact llama.cpp / Ollama launch command (-ngl, -c, --no-mmap, KV-cache type) with a VRAM breakdown and a TAF warning when your context is past the usable horizon.",
|
| 434 |
+
"launch.title": "🚀 Launch-Flag Generator",
|
| 435 |
+
"launch.tip": "<strong>Exact flags + why, not just \"fits\"</strong>. The VRAM calculators tell you whether a model fits. This gives you the copy-paste <code>llama.cpp</code> / <code>Ollama</code> command — <code>-ngl</code> layers to offload, <code>-c</code> context, <code>--no-mmap</code>, KV-cache type — AND the TAF reality check: if you allocate KV for 128K but the model's attention horizon is 32K, that VRAM is wasted.",
|
| 436 |
+
"launch.desc": "Pick a model, GPU and target context → get the exact launch command, a VRAM breakdown (weights + KV cache + overhead), and how many layers to offload. Solves the recurring \"what <code>-ngl</code> do I use?\" / Blackwell OOM guesswork.",
|
| 437 |
+
"launch.model_label": "HF model id:",
|
| 438 |
+
"launch.fetch_btn": "📥 Fetch geometry",
|
| 439 |
+
"launch.quant_label": "Quant:",
|
| 440 |
+
"launch.gpu_label": "GPU:",
|
| 441 |
+
"launch.ctx_label": "Target context L:",
|
| 442 |
+
"launch.adv_label": "Advanced:",
|
| 443 |
+
"launch.cache_label": "KV cache:",
|
| 444 |
+
"launch.fa_label": "Flash attention (-fa)",
|
| 445 |
+
"launch.gen_btn": "🚀 Generate flags",
|
| 446 |
+
"launch.need_id": "Enter a model id like 'Qwen/Qwen2.5-7B-Instruct'",
|
| 447 |
+
"launch.fetching": "Fetching config.json from HF Hub…",
|
| 448 |
+
"launch.layers": "layers",
|
| 449 |
+
"launch.fetched_hint": "Pick GPU + context, then Generate flags.",
|
| 450 |
+
"launch.need_fetch": "Fetch a model first (📥 Fetch geometry).",
|
| 451 |
+
"launch.verdict.fits": "FITS — fully on GPU",
|
| 452 |
+
"launch.verdict.partial": "PARTIAL — some layers on CPU (slower)",
|
| 453 |
+
"launch.verdict.too_big": "TOO BIG — won't fit any layers on this GPU",
|
| 454 |
+
"launch.r.weights": "Weights",
|
| 455 |
+
"launch.r.kv": "KV cache",
|
| 456 |
+
"launch.r.overhead": "Overhead / scratch",
|
| 457 |
+
"launch.r.total": "Total",
|
| 458 |
+
"launch.r.ngl": "Layers to offload (-ngl)",
|
| 459 |
+
"launch.r.all": "all",
|
| 460 |
+
"launch.r.note": "VRAM is an estimate (weights from bits/param, KV from head geometry, scratch coarse). d_horizon from γ_Padé. Verify the fit with a real load — leave ~1 GB headroom.",
|
| 461 |
+
"launch.warn.horizon_wasted": "Target context is well past the model's attention horizon — KV memory for context beyond it is wasted; the model won't attend there. (TAF)",
|
| 462 |
+
"launch.warn.beyond_trained": "L exceeds the trained context — you also need RoPE scaling to position-encode that far (see the YaRN Planner).",
|
| 463 |
+
"launch.warn.no_mmap": "All layers fit → added --no-mmap to force weights into physical VRAM (avoids the Blackwell illegal-memory / OOM-at-load issue).",
|
| 464 |
+
"launch.warn.partial": "Only some layers fit on GPU — the rest run on CPU (much slower). Drop to a smaller quant or shorter context to fit fully.",
|
| 465 |
+
"launch.warn.cpu_only": "Won't fit any layers at these settings — CPU only. Use a smaller quant/context or a bigger GPU.",
|
| 466 |
+
"launch.warn.no_params": "Couldn't read parameter count — weights size is a rough estimate from geometry.",
|
| 467 |
+
"launch.err.no_geom": "Fetch a model first to read its geometry.",
|
| 468 |
+
"launch.err.no_gpu": "Pick a GPU or enter a custom VRAM size.",
|
| 469 |
+
"launch.err.no_ctx": "Enter a target context length L.",
|
| 470 |
+
"launch.copy": "Copy command",
|
| 471 |
+
"help.v094.launch.title": "🚀 Launch-Flag Generator",
|
| 472 |
+
"help.v094.launch.body": "The VRAM calculators tell you <em>whether</em> a model fits; they don't hand you the command. This does. Pick a model (fetches geometry from HF <code>config.json</code>), a quant, a GPU and a target context — it computes the VRAM breakdown (weights + KV cache + scratch), how many layers to offload (<code>-ngl</code>), and emits the copy-paste <code>llama-server</code> and Ollama commands with <code>-c</code> context, <code>-fa</code> flash-attention, KV-cache type, and <code>--no-mmap</code> (the Blackwell OOM fix). Plus the TAF reality check no calculator gives: if you're allocating KV for a context past the model's d_horizon, it warns you that memory is wasted. <em>Use case</em>: 'What <code>-ngl</code> for Llama-70B-Q4 on my 4090?' → 39 of 80 layers, exact command, and a note if your context is past the usable horizon.",
|
| 473 |
"gguf.title": "🧊 GGUF Validity Bridge",
|
| 474 |
"gguf.tip": "<strong>Fits in VRAM ≠ works</strong>. The GGUF/VRAM calculators read a model's metadata to tell you if a quant <em>fits in your GPU</em>. This reads the SAME metadata (rope_theta, context_length, quant scheme, head geometry) straight from the <code>.gguf</code> header via HTTP Range — no multi-GB download — and answers the question they don't: does attention quality actually hold, and how much does the quant erode it (γ-shift, ΔPPL)?",
|
| 475 |
"gguf.desc": "Paste a GGUF repo (e.g. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), pick a quant file, and get a TAF quality verdict: the model's effective attention horizon, plus how much the chosen quantization shifts γ for <em>this specific architecture</em>. Reads only the file header in your browser.",
|
|
|
|
| 1100 |
"help.privacy.title": "Privacy",
|
| 1101 |
"help.privacy.body": "Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.",
|
| 1102 |
"help.source.title": "Source & paper",
|
| 1103 |
+
"help.source.body": "Source code: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a>; arXiv forthcoming)<br>Dataset: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)",
|
| 1104 |
|
| 1105 |
"footer.text": "© 2026 Carles Marin · Apache-2.0 · independent research · the tool that closes the loop of the paper.",
|
| 1106 |
|
|
|
|
| 1819 |
"mode_desc.yarn": "Genera la configuración rope_scaling exacta para extender un modelo más allá de su contexto entrenado — más un veredicto TAF sobre si la calidad de atención aguanta realmente a la longitud objetivo.",
|
| 1820 |
"modes.gguf": "🧊 Puente GGUF",
|
| 1821 |
"mode_desc.gguf": "Lee la cabecera de metadata de un archivo GGUF (rope_theta, context_length, quant) en tu navegador y obtén un veredicto de calidad TAF — la pregunta que los calculadores de VRAM ignoran: ¿cabe Y funciona?",
|
| 1822 |
+
"modes.launch": "🚀 Flags de arranque",
|
| 1823 |
+
"mode_desc.launch": "Modelo + GPU + contexto → el comando exacto de arranque llama.cpp / Ollama (-ngl, -c, --no-mmap, tipo de KV-cache) con desglose de VRAM y aviso TAF cuando tu contexto pasa el horizonte usable.",
|
| 1824 |
+
"launch.title": "🚀 Generador de flags de arranque",
|
| 1825 |
+
"launch.tip": "<strong>Flags exactos + por qué, no solo \"cabe\"</strong>. Los calculadores de VRAM te dicen si un modelo cabe. Esto te da el comando <code>llama.cpp</code> / <code>Ollama</code> para pegar — <code>-ngl</code> capas a offload, <code>-c</code> contexto, <code>--no-mmap</code>, tipo de KV-cache — Y el chequeo de realidad TAF: si reservas KV para 128K pero el horizonte de atención del modelo es 32K, esa VRAM se desperdicia.",
|
| 1826 |
+
"launch.desc": "Elige modelo, GPU y contexto objetivo → obtén el comando exacto, desglose de VRAM (pesos + KV cache + overhead), y cuántas capas hacer offload. Resuelve el típico \"¿qué <code>-ngl</code> uso?\" / OOM de Blackwell.",
|
| 1827 |
+
"launch.model_label": "ID del modelo HF:",
|
| 1828 |
+
"launch.fetch_btn": "📥 Obtener geometría",
|
| 1829 |
+
"launch.quant_label": "Quant:",
|
| 1830 |
+
"launch.gpu_label": "GPU:",
|
| 1831 |
+
"launch.ctx_label": "Contexto objetivo L:",
|
| 1832 |
+
"launch.adv_label": "Avanzado:",
|
| 1833 |
+
"launch.cache_label": "KV cache:",
|
| 1834 |
+
"launch.fa_label": "Flash attention (-fa)",
|
| 1835 |
+
"launch.gen_btn": "🚀 Generar flags",
|
| 1836 |
+
"launch.need_id": "Introduce un id de modelo como 'Qwen/Qwen2.5-7B-Instruct'",
|
| 1837 |
+
"launch.fetching": "Obteniendo config.json de HF Hub…",
|
| 1838 |
+
"launch.layers": "capas",
|
| 1839 |
+
"launch.fetched_hint": "Elige GPU + contexto, luego Generar flags.",
|
| 1840 |
+
"launch.need_fetch": "Obtén un modelo primero (📥 Obtener geometría).",
|
| 1841 |
+
"launch.verdict.fits": "CABE — todo en GPU",
|
| 1842 |
+
"launch.verdict.partial": "PARCIAL — algunas capas en CPU (más lento)",
|
| 1843 |
+
"launch.verdict.too_big": "DEMASIADO GRANDE — no cabe ninguna capa en esta GPU",
|
| 1844 |
+
"launch.r.weights": "Pesos",
|
| 1845 |
+
"launch.r.kv": "KV cache",
|
| 1846 |
+
"launch.r.overhead": "Overhead / scratch",
|
| 1847 |
+
"launch.r.total": "Total",
|
| 1848 |
+
"launch.r.ngl": "Capas a offload (-ngl)",
|
| 1849 |
+
"launch.r.all": "todas",
|
| 1850 |
+
"launch.r.note": "La VRAM es una estimación (pesos por bits/param, KV por geometría de cabezas, scratch aproximado). d_horizon desde γ_Padé. Verifica el ajuste con una carga real — deja ~1 GB de margen.",
|
| 1851 |
+
"launch.warn.horizon_wasted": "El contexto objetivo pasa bastante el horizonte de atención del modelo — la KV para contexto más allá se desperdicia; el modelo no atenderá ahí. (TAF)",
|
| 1852 |
+
"launch.warn.beyond_trained": "L supera el contexto entrenado — también necesitas RoPE scaling para codificar posiciones tan lejos (ver Planificador YaRN).",
|
| 1853 |
+
"launch.warn.no_mmap": "Todas las capas caben → añadido --no-mmap para forzar los pesos a VRAM física (evita el problema de illegal-memory / OOM-al-cargar de Blackwell).",
|
| 1854 |
+
"launch.warn.partial": "Solo caben algunas capas en GPU — el resto corre en CPU (mucho más lento). Baja a un quant menor o contexto más corto para que quepa entero.",
|
| 1855 |
+
"launch.warn.cpu_only": "No cabe ninguna capa con estos ajustes — solo CPU. Usa un quant/contexto menor o una GPU mayor.",
|
| 1856 |
+
"launch.warn.no_params": "No se pudo leer el nº de parámetros — el tamaño de pesos es una estimación aproximada por geometría.",
|
| 1857 |
+
"launch.err.no_geom": "Obtén un modelo primero para leer su geometría.",
|
| 1858 |
+
"launch.err.no_gpu": "Elige una GPU o introduce un tamaño de VRAM personalizado.",
|
| 1859 |
+
"launch.err.no_ctx": "Introduce una longitud de contexto objetivo L.",
|
| 1860 |
+
"launch.copy": "Copiar comando",
|
| 1861 |
+
"help.v094.launch.title": "🚀 Generador de flags de arranque",
|
| 1862 |
+
"help.v094.launch.body": "Los calculadores de VRAM te dicen <em>si</em> un modelo cabe; no te dan el comando. Esto sí. Elige un modelo (obtiene geometría del <code>config.json</code> de HF), un quant, una GPU y un contexto objetivo — calcula el desglose de VRAM (pesos + KV cache + scratch), cuántas capas hacer offload (<code>-ngl</code>), y emite los comandos para pegar de <code>llama-server</code> y Ollama con contexto <code>-c</code>, flash-attention <code>-fa</code>, tipo de KV-cache, y <code>--no-mmap</code> (el fix de OOM de Blackwell). Más el chequeo de realidad TAF que ningún calculador da: si reservas KV para un contexto más allá del d_horizon del modelo, te avisa de que esa memoria se desperdicia. <em>Caso de uso</em>: '¿Qué <code>-ngl</code> para Llama-70B-Q4 en mi 4090?' → 39 de 80 capas, comando exacto, y un aviso si tu contexto pasa el horizonte usable.",
|
| 1863 |
"gguf.title": "🧊 Puente de validez GGUF",
|
| 1864 |
"gguf.tip": "<strong>Caber en VRAM ≠ funcionar</strong>. Los calculadores GGUF/VRAM leen la metadata de un modelo para decirte si un quant <em>cabe en tu GPU</em>. Esto lee la MISMA metadata (rope_theta, context_length, esquema de quant, geometría de cabezas) directamente de la cabecera <code>.gguf</code> vía HTTP Range — sin descargar GB — y responde lo que ellos no: ¿aguanta de verdad la calidad de atención, y cuánto la erosiona el quant (γ-shift, ΔPPL)?",
|
| 1865 |
"gguf.desc": "Pega un repo GGUF (p.ej. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), elige un archivo de quant, y obtén un veredicto de calidad TAF: el horizonte de atención efectivo del modelo, más cuánto desplaza γ la cuantización elegida para <em>esta arquitectura concreta</em>. Solo lee la cabecera del archivo en tu navegador.",
|
|
|
|
| 2490 |
"help.privacy.title": "Privacidad",
|
| 2491 |
"help.privacy.body": "Todo corre en tu navegador. Sin telemetría, sin analytics, sin datos enviados a ningún sitio. Incluso el modelo LLM corre localmente vía WebGPU/WebAssembly. Tus model_ids y preguntas nunca abandonan esta página.",
|
| 2492 |
"help.source.title": "Código fuente y paper",
|
| 2493 |
+
"help.source.body": "Código: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a>; arXiv próximamente)<br>Dataset: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 mediciones γ sobre 32 modelos (CC-BY-4.0)",
|
| 2494 |
|
| 2495 |
"footer.text": "© 2026 Carles Marin · Apache-2.0 · investigación independiente · la herramienta que cierra el círculo del paper.",
|
| 2496 |
},
|
|
|
|
| 3063 |
"mode_desc.yarn": "Génère la configuration rope_scaling exacte pour étendre un modèle au-delà de son contexte d'entraînement — plus un verdict TAF sur la tenue réelle de la qualité d'attention à la longueur cible.",
|
| 3064 |
"modes.gguf": "🧊 Pont GGUF",
|
| 3065 |
"mode_desc.gguf": "Lit l'en-tête de métadonnées d'un fichier GGUF (rope_theta, context_length, quant) dans votre navigateur et donne un verdict de qualité TAF — la question que les calculateurs de VRAM ignorent : tient ET fonctionne ?",
|
| 3066 |
+
"modes.launch": "🚀 Flags de lancement",
|
| 3067 |
+
"mode_desc.launch": "Modèle + GPU + contexte → la commande exacte llama.cpp / Ollama (-ngl, -c, --no-mmap, type de KV-cache) avec ventilation VRAM et alerte TAF quand le contexte dépasse l'horizon utile.",
|
| 3068 |
+
"launch.title": "🚀 Générateur de flags de lancement",
|
| 3069 |
+
"launch.tip": "<strong>Flags exacts + pourquoi, pas juste \"tient\"</strong>. Les calculateurs de VRAM disent si un modèle tient. Ceci donne la commande <code>llama.cpp</code> / <code>Ollama</code> à coller — <code>-ngl</code> couches à décharger, <code>-c</code> contexte, <code>--no-mmap</code>, type de KV-cache — ET le contrôle de réalité TAF : si vous allouez du KV pour 128K mais que l'horizon d'attention du modèle est 32K, cette VRAM est gâchée.",
|
| 3070 |
+
"launch.desc": "Choisissez un modèle, un GPU et un contexte cible → obtenez la commande exacte, une ventilation VRAM (poids + KV cache + overhead), et combien de couches décharger. Résout le \"quel <code>-ngl</code> ?\" / OOM Blackwell récurrent.",
|
| 3071 |
+
"launch.model_label": "ID du modèle HF :",
|
| 3072 |
+
"launch.fetch_btn": "📥 Récupérer la géométrie",
|
| 3073 |
+
"launch.quant_label": "Quant :",
|
| 3074 |
+
"launch.gpu_label": "GPU :",
|
| 3075 |
+
"launch.ctx_label": "Contexte cible L :",
|
| 3076 |
+
"launch.adv_label": "Avancé :",
|
| 3077 |
+
"launch.cache_label": "KV cache :",
|
| 3078 |
+
"launch.fa_label": "Flash attention (-fa)",
|
| 3079 |
+
"launch.gen_btn": "🚀 Générer les flags",
|
| 3080 |
+
"launch.need_id": "Saisissez un id de modèle comme 'Qwen/Qwen2.5-7B-Instruct'",
|
| 3081 |
+
"launch.fetching": "Récupération de config.json depuis HF Hub…",
|
| 3082 |
+
"launch.layers": "couches",
|
| 3083 |
+
"launch.fetched_hint": "Choisissez GPU + contexte, puis Générer les flags.",
|
| 3084 |
+
"launch.need_fetch": "Récupérez d'abord un modèle (📥 Récupérer la géométrie).",
|
| 3085 |
+
"launch.verdict.fits": "TIENT — entièrement sur GPU",
|
| 3086 |
+
"launch.verdict.partial": "PARTIEL — certaines couches sur CPU (plus lent)",
|
| 3087 |
+
"launch.verdict.too_big": "TROP GROS — aucune couche ne tient sur ce GPU",
|
| 3088 |
+
"launch.r.weights": "Poids",
|
| 3089 |
+
"launch.r.kv": "KV cache",
|
| 3090 |
+
"launch.r.overhead": "Overhead / scratch",
|
| 3091 |
+
"launch.r.total": "Total",
|
| 3092 |
+
"launch.r.ngl": "Couches à décharger (-ngl)",
|
| 3093 |
+
"launch.r.all": "toutes",
|
| 3094 |
+
"launch.r.note": "La VRAM est une estimation (poids par bits/param, KV par géométrie des têtes, scratch grossier). d_horizon depuis γ_Padé. Vérifiez avec un chargement réel — laissez ~1 Go de marge.",
|
| 3095 |
+
"launch.warn.horizon_wasted": "Le contexte cible dépasse largement l'horizon d'attention du modèle — le KV au-delà est gâché ; le modèle n'y prêtera pas attention. (TAF)",
|
| 3096 |
+
"launch.warn.beyond_trained": "L dépasse le contexte d'entraînement — il faut aussi un RoPE scaling pour encoder les positions aussi loin (voir le Planificateur YaRN).",
|
| 3097 |
+
"launch.warn.no_mmap": "Toutes les couches tiennent → ajout de --no-mmap pour forcer les poids en VRAM physique (évite le problème illegal-memory / OOM-au-chargement de Blackwell).",
|
| 3098 |
+
"launch.warn.partial": "Seules certaines couches tiennent sur GPU — le reste tourne sur CPU (bien plus lent). Passez à un quant plus petit ou un contexte plus court pour tout faire tenir.",
|
| 3099 |
+
"launch.warn.cpu_only": "Aucune couche ne tient avec ces réglages — CPU seul. Utilisez un quant/contexte plus petit ou un GPU plus grand.",
|
| 3100 |
+
"launch.warn.no_params": "Impossible de lire le nombre de paramètres — la taille des poids est une estimation grossière par géométrie.",
|
| 3101 |
+
"launch.err.no_geom": "Récupérez d'abord un modèle pour lire sa géométrie.",
|
| 3102 |
+
"launch.err.no_gpu": "Choisissez un GPU ou saisissez une taille de VRAM personnalisée.",
|
| 3103 |
+
"launch.err.no_ctx": "Saisissez une longueur de contexte cible L.",
|
| 3104 |
+
"launch.copy": "Copier la commande",
|
| 3105 |
+
"help.v094.launch.title": "🚀 Générateur de flags de lancement",
|
| 3106 |
+
"help.v094.launch.body": "Les calculateurs de VRAM disent <em>si</em> un modèle tient ; ils ne donnent pas la commande. Ceci si. Choisissez un modèle (récupère la géométrie du <code>config.json</code> HF), un quant, un GPU et un contexte cible — il calcule la ventilation VRAM (poids + KV cache + scratch), combien de couches décharger (<code>-ngl</code>), et émet les commandes à coller <code>llama-server</code> et Ollama avec contexte <code>-c</code>, flash-attention <code>-fa</code>, type de KV-cache, et <code>--no-mmap</code> (le fix OOM Blackwell). Plus le contrôle de réalité TAF qu'aucun calculateur ne donne : si vous allouez du KV pour un contexte au-delà du d_horizon du modèle, il vous avertit que cette mémoire est gâchée. <em>Cas d'usage</em> : 'Quel <code>-ngl</code> pour Llama-70B-Q4 sur mon 4090 ?' → 39 couches sur 80, commande exacte, et une note si le contexte dépasse l'horizon utile.",
|
| 3107 |
"gguf.title": "🧊 Pont de validité GGUF",
|
| 3108 |
"gguf.tip": "<strong>Tenir dans la VRAM ≠ fonctionner</strong>. Les calculateurs GGUF/VRAM lisent les métadonnées d'un modèle pour dire si un quant <em>tient dans le GPU</em>. Ceci lit les MÊMES métadonnées (rope_theta, context_length, schéma de quant, géométrie des têtes) directement depuis l'en-tête <code>.gguf</code> via HTTP Range — sans télécharger des Go — et répond à ce qu'ils n'abordent pas : la qualité d'attention tient-elle vraiment, et de combien le quant l'érode-t-il (γ-shift, ΔPPL) ?",
|
| 3109 |
"gguf.desc": "Collez un dépôt GGUF (ex. <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>), choisissez un fichier de quant, et obtenez un verdict de qualité TAF : l'horizon d'attention effectif du modèle, plus de combien la quantification choisie décale γ pour <em>cette architecture précise</em>. Ne lit que l'en-tête du fichier dans votre navigateur.",
|
|
|
|
| 3734 |
"help.privacy.title": "Confidentialité",
|
| 3735 |
"help.privacy.body": "Tout s'exécute dans votre navigateur. Pas de télémétrie, pas d'analytique, pas de données envoyées ailleurs. Même le modèle LLM s'exécute localement via WebGPU/WebAssembly. Vos model_ids et questions ne quittent jamais cette page.",
|
| 3736 |
"help.source.title": "Code source et paper",
|
| 3737 |
+
"help.source.body": "Code : <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>Paper : <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a> ; arXiv à venir)<br>Dataset : <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 58 mesures γ sur 32 modèles (CC-BY-4.0)",
|
| 3738 |
|
| 3739 |
"footer.text": "© 2026 Carles Marin · Apache-2.0 · recherche indépendante · l'outil qui ferme la boucle du paper.",
|
| 3740 |
},
|
|
|
|
| 4307 |
"mode_desc.yarn": "生成精确的 rope_scaling 配置以将模型扩展到训练上下文之外 —— 外加 TAF 裁决:在目标长度下注意力质量是否真的撑得住。",
|
| 4308 |
"modes.gguf": "🧊 GGUF 桥",
|
| 4309 |
"mode_desc.gguf": "在浏览器内读取 GGUF 文件的元数据头(rope_theta、context_length、量化),给出 TAF 质量裁决 —— 显存计算器跳过的那个问题:塞得进且还能用吗?",
|
| 4310 |
+
"modes.launch": "🚀 启动参数",
|
| 4311 |
+
"mode_desc.launch": "模型 + GPU + 上下文 → 精确的 llama.cpp / Ollama 启动命令(-ngl、-c、--no-mmap、KV-cache 类型),附显存明细,以及当上下文超过可用视界时的 TAF 警告。",
|
| 4312 |
+
"launch.title": "🚀 启动参数生成器",
|
| 4313 |
+
"launch.tip": "<strong>精确参数 + 原因,不只是\"塞得进\"</strong>。显存计算器告诉你模型是否塞得进。本工具给你可复制粘贴的 <code>llama.cpp</code> / <code>Ollama</code> 命令 —— <code>-ngl</code> 卸载层数、<code>-c</code> 上下文、<code>--no-mmap</code>、KV-cache 类型 —— 以及 TAF 现实检查:若你为 128K 分配 KV 但模型注意力视界只有 32K,那部分显存就浪费了。",
|
| 4314 |
+
"launch.desc": "选择模型、GPU 和目标上下文 → 获得精确启动命令、显存明细(权重 + KV cache + 开销),以及卸载多少层。解决常见的\"该用什么 <code>-ngl</code>?\"/ Blackwell OOM 的猜测。",
|
| 4315 |
+
"launch.model_label": "HF 模型 id:",
|
| 4316 |
+
"launch.fetch_btn": "📥 获取几何",
|
| 4317 |
+
"launch.quant_label": "量化:",
|
| 4318 |
+
"launch.gpu_label": "GPU:",
|
| 4319 |
+
"launch.ctx_label": "目标上下文 L:",
|
| 4320 |
+
"launch.adv_label": "高级:",
|
| 4321 |
+
"launch.cache_label": "KV cache:",
|
| 4322 |
+
"launch.fa_label": "Flash attention (-fa)",
|
| 4323 |
+
"launch.gen_btn": "🚀 生成参数",
|
| 4324 |
+
"launch.need_id": "输入模型 id,如 'Qwen/Qwen2.5-7B-Instruct'",
|
| 4325 |
+
"launch.fetching": "正在从 HF Hub 获取 config.json…",
|
| 4326 |
+
"launch.layers": "层",
|
| 4327 |
+
"launch.fetched_hint": "选择 GPU + 上下文,然后生成参数。",
|
| 4328 |
+
"launch.need_fetch": "请先获取模型(📥 获取几何)。",
|
| 4329 |
+
"launch.verdict.fits": "塞得进 —— 全部在 GPU",
|
| 4330 |
+
"launch.verdict.partial": "部分 —— 部分层在 CPU(更慢)",
|
| 4331 |
+
"launch.verdict.too_big": "太大 —— 此 GPU 一层都放不下",
|
| 4332 |
+
"launch.r.weights": "权重",
|
| 4333 |
+
"launch.r.kv": "KV cache",
|
| 4334 |
+
"launch.r.overhead": "开销 / scratch",
|
| 4335 |
+
"launch.r.total": "总计",
|
| 4336 |
+
"launch.r.ngl": "卸载层数 (-ngl)",
|
| 4337 |
+
"launch.r.all": "全部",
|
| 4338 |
+
"launch.r.note": "显存为估计值(权重按 bits/参数,KV 按头几何,scratch 粗略)。d_horizon 来自 γ_Padé。请用真实加载核实 —— 留约 1 GB 余量。",
|
| 4339 |
+
"launch.warn.horizon_wasted": "目标上下文远超模型的注意力视界 —— 超出部分的 KV 内存被浪费;模型不会关注那里。(TAF)",
|
| 4340 |
+
"launch.warn.beyond_trained": "L 超过训练上下文 —— 还需要 RoPE scaling 才能编码那么远的位置(见 YaRN 规划器)。",
|
| 4341 |
+
"launch.warn.no_mmap": "所有层都放得下 → 已加 --no-mmap 强制权重进入物理显存(避免 Blackwell 的 illegal-memory / 加载时 OOM 问题)。",
|
| 4342 |
+
"launch.warn.partial": "只有部分层放进 GPU —— 其余在 CPU 运行(慢得多)。换更小的量化或更短的上下文以完整放入。",
|
| 4343 |
+
"launch.warn.cpu_only": "这些设置下一层都放不下 —— 仅 CPU。请用更小的量化/上下文或更大的 GPU。",
|
| 4344 |
+
"launch.warn.no_params": "无法读取参数量 —— 权重大小为按几何的粗略估计。",
|
| 4345 |
+
"launch.err.no_geom": "请先获取模型以读取其几何。",
|
| 4346 |
+
"launch.err.no_gpu": "请选择 GPU 或输入自定义显存大小。",
|
| 4347 |
+
"launch.err.no_ctx": "请输入目标上下文长度 L。",
|
| 4348 |
+
"launch.copy": "复制命令",
|
| 4349 |
+
"help.v094.launch.title": "🚀 启动参数生成器",
|
| 4350 |
+
"help.v094.launch.body": "显存计算器告诉你模型<em>是否</em>塞得进;它们不给你命令。本工具给。选择一个模型(从 HF <code>config.json</code> 获取几何)、一个量化、一个 GPU 和目标上下文 —— 它计算显存明细(权重 + KV cache + scratch)、卸载多少层(<code>-ngl</code>),并输出可复制粘贴的 <code>llama-server</code> 和 Ollama 命令,带 <code>-c</code> 上下文、<code>-fa</code> flash-attention、KV-cache 类型,以及 <code>--no-mmap</code>(Blackwell OOM 修复)。还有任何计算器都不给的 TAF 现实检查:若你为超过模型 d_horizon 的上下文分配 KV,它会警告你那部分内存被浪费。<em>用例</em>:'我的 4090 上 Llama-70B-Q4 该用什么 <code>-ngl</code>?' → 80 层中的 39 层、精确命令,以及若上下文超过可用视界的提示。",
|
| 4351 |
"gguf.title": "🧊 GGUF 有效性桥",
|
| 4352 |
"gguf.tip": "<strong>塞进显存 ≠ 能用</strong>。GGUF/显存计算器读取模型元数据来告诉你某量化<em>是否塞得进 GPU</em>。本工具通过 HTTP Range 直接从 <code>.gguf</code> 头读取同样的元数据(rope_theta、context_length、量化方案、注意力头几何)—— 无需下载数 GB —— 并回答它们不答的:注意力质量是否真的撑得住,量化又侵蚀了多少(γ-shift、ΔPPL)?",
|
| 4353 |
"gguf.desc": "粘贴一个 GGUF 仓库(如 <code>Qwen/Qwen2.5-7B-Instruct-GGUF</code>),选择一个量化文件,获得 TAF 质量裁决:模型的有效注意力视界,以及所选量化对<em>这个具体架构</em>的 γ 位移有多大。只在浏览器内读取文件头。",
|
|
|
|
| 4978 |
"help.privacy.title": "隐私",
|
| 4979 |
"help.privacy.body": "一切都在您的浏览器中运行。无遥测,无分析,无数据发送到任何地方。即使是 LLM 模型也通过 WebGPU/WebAssembly 在本地运行。您的 model_ids 和问题永不离开此页面。",
|
| 4980 |
"help.source.title": "源代码和论文",
|
| 4981 |
+
"help.source.body": "源代码: <a href=\"https://github.com/karlesmarin/tafagent\" target=\"_blank\">github.com/karlesmarin/tafagent</a><br>论文: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href=\"https://zenodo.org/records/20314038\" target=\"_blank\">Zenodo</a>; arXiv 即将)<br>数据集: <a href=\"https://huggingface.co/datasets/karlexmarin/taf-attention-decay\" target=\"_blank\">taf-attention-decay</a> — 32个模型上的58次γ测量 (CC-BY-4.0)",
|
| 4982 |
|
| 4983 |
"footer.text": "© 2026 Carles Marin · Apache-2.0 · 独立研究 · 闭合论文回路的工具。",
|
| 4984 |
},
|
|
@@ -0,0 +1,170 @@
|
|
+// Launch-Flag Generator (v0.9.4 anti-bullshit pack)
+//
+// Input a model + GPU + target context → the exact llama.cpp / Ollama launch
+// flags (-ngl layers to offload, -c context, --no-mmap, cache-type), with a
+// VRAM breakdown AND the TAF angle the pure VRAM calculators miss: "you CAN
+// allocate KV for 128K, but this model's attention horizon is ~32K — context
+// past that is wasted memory." Solves the recurring r/LocalLLaMA pain of
+// guessing -ngl / hitting Blackwell OOM. All browser-only.
+
+import { gammaPade } from "./gamma_check.js";
+import { dHorizon } from "./yarn_planner.js";
+
+// Curated GPU VRAM presets (GB). Unified-memory Macs included (shared pool).
+export const GPU_PRESETS = [
+  { id: "rtx3060", label: "RTX 3060 12GB", vram: 12 },
+  { id: "rtx4060ti", label: "RTX 4060 Ti 16GB", vram: 16 },
+  { id: "rtx4070", label: "RTX 4070 12GB", vram: 12 },
+  { id: "rtx4080", label: "RTX 4080 16GB", vram: 16 },
+  { id: "rtx3090", label: "RTX 3090 24GB", vram: 24 },
+  { id: "rtx4090", label: "RTX 4090 24GB", vram: 24 },
+  { id: "rtx5090", label: "RTX 5090 32GB", vram: 32 },
+  { id: "a100_40", label: "A100 40GB", vram: 40 },
+  { id: "a100_80", label: "A100 80GB", vram: 80 },
+  { id: "h100", label: "H100 80GB", vram: 80 },
+  { id: "h200", label: "H200 141GB", vram: 141 },
+  { id: "mac32", label: "Mac 32GB (unified)", vram: 24 }, // ~75% usable for GPU
+  { id: "mac64", label: "Mac 64GB (unified)", vram: 48 },
+  { id: "mac128", label: "Mac 128GB (unified)", vram: 96 },
+];
+
+// Effective bits-per-weight per GGUF quant (includes K-quant block overhead).
+export const QUANT_BPW = {
+  F16: 16.0,
+  Q8_0: 8.5,
+  Q6_K: 6.56,
+  Q5_K_M: 5.67,
+  Q4_K_M: 4.83,
+  Q4_0: 4.55,
+  Q3_K_M: 3.91,
+  Q2_K: 2.63,
+};
+
+// KV-cache element bytes per cache dtype.
+const CACHE_BYTES = { fp16: 2, q8_0: 1, q4_0: 0.5 };
+
+const GB = 1024 ** 3;
+
+// Estimate parameter count from geometry when the model card doesn't state it.
+// Uses the exact decoder layout (attention with GQA + SwiGLU MLP + embeddings)
+// when intermediate_size is known — the 12·h² shortcut undercounts modern
+// large-FFN models (Qwen2.5-7B is really 7.6B, not the ~5.4B the shortcut gives).
+export function estimateNParams({ nParams, hidden, nLayers, vocab, intermediate, nKvHeads, headDim, tieEmbeddings }) {
+  if (Number.isFinite(nParams) && nParams > 0) return nParams;
+  if (!hidden || !nLayers) return null;
+  let perLayer;
+  if (intermediate) {
+    const kvDim = (nKvHeads && headDim) ? nKvHeads * headDim : hidden; // GQA shrinks K,V
+    const attn = 2 * hidden * hidden + 2 * hidden * kvDim; // q,o + k,v
+    const mlp = 3 * hidden * intermediate; // gate,up,down (SwiGLU)
+    perLayer = attn + mlp;
+  } else {
+    perLayer = 12 * hidden * hidden; // fallback heuristic
+  }
+  const embed = vocab ? (tieEmbeddings ? 1 : 2) * vocab * hidden : 0;
+  return perLayer * nLayers + embed;
+}
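The gap between the two estimates named in the comment above can be checked by hand. The following is a minimal standalone sketch, not the module's own code: the function names are hypothetical, and the Qwen2.5-7B geometry is recalled from its published `config.json` and should be treated as an assumption. Biases and norm weights are ignored, as they are in `estimateNParams`.

```javascript
// Decoder parameter count from geometry: GQA attention + SwiGLU MLP + embeddings.
function countDecoderParams({ hidden, nLayers, intermediate, nKvHeads, headDim, vocab, tieEmbeddings }) {
  const kvDim = nKvHeads * headDim;                       // GQA: K and V are narrower than Q
  const attn = 2 * hidden * hidden + 2 * hidden * kvDim;  // q,o + k,v projections
  const mlp = 3 * hidden * intermediate;                  // gate, up, down
  const embed = (tieEmbeddings ? 1 : 2) * vocab * hidden; // input (+ output) embeddings
  return (attn + mlp) * nLayers + embed;
}

// The 12·h² per-layer shortcut, for comparison.
function shortcutParams({ hidden, nLayers, vocab, tieEmbeddings }) {
  const embed = (tieEmbeddings ? 1 : 2) * vocab * hidden;
  return 12 * hidden * hidden * nLayers + embed;
}

// Assumed Qwen2.5-7B geometry (verify against the real config before relying on it).
const qwen25_7b = {
  hidden: 3584, nLayers: 28, intermediate: 18944,
  nKvHeads: 4, headDim: 128, vocab: 152064, tieEmbeddings: false,
};

const exactB = countDecoderParams(qwen25_7b) / 1e9;
const shortcutB = shortcutParams(qwen25_7b) / 1e9;
console.log(exactB.toFixed(1), shortcutB.toFixed(1)); // prints "7.6 5.4"
```

At Q4_K_M (4.83 bits per weight) that difference of roughly 2.2B parameters is more than 1 GB of weights, which is enough to flip a fit verdict on a 12 GB card.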
+
+// KV cache bytes for the whole model at context L.
+function kvCacheBytes(nLayers, nKvHeads, headDim, L, cacheType) {
+  const elem = CACHE_BYTES[cacheType] ?? 2;
+  return 2 /* K+V */ * nLayers * nKvHeads * headDim * L * elem;
+}
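To get a feel for how quickly this term grows, here is a small standalone sketch of the same formula. The helper name and the 80-layer, 8-KV-head, 128-dimension shape are illustrative assumptions (a Llama-70B-like geometry), and the per-element byte sizes simply mirror the `CACHE_BYTES` table above.

```javascript
const GIB = 1024 ** 3;
const BYTES_PER_ELEM = { fp16: 2, q8_0: 1, q4_0: 0.5 };

// KV cache size in GiB: 2 tensors (K and V) per layer, one head-dim vector
// per KV head per token, times the element size of the cache dtype.
function kvCacheGiB({ nLayers, nKvHeads, headDim, ctx, cacheType = "fp16" }) {
  const elem = BYTES_PER_ELEM[cacheType] ?? 2;
  return (2 * nLayers * nKvHeads * headDim * ctx * elem) / GIB;
}

// Hypothetical Llama-70B-like shape.
const shape = { nLayers: 80, nKvHeads: 8, headDim: 128 };

const at8k = kvCacheGiB({ ...shape, ctx: 8192 });                          // 2.5
const at128k = kvCacheGiB({ ...shape, ctx: 131072 });                      // 40
const at128kQ8 = kvCacheGiB({ ...shape, ctx: 131072, cacheType: "q8_0" }); // 20
console.log(at8k, at128k, at128kQ8);
```

The cache scales linearly with context, so a 128K allocation costs sixteen times the 8K one. That is the memory the horizon warning is about when the target context lies well past `d_horizon`.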
+
+export function planLaunch(opts) {
+  const {
+    nParams, nLayers, nKvHeads, headDim, hidden, ropeTheta, ctxTrain,
+    quant = "Q4_K_M", vramGB, targetCtx, cacheType = "fp16", flashAttn = true,
+  } = opts;
+
+  const out = { ok: false, warnings: [] };
+  if (!nLayers || !nKvHeads || !headDim) { out.verdict = "no_geometry"; return out; }
+  if (!Number.isFinite(vramGB) || vramGB <= 0) { out.verdict = "no_gpu"; return out; }
+  if (!Number.isFinite(targetCtx) || targetCtx <= 0) { out.verdict = "no_ctx"; return out; }
+
+  const bpw = QUANT_BPW[quant] ?? 4.83;
+  const N = estimateNParams({
+    nParams, hidden, nLayers, vocab: opts.vocab,
+    intermediate: opts.intermediate, nKvHeads, headDim, tieEmbeddings: opts.tieEmbeddings,
+  });
+
+  const weightsB = N ? (N * bpw / 8) : null;
+  const kvB = kvCacheBytes(nLayers, nKvHeads, headDim, targetCtx, cacheType);
+  // Compute/scratch buffer: roughly scales with context × hidden. Flash-attention
+  // shrinks the attention scratch substantially. Coarse estimate, flagged as such.
+  const scratchB = (flashAttn ? 0.25 : 0.6) * GB + (hidden ? 0.5 * hidden * targetCtx * 2 : 0);
+  const overheadB = 0.4 * GB + scratchB;
+
+  const weightsGB = weightsB != null ? weightsB / GB : null;
+  const kvGB = kvB / GB;
+  const overheadGB = overheadB / GB;
+  const totalGB = (weightsGB ?? 0) + kvGB + overheadGB;
+
+  // Layer-offload (-ngl). ~88% of weights live in transformer layers; the rest
+  // (embeddings/output) load with any GPU offload.
+  const layerFrac = 0.88;
+  const layerWeightsGB = weightsGB != null ? weightsGB * layerFrac : null;
+  const nonLayerGB = weightsGB != null ? weightsGB * (1 - layerFrac) : 0;
+  const kvPerLayerGB = kvGB / nLayers;
+  const perLayerGB = (layerWeightsGB != null ? layerWeightsGB / nLayers : 0) + kvPerLayerGB;
+
+  let ngl, allOnGpu, fits;
+  if (weightsGB == null) {
+    ngl = null; allOnGpu = false; fits = false;
+    out.warnings.push({ code: "no_params" });
+  } else if (totalGB <= vramGB) {
+    ngl = nLayers; allOnGpu = true; fits = true;
+  } else {
+    const avail = vramGB - overheadGB - nonLayerGB;
+    ngl = perLayerGB > 0 ? Math.max(0, Math.floor(avail / perLayerGB)) : 0;
+    ngl = Math.min(ngl, nLayers);
+    allOnGpu = false; fits = false;
+  }
+
+  // TAF horizon: does the model's attention actually reach the context you're
+  // paying KV memory for? This is the differentiator vs pure VRAM calculators.
+  const theta = Number(ropeTheta) || 10000;
+  const gammaTrain = ctxTrain ? gammaPade(theta, ctxTrain) : null;
+  const dHoriz = gammaTrain != null ? dHorizon(theta, gammaTrain) : null;
+  const horizonWasted = dHoriz != null && targetCtx > dHoriz * 1.25;
+  if (horizonWasted) out.warnings.push({ code: "horizon_wasted", params: { dHoriz, target: targetCtx } });
+  if (ctxTrain && targetCtx > ctxTrain) out.warnings.push({ code: "beyond_trained", params: { ctxTrain, target: targetCtx } });
+  if (allOnGpu) out.warnings.push({ code: "no_mmap_blackwell" });
+  if (!fits && ngl > 0) out.warnings.push({ code: "partial_offload", params: { ngl, nLayers } });
+  if (!fits && ngl === 0) out.warnings.push({ code: "cpu_only", params: {} });
+
+  out.ok = true;
+  Object.assign(out, {
+    verdict: fits ? "fits" : (ngl > 0 ? "partial" : "too_big"),
+    nParams: N, bpw, quant, cacheType, flashAttn,
+    weightsGB, kvGB, overheadGB, totalGB, vramGB,
+    ngl, allOnGpu, nLayers,
+    theta, dHoriz, gammaTrain, ctxTrain, targetCtx,
+  });
+  return out;
+}
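The layer-offload arithmetic in `planLaunch` can be followed with concrete numbers. This is a standalone sketch under stated assumptions, not the module's API: `layersThatFit` is a hypothetical helper, the 70.6e9 parameter count is an assumed figure for a 70B-class model, and the KV and overhead inputs are precomputed from the formulas above for an 8K context with flash attention.

```javascript
const GIB_BYTES = 1024 ** 3;

// Same split as the partial-offload branch: non-layer weights and overhead are
// paid up front, then each offloaded layer costs its weights plus its KV share.
function layersThatFit({ nParams, bpw, nLayers, kvGiB, overheadGiB, vramGiB, layerFrac = 0.88 }) {
  const weightsGiB = (nParams * bpw) / 8 / GIB_BYTES;
  const nonLayerGiB = weightsGiB * (1 - layerFrac);
  const perLayerGiB = (weightsGiB * layerFrac + kvGiB) / nLayers;
  const availGiB = vramGiB - overheadGiB - nonLayerGiB;
  const ngl = Math.max(0, Math.floor(availGiB / perLayerGiB));
  return Math.min(ngl, nLayers); // capped: everything fits when ngl reaches nLayers
}

// Assumed inputs: ~70.6B params, 80 layers, Q4_K_M (4.83 bpw), 8K context.
// kvGiB = 2.5 (80 layers, 8 KV heads, head dim 128, fp16 cache).
// overheadGiB = 0.4 fixed + 0.25 flash-attention scratch + 0.0625 (hidden 8192 × 8192 tokens).
const inputs = { nParams: 70.6e9, bpw: 4.83, nLayers: 80, kvGiB: 2.5, overheadGiB: 0.7125 };

const on24GB = layersThatFit({ ...inputs, vramGiB: 24 });
const on80GB = layersThatFit({ ...inputs, vramGiB: 80 });
const on4GB = layersThatFit({ ...inputs, vramGiB: 4 });
console.log(on24GB, on80GB, on4GB); // prints "39 80 0"
```

On a 24 GB card the weights alone are about 39.7 GiB, so roughly half the layers stay on the CPU. The 80 GB and 4 GB cases correspond to the "fits" and "too_big" verdicts respectively.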
+
+// Build the copy-paste commands for both engines.
+export function launchCommands(plan, modelRef = "<model.gguf>") {
+  // plan.ngl is null when the weight size is unknown; fall back to 0 (CPU-only)
+  // so the command never contains "-ngl null".
+  const nglStr = plan.allOnGpu ? "99" : String(plan.ngl ?? 0);
+  const cache = plan.cacheType !== "fp16" ? ` -ctk ${plan.cacheType} -ctv ${plan.cacheType}` : "";
+  const fa = plan.flashAttn ? " -fa" : "";
+  const mmap = plan.allOnGpu ? " --no-mmap" : "";
+  const llamacpp =
+    `llama-server -m ${modelRef} \\\n` +
+    ` -ngl ${nglStr} -c ${plan.targetCtx}${fa}${cache}${mmap}`;
+
+  // Ollama: Modelfile params + env. num_gpu = layers on GPU.
+  const olEnv = [
+    plan.flashAttn ? "OLLAMA_FLASH_ATTENTION=1" : null,
+    plan.cacheType !== "fp16" ? `OLLAMA_KV_CACHE_TYPE=${plan.cacheType}` : null,
+  ].filter(Boolean).join(" ");
+  const ollama =
+    (olEnv ? olEnv + " \\\n" : "") +
+    `ollama run <model>\n` +
+    `# Modelfile / params:\n` +
+    `PARAMETER num_ctx ${plan.targetCtx}\n` +
+    `PARAMETER num_gpu ${nglStr === "99" ? plan.nLayers : nglStr}`;
+
+  return { llamacpp, ollama };
+}
|
|
@@ -40,6 +40,7 @@ import {
|
|
| 40 |
} from "./longscore.js";
|
| 41 |
import { planExtension, suggestRopeType } from "./yarn_planner.js";
|
| 42 |
import { listGgufFiles, fetchGgufMetadata, ggufToConfig, quantFromFilename, analyzeGguf } from "./gguf_bridge.js";
|
|
|
|
| 43 |
|
| 44 |
// Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
|
| 45 |
// Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
|
|
@@ -235,6 +236,7 @@ document.addEventListener("click", (e) => {
|
|
| 235 |
hub: "hub-section",
|
| 236 |
yarn: "yarn-section",
|
| 237 |
gguf: "gguf-section",
|
|
|
|
| 238 |
}[targetMode];
|
| 239 |
if (sectionId) {
|
| 240 |
const sec = document.getElementById(sectionId);
|
|
@@ -259,7 +261,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
|
|
| 259 |
"diagnose-section", "phase-section", "unmask-section",
|
| 260 |
"template-section", "arena-section", "contam-section",
|
| 261 |
"quant-section", "drift-section", "niah-section",
|
| 262 |
-
"saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "longscore-section", "hub-section", "yarn-section", "gguf-section"].forEach(id => {
|
| 263 |
const el = $(id);
|
| 264 |
if (el) el.style.display = "none";
|
| 265 |
});
|
|
@@ -280,6 +282,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
|
|
| 280 |
hub: "hub-section",
|
| 281 |
yarn: "yarn-section",
|
| 282 |
gguf: "gguf-section",
|
|
|
|
| 283 |
};
|
| 284 |
const sectionId = sectionMap[mode];
|
| 285 |
if (sectionId) $(sectionId).style.display = "";
|
|
@@ -295,6 +298,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
|
|
| 295 |
if (mode === "hub") initHub();
|
| 296 |
if (mode === "yarn") initYarn();
|
| 297 |
if (mode === "gguf") initGguf();
|
|
|
|
| 298 |
});
|
| 299 |
});
|
| 300 |
|
|
@@ -4951,6 +4955,127 @@ function renderGgufComparison(cfg, rows) {
|
|
| 4951 |
<p class="subtle" style="font-size:0.88em;">${t("gguf.r.note")}</p>`;
|
| 4952 |
}
|
| 4953 |
|
| 4954 |
// ════════════════════════════════════════════════════════════════════
|
| 4955 |
// Bootstrap
|
| 4956 |
// ════════════════════════════════════════════════════════════════════
|
|
|
|
| 40 |
} from "./longscore.js";
|
| 41 |
import { planExtension, suggestRopeType } from "./yarn_planner.js";
|
| 42 |
import { listGgufFiles, fetchGgufMetadata, ggufToConfig, quantFromFilename, analyzeGguf } from "./gguf_bridge.js";
|
| 43 |
+
import { GPU_PRESETS, QUANT_BPW, planLaunch, launchCommands } from "./launch_flags.js";
|
| 44 |
|
| 45 |
// Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
|
| 46 |
// Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
|
|
|
|
| 236 |
hub: "hub-section",
|
| 237 |
yarn: "yarn-section",
|
| 238 |
gguf: "gguf-section",
|
| 239 |
+
launch: "launch-section",
|
| 240 |
}[targetMode];
|
| 241 |
if (sectionId) {
|
| 242 |
const sec = document.getElementById(sectionId);
|
|
|
|
| 261 |
"diagnose-section", "phase-section", "unmask-section",
|
| 262 |
"template-section", "arena-section", "contam-section",
|
| 263 |
"quant-section", "drift-section", "niah-section",
|
| 264 |
+
"saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "longscore-section", "hub-section", "yarn-section", "gguf-section", "launch-section"].forEach(id => {
|
| 265 |
const el = $(id);
|
| 266 |
if (el) el.style.display = "none";
|
| 267 |
});
|
|
|
|
| 282 |
hub: "hub-section",
|
| 283 |
yarn: "yarn-section",
|
| 284 |
gguf: "gguf-section",
|
| 285 |
+
launch: "launch-section",
|
| 286 |
};
|
| 287 |
const sectionId = sectionMap[mode];
|
| 288 |
if (sectionId) $(sectionId).style.display = "";
|
|
|
|
| 298 |
if (mode === "hub") initHub();
|
| 299 |
if (mode === "yarn") initYarn();
|
| 300 |
if (mode === "gguf") initGguf();
|
| 301 |
+
if (mode === "launch") initLaunch();
|
| 302 |
});
|
| 303 |
});
|
| 304 |
|
|
|
|
| 4955 |
<p class="subtle" style="font-size:0.88em;">${t("gguf.r.note")}</p>`;
|
| 4956 |
}
|
| 4957 |
|
| 4958 |
+
// ════════════════════════════════════════════════════════════════════
|
| 4959 |
+
// 🚀 Launch-Flag Generator (v0.9.4)
|
| 4960 |
+
// ════════════════════════════════════════════════════════════════════
|
| 4961 |
+
let _launchWired = false;
|
| 4962 |
+
let _launchGeom = null; // fetched model geometry
|
| 4963 |
+
function initLaunch() {
|
| 4964 |
+
if (_launchWired) return;
|
| 4965 |
+
_launchWired = true;
|
| 4966 |
+
|
| 4967 |
+
// Populate GPU presets.
|
| 4968 |
+
const gpuSel = $("launch-gpu");
|
| 4969 |
+
if (gpuSel && !gpuSel.options.length) {
|
| 4970 |
+
gpuSel.innerHTML = GPU_PRESETS.map(g => `<option value="${g.vram}">${escapeHtml(g.label)}</option>`).join("");
|
| 4971 |
+
gpuSel.value = "24"; // sensible default (4090)
|
| 4972 |
+
}
|
| 4973 |
+
|
| 4974 |
+
const fetchBtn = $("launch-fetch-btn");
|
| 4975 |
+
const modelEl = $("launch-model");
|
| 4976 |
+
// Picking from autocomplete auto-fetches geometry (matches the other modes).
|
| 4977 |
+
if (modelEl) attachHfAutocomplete(modelEl, { onSelect: () => fetchBtn?.click() });
|
| 4978 |
+
|
| 4979 |
+
fetchBtn?.addEventListener("click", async () => {
|
| 4980 |
+
const id = (modelEl.value || "").trim();
|
| 4981 |
+
if (!id) { $("launch-status").textContent = "⚠ " + t("launch.need_id"); return; }
|
| 4982 |
+
$("launch-status").textContent = "⏳ " + t("launch.fetching");
|
| 4983 |
+
fetchBtn.disabled = true;
|
| 4984 |
+
state.lastModelId = id;
|
| 4985 |
+
try {
|
| 4986 |
+
const cfg = await fetchHfConfig(id);
|
| 4987 |
+
const nAttn = cfg.num_attention_heads ?? null;
|
| 4988 |
+
const rs = (cfg.rope_scaling && typeof cfg.rope_scaling === "object") ? cfg.rope_scaling : {};
|
| 4989 |
+
_launchGeom = {
|
| 4990 |
+
nLayers: cfg.num_hidden_layers ?? null,
|
| 4991 |
+
nKvHeads: cfg.num_key_value_heads ?? nAttn,
|
| 4992 |
+
headDim: cfg.head_dim ?? (cfg.hidden_size && nAttn ? cfg.hidden_size / nAttn : null),
|
| 4993 |
+
hidden: cfg.hidden_size ?? null,
|
| 4994 |
+
vocab: cfg.vocab_size ?? null,
|
| 4995 |
+
intermediate: cfg.intermediate_size ?? null,
|
| 4996 |
+
tieEmbeddings: cfg.tie_word_embeddings ?? false,
|
| 4997 |
+
nParams: cfg.num_parameters ?? null,
|
| 4998 |
+
ropeTheta: cfg.rope_theta ?? 10000,
|
| 4999 |
+
ctxTrain: rs.original_max_position_embeddings ?? cfg.max_position_embeddings ?? null,
|
| 5000 |
+
};
|
| 5001 |
+
if (!$("launch-ctx").value && _launchGeom.ctxTrain) $("launch-ctx").value = _launchGeom.ctxTrain;
|
| 5002 |
+
const via = cfg.__via_mirror ? ` (via ${escapeHtml(cfg.__via_mirror)})` : "";
|
| 5003 |
+
$("launch-status").innerHTML = `✅ <strong>${escapeHtml(id)}</strong>${via}: ${_launchGeom.nLayers} ${t("launch.layers")}, ` +
|
| 5004 |
+
`GQA ${nAttn}:${_launchGeom.nKvHeads}, θ=${_thetaFmt(_launchGeom.ropeTheta)}, ctx ${_yarnFmtK(_launchGeom.ctxTrain)}. ${t("launch.fetched_hint")}`;
|
| 5005 |
+
} catch (err) {
|
| 5006 |
+
$("launch-status").textContent = `❌ ${err.message}`;
|
| 5007 |
+
} finally {
|
| 5008 |
+
fetchBtn.disabled = false;
|
| 5009 |
+
}
|
| 5010 |
+
});
|
| 5011 |
+
|
| 5012 |
+
$("launch-gen-btn")?.addEventListener("click", () => {
|
| 5013 |
+
if (!_launchGeom) { $("launch-status").textContent = "⚠ " + t("launch.need_fetch"); return; }
|
| 5014 |
+
const vram = parseFloat($("launch-vram").value) || parseFloat(gpuSel?.value);
|
| 5015 |
+
const plan = planLaunch({
|
| 5016 |
+
..._launchGeom,
|
| 5017 |
+
quant: $("launch-quant").value,
|
| 5018 |
+
vramGB: vram,
|
| 5019 |
+
targetCtx: parseFloat($("launch-ctx").value),
|
| 5020 |
+
cacheType: $("launch-cache").value,
|
| 5021 |
+
flashAttn: $("launch-fa").checked,
|
| 5022 |
+
});
|
| 5023 |
+
renderLaunch(plan);
|
| 5024 |
+
});
|
| 5025 |
+
}
|
| 5026 |
+
|
| 5027 |
+
function _launchWarnText(w) {
|
| 5028 |
+
switch (w.code) {
|
| 5029 |
+
case "horizon_wasted": return `${t("launch.warn.horizon_wasted")} (d_horizon ≈ ${_yarnFmtK(w.params.dHoriz)}, L=${_yarnFmtK(w.params.target)})`;
|
| 5030 |
+
case "beyond_trained": return `${t("launch.warn.beyond_trained")} (${_yarnFmtK(w.params.ctxTrain)} → ${_yarnFmtK(w.params.target)})`;
|
| 5031 |
+
case "no_mmap_blackwell":return t("launch.warn.no_mmap");
|
| 5032 |
+
case "partial_offload": return `${t("launch.warn.partial")} (${w.params.ngl}/${w.params.nLayers})`;
|
| 5033 |
+
case "cpu_only": return t("launch.warn.cpu_only");
|
| 5034 |
+
case "no_params": return t("launch.warn.no_params");
|
| 5035 |
+
default: return w.code;
|
| 5036 |
+
}
|
| 5037 |
+
}
|
| 5038 |
+
|
| 5039 |
+
function renderLaunch(p) {
|
| 5040 |
+
const out = $("launch-output");
|
| 5041 |
+
if (!out) return;
|
| 5042 |
+
out.style.display = "";
|
| 5043 |
+
const errMap = { no_geometry: "launch.err.no_geom", no_gpu: "launch.err.no_gpu", no_ctx: "launch.err.no_ctx" };
|
| 5044 |
+
if (errMap[p.verdict]) { out.innerHTML = `<div class="gc-validity-warning">⚠ ${t(errMap[p.verdict])}</div>`; return; }
|
| 5045 |
+
|
| 5046 |
+
const meta = ({
|
| 5047 |
+
fits: { emoji: "✅", cls: "v-yes" },
|
| 5048 |
+
partial: { emoji: "⚠️", cls: "v-deg" },
|
| 5049 |
+
too_big: { emoji: "🚨", cls: "v-no" },
|
| 5050 |
+
})[p.verdict] || { emoji: "❓", cls: "v-deg" };
|
| 5051 |
+
|
| 5052 |
+
const cmds = launchCommands(p);
|
| 5053 |
+
const td = "padding:3px 12px 3px 0;";
|
| 5054 |
+
const gb = n => (n == null ? "—" : n.toFixed(1) + " GB");
|
| 5055 |
+
const warnHtml = p.warnings.map(w => `<li>${_launchWarnText(w)}</li>`).join("");
|
| 5056 |
+
|
| 5057 |
+
out.innerHTML = `
|
| 5058 |
+
<p><span class="verdict-badge ${meta.cls}">${meta.emoji} ${t("launch.verdict." + p.verdict)}</span></p>
|
| 5059 |
+
<table style="border-collapse:collapse;font-size:0.95em;margin:0.5em 0;">
|
| 5060 |
+
<tr><td style="${td}">${t("launch.r.weights")}</td><td>${gb(p.weightsGB)} <span class="subtle">(${p.quant}, ${p.bpw} bpw)</span></td></tr>
|
| 5061 |
+
<tr><td style="${td}">${t("launch.r.kv")}</td><td>${gb(p.kvGB)} <span class="subtle">(${p.cacheType}${p.flashAttn ? ", -fa" : ""})</span></td></tr>
|
| 5062 |
+
<tr><td style="${td}">${t("launch.r.overhead")}</td><td>${gb(p.overheadGB)}</td></tr>
|
| 5063 |
+
<tr style="border-top:1px solid var(--border);"><td style="${td}"><strong>${t("launch.r.total")}</strong></td><td><strong>${gb(p.totalGB)}</strong> / ${gb(p.vramGB)} VRAM</td></tr>
|
| 5064 |
+
<tr><td style="${td}">${t("launch.r.ngl")}</td><td><strong>${p.allOnGpu ? `${p.nLayers} (${t("launch.r.all")})` : `${p.ngl} / ${p.nLayers}`}</strong></td></tr>
|
| 5065 |
+
</table>
|
| 5066 |
+
<h3>llama.cpp</h3>
|
| 5067 |
+
<pre class="diag-cmd-box">${escapeHtml(cmds.llamacpp)}</pre>
|
| 5068 |
+
<button id="launch-copy-llama" class="secondary">📋 ${t("launch.copy")}</button>
|
| 5069 |
+
<h3 style="margin-top:0.8em;">Ollama</h3>
|
| 5070 |
+
<pre class="diag-cmd-box">${escapeHtml(cmds.ollama)}</pre>
|
| 5071 |
+
${warnHtml ? `<ul style="font-size:0.9em;margin-top:0.8em;opacity:0.9;">${warnHtml}</ul>` : ""}
|
| 5072 |
+
<p class="subtle" style="font-size:0.86em;">${t("launch.r.note")}</p>`;
|
| 5073 |
+
|
| 5074 |
+
$("launch-copy-llama")?.addEventListener("click", async () => {
|
| 5075 |
+
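try { await navigator.clipboard.writeText(cmds.llamacpp); $("launch-copy-llama").textContent = "✓ " + t("yarn.copied"); } catch (e) { /* Clipboard API unavailable (insecure context or permission denied): leave the button label unchanged. */ }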
|
| 5076 |
+
});
|
| 5077 |
+
}
|
| 5078 |
+
|
| 5079 |
// ════════════════════════════════════════════════════════════════════
|
| 5080 |
// Bootstrap
|
| 5081 |
// ════════════════════════════════════════════════════════════════════
|
|
@@ -157,7 +157,7 @@ unless otherwise noted by the contributor. The TAF Agent code itself is
|
|
| 157 |
|
| 158 |
- 🔬 [TAF Agent web tool](https://karlesmarin.github.io/tafagent) — the diagnostic itself
|
| 159 |
- 📦 [TAF Agent source](https://github.com/karlesmarin/tafagent) — open source
|
| 160 |
-
- 📄 [Underlying paper](https://zenodo.org/records/
|
| 161 |
*Predicting How Transformers Attend*
|
| 162 |
|
| 163 |
---
|
|
|
|
| 157 |
|
| 158 |
- 🔬 [TAF Agent web tool](https://karlesmarin.github.io/tafagent) — the diagnostic itself
|
| 159 |
- 📦 [TAF Agent source](https://github.com/karlesmarin/tafagent) — open source
|
| 160 |
+
- 📄 [Underlying paper](https://zenodo.org/records/20314038) — Marin 2026,
|
| 161 |
*Predicting How Transformers Attend*
|
| 162 |
|
| 163 |
---
|
|
@@ -0,0 +1,77 @@
|
|
| 1 |
+
import { chromium } from "playwright";
|
| 2 |
+
const b = await chromium.launch({ headless: true });
|
| 3 |
+
const p = await (await b.newContext()).newPage();
|
| 4 |
+
const errors=[]; const benign=s=>/40\d/.test(s);
|
| 5 |
+
p.on("console",m=>{if(m.type()==="error"&&!benign(m.text()))errors.push("[c]"+m.text());});
|
| 6 |
+
p.on("pageerror",e=>errors.push("[pe]"+e.message));
|
| 7 |
+
const log=s=>process.stdout.write(s+"\n"); let pass=0,fail=0;
|
| 8 |
+
const check=(n,c,x="")=>{log(`${c?" OK ":" FAIL"} ${n} ${x}`);c?pass++:fail++;};
|
| 9 |
+
|
| 10 |
+
await p.goto("http://127.0.0.1:8000/index.html",{waitUntil:"domcontentloaded",timeout:90000});
|
| 11 |
+
await p.waitForTimeout(2500);
|
| 12 |
+
await p.click(`.lang-btn[data-lang="en"]`); await p.waitForTimeout(200);
|
| 13 |
+
check("module loads, 0 errors", errors.length===0, `(${errors.length})`);
|
| 14 |
+
|
| 15 |
+
await p.click('[data-mode-link="launch"]',{timeout:5000}); await p.waitForTimeout(400);
|
| 16 |
+
check("section visible", await p.evaluate(()=>{const s=document.querySelector("#launch-section");return s&&getComputedStyle(s).display!=="none";}));
|
| 17 |
+
check("GPU presets populated", await p.evaluate(()=>document.querySelector("#launch-gpu").options.length>5));
|
| 18 |
+
|
| 19 |
+
log("\n── Fetch geometry ──");
|
| 20 |
+
await p.fill("#launch-model","Qwen/Qwen2.5-7B-Instruct");
|
| 21 |
+
await p.keyboard.press("Escape");
|
| 22 |
+
await p.click("#launch-fetch-btn"); await p.waitForTimeout(3500);
|
| 23 |
+
const st=await p.evaluate(()=>document.querySelector("#launch-status").innerText);
|
| 24 |
+
check("geometry fetched (layers/GQA shown)", /layers|GQA|θ=/.test(st), st.slice(0,70));
|
| 25 |
+
check("ctx auto-filled", await p.evaluate(()=>!!document.querySelector("#launch-ctx").value));
|
| 26 |
+
|
| 27 |
+
async function gen({quant,gpu,vram,ctx,cache,fa}){
|
| 28 |
+
if(quant) await p.selectOption("#launch-quant",quant);
|
| 29 |
+
if(gpu) await p.selectOption("#launch-gpu",gpu);
|
| 30 |
+
await p.fill("#launch-vram",vram!=null?String(vram):"");
|
| 31 |
+
if(ctx!=null) await p.fill("#launch-ctx",String(ctx));
|
| 32 |
+
if(cache) await p.selectOption("#launch-cache",cache);
|
| 33 |
+
if(fa!=null){const c=await p.isChecked("#launch-fa"); if(c!==fa) await p.click("#launch-fa");}
|
| 34 |
+
await p.click("#launch-gen-btn"); await p.waitForTimeout(300);
|
| 35 |
+
return p.evaluate(()=>{const o=document.querySelector("#launch-output");return{
|
| 36 |
+
verdict:o.querySelector(".verdict-badge")?.innerText?.trim()||"", text:o.innerText};});
|
| 37 |
+
}
|
| 38 |
+
|
| 39 |
+
log("\n── FITS case (7B Q4 on 24GB) ──");
|
| 40 |
+
let r=await gen({quant:"Q4_K_M",gpu:"24",vram:null,ctx:32768,cache:"fp16",fa:true});
|
| 41 |
+
check("verdict FITS", /FITS/.test(r.verdict), r.verdict);
|
| 42 |
+
check("ngl = all layers", /all|28/.test(r.text));
|
| 43 |
+
check("llama-server cmd present", /llama-server/.test(r.text));
|
| 44 |
+
check("ollama cmd present", /ollama|num_ctx/.test(r.text));
|
| 45 |
+
check("--no-mmap added when all-on-GPU", /--no-mmap/.test(r.text));
|
| 46 |
+
check("-fa present", /-fa/.test(r.text));
|
| 47 |
+
check("VRAM breakdown (weights/KV)", /Weights|KV cache/.test(r.text));
|
| 48 |
+
|
| 49 |
+
log("\n── PARTIAL case (7B Q4 on tiny 3GB custom) ──");
|
| 50 |
+
r=await gen({quant:"Q4_K_M",vram:3,ctx:8192,fa:true});
|
| 51 |
+
check("verdict PARTIAL or TOO BIG", /PARTIAL|TOO BIG/.test(r.verdict), r.verdict);
|
| 52 |
+
check("partial offload warning or cpu-only", /CPU|layers fit|smaller quant/i.test(r.text));
|
| 53 |
+
|
| 54 |
+
log("\n── cache quant changes KV flag ──");
|
| 55 |
+
r=await gen({quant:"Q4_K_M",gpu:"24",vram:null,ctx:32768,cache:"q8_0",fa:true});
|
| 56 |
+
check("KV cache q8_0 → -ctk/-ctv in cmd", /-ctk q8_0/.test(r.text));
|
| 57 |
+
|
| 58 |
+
log("\n── beyond-trained warning ──");
|
| 59 |
+
r=await gen({quant:"Q4_K_M",gpu:"80",vram:null,ctx:262144,cache:"fp16",fa:true});
|
| 60 |
+
check("L beyond trained → warning", /trained|RoPE|YaRN/i.test(r.text), "L=256K");
|
| 61 |
+
|
| 62 |
+
log("\n── error: generate before fetch (skipped) ──");
|
| 63 |
+
// Skipped: the fetched geometry cannot be cleared without reloading the page, so the generate-before-fetch error path is not exercised here.
|
| 64 |
+
|
| 65 |
+
log("\n── 4 languages ──");
|
| 66 |
+
for(const lang of ["es","fr","zh","en"]){
|
| 67 |
+
await p.click(`.lang-btn[data-lang="${lang}"]`); await p.waitForTimeout(250);
|
| 68 |
+
const lbl=await p.evaluate(()=>document.querySelector('.mode-btn[data-mode="launch"]')?.textContent?.trim());
|
| 69 |
+
check(`${lang}: tab label`, lbl&&lbl.length>3, lbl);
|
| 70 |
+
}
|
| 71 |
+
|
| 72 |
+
check("copy button present", await p.evaluate(()=>!!document.querySelector("#launch-copy-llama")));
|
| 73 |
+
|
| 74 |
+
log(`\n=== ${pass} passed, ${fail} failed · JS errors: ${errors.length} ===`);
|
| 75 |
+
errors.slice(0,10).forEach(e=>log(e));
|
| 76 |
+
await b.close();
|
| 77 |
+
process.exit(fail>0?1:0);
|
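The test's `benign` predicate, `/40\d/`, treats any console error containing "40" followed by a digit as ignorable, which would also hide unrelated messages such as a timing of "4012 ms". A tighter filter could look like the sketch below. It assumes Chromium's usual wording for failed requests ("... the server responded with a status of 404 ..."); `isBenignHttpError` and `legacyBenign` are hypothetical names, not part of test_launch.mjs.

```javascript
// Sketch only: a narrower variant of the `benign` predicate in test_launch.mjs.
// Assumption: failed requests are logged with the phrase "status of <code>".
const legacyBenign = text => /40\d/.test(text);
const isBenignHttpError = text => /\bstatus of 40\d\b/.test(text);

const samples = [
  "Failed to load resource: the server responded with a status of 404 ()",
  "Render took 4012 ms",
  "TypeError: plan.warnings is undefined",
];

for (const s of samples) {
  // Print how each predicate classifies the message.
  console.log(`${legacyBenign(s)}\t${isBenignHttpError(s)}\t${s}`);
}
```

Only the first sample is an HTTP 40x failure; the second is the kind of message the broader pattern would silently drop.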