🔬 TAF Agent

Test ANY transformer LLM before you spend GPU/$.

✓ RoPE-MHA ✓ RoPE-GQA ✓ ALiBi ✓ AbsPE ✓ SWA ✓ SSM (Mamba) ✓ Any HuggingFace public model

All computation runs locally in your browser. Free. Unlimited. Auditable.

Built by an independent researcher. Open source. Not affiliated with any model vendor.

📘 TAF Agent — User Manual

What does it do?

Predicts practical viability of any transformer LLM before you spend GPU/$. Answers questions like "will this model work at L=32K?" or "should I train custom or use API?" using deterministic Python formulas (TAF — Thermodynamic Attention Framework).

How to use — 7 modes

📇 Profile: paste model id → all recipes at once = TAF Card. Best starting point.

🆚 Compare: 2-3 models side-by-side on same recipe. Best when choosing between candidates.

🔍 Inspect config: paste raw config.json → tool parses + runs full Profile. For private models, in-development configs, or models not yet on HF Hub.

💬 Ask plain English: free-form question, in-browser LLM picks the recipe. Best for casual exploration.

📋 Recipe + form: manual selection, full parameter control. Best when you want exact control.

🩺 Diagnose CLI: generate Python command to measure γ on your local machine (transformers + numpy). Fast ≈5 min CPU; full ≈20–60 min GPU. Output JSON re-uploadable via Inspect.

📊 Phase diagram: scatter plot of 23 panel models on (log θ, γ) plane. Hagedorn line γ=1 separates Phase A from Phase B. Click a dot to load that model into Recipe form.
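As an illustration of what the Inspect config mode has to extract from a pasted config.json, here is a minimal sketch that reads the standard Hugging Face keys and labels the attention layout. The function name `classify_attention` and the returned dictionary are hypothetical; the tool's own parser may read more fields or use different labels.

```python
import json


def classify_attention(config_text: str) -> dict:
    """Label the attention layout from a raw Hugging Face config.json string.

    Hypothetical helper for illustration; not the tool's actual parser.
    """
    cfg = json.loads(config_text)
    n_heads = cfg["num_attention_heads"]
    # Configs that omit num_key_value_heads are treated as plain multi-head attention.
    n_kv = cfg.get("num_key_value_heads") or n_heads
    if n_kv == n_heads:
        kind = "MHA"
    elif n_kv == 1:
        kind = "MQA"
    else:
        kind = "GQA"
    return {
        "attention": kind,
        "n_heads": n_heads,
        "n_kv_heads": n_kv,
        # Fall back to hidden_size / n_heads when head_dim is not stated explicitly.
        "head_dim": cfg.get("head_dim") or cfg["hidden_size"] // n_heads,
        "rope_theta": cfg.get("rope_theta"),
        "sliding_window": cfg.get("sliding_window"),
    }


example = (
    '{"hidden_size": 4096, "num_attention_heads": 32, '
    '"num_key_value_heads": 8, "rope_theta": 500000.0}'
)
info = classify_attention(example)
print(info["attention"], info["n_kv_heads"], info["head_dim"])
```

The example string is a trimmed, hand-written config, not a full model file.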

The 8 recipes available

X-1 Custom training vs API — compares cost of training your own model vs paying for API access.

Try: "Should I train an 8B custom model or use GPT-4o for 50M tokens/month?"
Answer types: YES (custom) / NO (API) with break-even months.

X-2 Long Context Viability — predicts if a model serves a target context length reliably.

Try: "Will Meta-Llama-3-8B handle 32000 tokens for retrieval?"
Chains: γ_Padé → decomposition → d_horizon → NIAH ceiling → hallucination → KV memory.
Verdict: YES / DEGRADED / NO with mitigation if needed.

X-3 Budget pre-flight — given $ budget, what model is feasible to train?

Try: "I have $5000, what model can I train?"
Answer: GO / TINY-MODEL / MEMORY-LIMITED with concrete N (params) and D (tokens).

X-5 Hardware selection — which GPU should I use to serve at target throughput?

Try: "Cheapest hardware to serve Llama-3-8B at 10M tokens/day"
Answer: best GPU + $/Mtok + capacity vs target.

X-19 KV Compression decision — should I use soft decay, hard cutoff, or literature methods?

Try: "How to compress KV cache for Qwen2.5-7B at 32K?"
Answer: USE SOFT DECAY / USE D_f CUTOFF / USE LITERATURE METHODS / USE HARD T_train.
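The KV memory step at the end of the X-2 chain, and the motivation for X-19, can be estimated with the standard KV-cache size formula. This is a sketch of that generic formula, not necessarily the exact expression the tool uses; the Llama-3-8B values (32 layers, 8 KV heads, head dimension 128) are taken from its public config, and bf16 storage (2 bytes per element) is an assumption.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Uncompressed KV-cache size in bytes.

    The leading factor 2 accounts for one K tensor and one V tensor per layer.
    bytes_per_elem=2 assumes bf16/fp16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Llama-3-8B at 32,000 tokens, batch size 1, bf16 cache (assumed precision).
llama3_8b = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_000)
print(f"{llama3_8b / 2**30:.2f} GiB")  # prints "3.91 GiB"
```

With grouped-query attention the cache scales with the number of KV heads, not the number of query heads, which is why GQA models are much cheaper to serve at long context than MHA models of the same width.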

v0.4 (session 29 findings)

What's new in v0.4 (session 29 findings, 2026-04-28): three diagnostic recipes derived from cross-model panel analysis (n=22 LLMs).

X-21 Imprint Purity Diagnostic — predicts γ on RANDOM tokens via ν=−1/(2π); how clean is the model's RoPE prediction?

Try: "How clean is the RoPE prediction on Llama-3-8B?"
Answer: predicted γ_random + purity diagnostic (CLEAN / OVER-IMPRINTED / UNDER-IMPRINTED).

Learned-imprint slope ν = −1/(2π): RoPE rotation period 2π drives a positional bias on weights, proportional to log(N_params). Even random tokens show this scaling. ν is DERIVED — not fitted (empirical err 0.3%).

X-22 Compute-Context Invariant — does γ × log(N²·D) lie in panel band 51.2 ± 16.8? Detects scaling/training anomalies.

Try: "Does Mistral-7B fit the compute-context invariant?"
Answer: K = γ·log(N²·D), z-score, IN-BAND or OUTLIER.

Chinchilla-attention invariant K: γ × log(N²·D) ≈ 51.2 ± 16.8 (CV=0.329). Connects compute scaling and attention exponent into a single dimensionless number.
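The X-22 check can be sketched in a few lines. The panel mean and spread (51.2 and 16.8) come from the text above; the natural logarithm, the |z| ≤ 1 cut-off for IN-BAND, and the example inputs are assumptions made for illustration, and the tool may use different conventions.

```python
import math

PANEL_MEAN = 51.2  # panel value of K quoted in the manual
PANEL_STD = 16.8   # panel spread quoted in the manual


def compute_context_invariant(gamma, n_params, n_tokens, z_max=1.0):
    """K = gamma * log(N^2 * D), with a z-score against the panel band.

    Assumptions: natural log; IN-BAND means |z| <= z_max with z_max = 1.
    """
    # log(N^2 * D) written as 2*log(N) + log(D) to avoid huge intermediates.
    log_term = 2.0 * math.log(n_params) + math.log(n_tokens)
    k = gamma * log_term
    z = (k - PANEL_MEAN) / PANEL_STD
    verdict = "IN-BAND" if abs(z) <= z_max else "OUTLIER"
    return {"K": k, "z": z, "verdict": verdict}


# Hypothetical inputs: gamma = 0.7, N = 7e9 parameters, D = 2e12 training tokens.
result = compute_context_invariant(gamma=0.7, n_params=7e9, n_tokens=2e12)
print(f"K = {result['K']:.1f}, z = {result['z']:+.2f}, {result['verdict']}")
```

Under the natural-log assumption, a 7B-parameter model trained on about 2T tokens with γ ≈ 0.7 lands close to the panel mean, which is consistent with the quoted band.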

X-23 IH-Phase Detector — pre- or post-induction-head? Cheap probe via sign(γ_text − γ_random).

Try: "Is Qwen2.5-7B post-induction-head?"
Answer: CONFIRMED PRE-IH / CONFIRMED POST-IH / ANOMALY (with size-vs-Δγ consistency check).

Δγ as IH probe: sign(γ_text − γ_random) > 0 ⟺ post-induction-head. Cheaper than running an in-context-learning benchmark.
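The sign test behind X-23 is small enough to write out. This sketch covers only the sign rule stated above; the size-vs-Δγ consistency check that produces the ANOMALY verdict is not specified here and is left out. The tolerance `tol` and the γ values in the example are hypothetical.

```python
def ih_phase(gamma_text, gamma_random, tol=0.0):
    """Classify pre/post induction-head phase from the sign of delta gamma.

    tol is a hypothetical dead zone around zero; the manual states only the sign rule.
    """
    delta = gamma_text - gamma_random
    if delta > tol:
        return "POST-IH"
    if delta < -tol:
        return "PRE-IH"
    return "UNDETERMINED"


# Hypothetical measurements, for illustration only.
print(ih_phase(gamma_text=0.70, gamma_random=0.55))  # POST-IH
print(ih_phase(gamma_text=0.40, gamma_random=0.52))  # PRE-IH
```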

γ-cluster on famous constants (intriguing, n=4): CodeLlama-13b γ=0.382 ≈ 1−1/φ (golden conjugate, err 0.0003); pythia-1.4b γ=0.705 ≈ 1/√2; Llama-2-7b γ=0.287 ≈ 1−1/√2; Mistral-Nemo γ=0.428 ≈ log_10(e). Caveat: could be coincidence.

🆕 v0.4: New diagnostics (session 31)

Four new diagnostic functions derived in session 31 (2026-04-30) from cross-of-crosses formula games and Socratic interrogation. Available in taf_browser.py §33.

Architectural Concentration: γ_text ≈ γ_Padé − 0.012·n_kv. Cross-panel correlational law (R²=0.30). Caveat: not a per-model predictor.

PDI (Padé Deviation Index): PDI = d_horizon_obs/T_eval. Traffic light: green (≈1), orange (>>1), yellow (<<1), red (Phase B negative).

4-bit Shift Predictor: for MHA, R²(bf16)<0.9 → γ rises and R²>0.99 → γ drops; GQA is precision-robust regardless.

Critical Exponents Bundle: ν_c, β_c, η_c (=γ−1, CORRECTED), α_C, γ_susc with AM-GM minimum at γ=1−1/√2≈0.293.
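The PDI traffic light can be sketched as follows. The ratio PDI = d_horizon_obs/T_eval and the four colours come from the text above; the numeric meaning of "≈1", ">>1" and "<<1" is not stated, so the factor-of-two window (`factor=2.0`) is purely a placeholder assumption, as is the function name.

```python
def pdi_light(d_horizon_obs, t_eval, factor=2.0):
    """Return (PDI, colour) for the Pade Deviation Index.

    factor is a hypothetical threshold: PDI within [1/factor, factor] counts
    as "about 1". The tool's real thresholds are not documented here.
    """
    if t_eval <= 0:
        raise ValueError("t_eval must be positive")
    pdi = d_horizon_obs / t_eval
    if pdi < 0:
        return pdi, "red"      # negative horizon: Phase B
    if pdi > factor:
        return pdi, "orange"   # observed horizon far beyond the evaluation length
    if pdi < 1.0 / factor:
        return pdi, "yellow"   # observed horizon far short of the evaluation length
    return pdi, "green"


# Hypothetical horizons against a 4096-token evaluation length.
for d in (4000, 20000, 500, -1200):
    print(d, pdi_light(d, 4096))
```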

🔬 v0.5: Machine-verified consistency (session 32)

Sage Groebner basis + Lean Mathlib4 dual-tool verification of 15 algebraic identities of TAF critical exponents. First transformer-attention framework with formal machine-proof backing.

Algebraic Consistency Check: given a measured γ, verifies 12 D-SAGE identities (D-SAGE-1: 2η²+η·γ_χ+1=0, β·χ=−1, α+χ=2, etc.). All passing = framework intact. Failures indicate bf16 outliers / quantization artifacts.

D-SAGE-1 (★★ core): quadratic identity 2η² + η·γ_χ + 1 = 0 (Sage Groebner-discovered, Lean-verified). Replaces the incorrect 'triple closure' claim. Refutes paper 1's η=2γ algebraically.

Paper 1 erratum (η correction): Paper 1 originally claimed η = 2γ. Sage Groebner + Lean Mathlib4 proved this fails (residual (-4γ³+5γ+1)/(1-γ) > 0 ∀γ ∈ Phase A). Correct value: η = γ−1, satisfying D-SAGE-1.

Reproducibility: all 15 theorems are machine-proven in Lean Mathlib4 (1973 jobs, build success). Sage script: analysis/sage_recursive_sweep_2026-04-30.sage. Lean code: lean_taf/taf/Taf/Identities.lean.
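The D-SAGE-1 identity and the erratum residual can be spot-checked numerically. The closed form used for γ_χ below is an assumption: it is what you get by solving 2η² + η·γ_χ + 1 = 0 for γ_χ with η = γ − 1, and the manual does not state it explicitly. Two things support it: substituting η = 2γ reproduces the residual (−4γ³+5γ+1)/(1−γ) quoted in the erratum, and its minimum falls at γ = 1 − 1/√2, the AM-GM minimum quoted for γ_susc. This is a numeric sanity check, not a replacement for the Lean proofs.

```python
import math


def eta(gamma):
    """Corrected exponent from the erratum: eta = gamma - 1."""
    return gamma - 1.0


def gamma_chi(gamma):
    """Assumed closed form, obtained by solving D-SAGE-1 for gamma_chi."""
    e = eta(gamma)
    return -(2.0 * e * e + 1.0) / e  # equals 2*(1 - gamma) + 1/(1 - gamma)


def dsage1_residual(eta_value, gamma_chi_value):
    """Left-hand side of D-SAGE-1: 2*eta^2 + eta*gamma_chi + 1."""
    return 2.0 * eta_value**2 + eta_value * gamma_chi_value + 1.0


for g in (0.1, 0.287, 0.5, 0.9):  # sample points with 0 < gamma < 1
    wrong = dsage1_residual(2.0 * g, gamma_chi(g))   # paper 1's eta = 2*gamma
    quoted = (-4.0 * g**3 + 5.0 * g + 1.0) / (1.0 - g)
    print(f"gamma={g}: residual with eta=2*gamma = {wrong:.4f}, quoted form = {quoted:.4f}")

# Locate the minimum of the assumed gamma_chi on a grid over (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
g_min = min(grid, key=gamma_chi)
print(f"argmin = {g_min:.4f}, 1 - 1/sqrt(2) = {1 - 1 / math.sqrt(2):.4f}")
```

The residual with η = γ − 1 is zero by construction here, so the informative checks are the match with the quoted residual and the location of the minimum.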

🆕 v0.6 — γ predicted-vs-observed + Cardy ΔH + Lean badges

v0.6 (2026-05-06): three new diagnostics live in the TAF Card under 🔬 Diagnostics. All run in your browser; γ_observed comes from the Diagnose CLI on real weights.

TAF Card layout (new in v0.6)

After clicking 🚀 Generate full profile, the card shows a hero strip on top (architecture class + meta + 3 pills: aggregate verdict ✅/⚠/❌, γ headline, 🧲 Anti-Ising if Phase A) and four expandable sections:

📋 Recipes (open by default): verdict per dimension.
🔬 Diagnostics: key numbers, γ predicted vs observed, what-if explorer.
✓ Verification: Sage+Lean algebraic consistency, falsification F1-F23.
📂 Provenance & share: calibration audit + JSON download / share link / registry submit.

Click any header to expand. Every variable has an inline tooltip.

γ predicted vs observed

Enter the empirically measured γ for your model; the tool computes η = θ_eff_obs / θ_eff_Padé and classifies the result into one of 5 regimes.

Cardy ΔH diagnostic

ΔH_Cardy = log(θ_eff_obs / θ_nominal). Entropy shift between observed effective θ and nominal θ. Strong negative = compression entropy; near zero = nominal match. Complements η for borderline cases.
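The two v0.6 numbers, η and ΔH_Cardy, are simple ratios and can be reproduced by hand. The sketch below assumes a natural logarithm and uses made-up θ values; the boundaries of the five regimes are not given in this manual, so no classification is attempted.

```python
import math


def gamma_obs_diagnostics(theta_eff_obs, theta_eff_pade, theta_nominal):
    """Return (eta, delta_h_cardy) as defined in the manual.

    eta           = theta_eff_obs / theta_eff_pade
    delta_h_cardy = log(theta_eff_obs / theta_nominal), natural log assumed
    """
    eta_ratio = theta_eff_obs / theta_eff_pade
    delta_h = math.log(theta_eff_obs / theta_nominal)
    return eta_ratio, delta_h


# Hypothetical values: nominal theta 500000, Pade prediction 400000, observed 250000.
eta_ratio, delta_h = gamma_obs_diagnostics(250_000, 400_000, 500_000)
print(f"eta = {eta_ratio:.3f}, delta_H_Cardy = {delta_h:.3f}")
```

A negative ΔH_Cardy, as in this example, means the observed effective θ is below the nominal one; a value near zero means the two agree.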

Lean + Mathlib verification badges

TAF identities (Anti-Ising, D-SAGE-1 quadratic, Padé z-substitution, etc.) are formally machine-proven in Lean Mathlib4. Source: github.com/karlesmarin/lean-taf. Anyone can clone + lake build to re-verify. The 🧲 Anti-Ising pill in the hero strip is one such badge.

Variable glossary (also embedded in TAF Card)

Every variable in the TAF Card has an inline ⓘ tooltip. The complete list: γ, γ_Padé, γ_decomposed, γ_observed, θ, θ_eff_obs, θ_eff_Padé, η, ΔH_Cardy, χ, d_horizon, L_NIAH, KV memory, regime. Hover any ⓘ for the definition + paper section.

Adding new models (3 ways)

The audit chain

Every result shows the full Computation Chain: each formula step with its inputs, output, and interpretation. Click any step to expand. Cited section numbers (§26.1, §19.1, etc.) refer to the underlying paper for the derivation.

The plain-English answer

After the deterministic chain runs, an in-browser LLM (Qwen2.5-0.5B, ~350MB cached after first load) synthesizes a plain-English summary. The numbers above come from the deterministic Python chain; the summary is LLM-generated, so verify it against the chain if in doubt.

Common parameters explained

What to look for in verdicts

Privacy

Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.

Source & paper

Source code: github.com/karlesmarin/tafagent
Paper: Marin 2026 — Predicting How Transformers Attend (Zenodo; arXiv forthcoming)
Dataset: taf-attention-decay — 58 γ-measurements across 32 models (CC-BY-4.0)

🎯 Mode: four ways to use the tool.
📇 Profile: paste a model id → all 5 recipes at once = TAF Card.
🆚 Compare: 2-3 models side-by-side on one recipe.
💬 Ask: free-form question, browser LLM picks the recipe.
📋 Recipe: manual selection with full form control.

Quickest start: paste any HuggingFace model id (e.g. meta-llama/Meta-Llama-3-8B), click Profile. See all 5 recipes scored in seconds.

💡 Quick start: pick any preset → click Generate. Or paste a model id from HF Hub trending → 📥 Fetch → Generate.

📇 Profile a model: one-click full diagnosis. Paste any HF model id (or pick a preset). The tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single TAF Card showing the verdict per dimension, key numbers, and architecture classification.

Use case: "I'm evaluating Qwen2.5-32B for production — what's its full viability profile?" → paste id → Profile → done.

For technicians: use this when you need a complete viability snapshot of a candidate model. Outputs match the format of the paper's γ-decomposition section.

📂 Import a shared TAF result

Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.

🌐 Recent community submissions

Live feed from the public registry. Click any submission to view full analysis.

🔬 Paper predictions — falsification status

The TAF framework rests on falsifiable predictions (F1-F23). Each is empirically tested. Here's the live status of every prediction in the paper.