Diagnose any transformer LLM in 30 seconds. Free. No GPU. No signup.
Predicts whether a model will work for your use case before you spend money or time. Everything runs in your browser — your inputs never leave this tab.
Built by an independent researcher. Open source. Not affiliated with any model vendor.
📘 TAF Agent — User Manual
What does it do?
Predicts practical viability of any transformer LLM
before you spend GPU/$. Answers questions like "will this model work at L=32K?" or
"should I train custom or use API?" using deterministic Python formulas (TAF — Thermodynamic Attention Framework).
How to use — 7 modes
📇 Profile: paste model id → all recipes at once = TAF Card. Best starting point.
🆚 Compare: 2-3 models side-by-side on same recipe. Best when choosing between candidates.
🔍 Inspect config: paste raw config.json → tool parses + runs full Profile. For private models, in-development configs, or models not yet on HF Hub.
💬 Ask plain English: free-form question, in-browser LLM picks the recipe. Best for casual exploration.
📋 Recipe + form: manual selection, full parameter control. Best when you want exact control.
🩺 Diagnose CLI: generate Python command to measure γ on your local machine (transformers + numpy). Fast ≈5 min CPU; full ≈20–60 min GPU. Output JSON re-uploadable via Inspect.
📊 Phase diagram: scatter plot of 23 panel models on (log θ, γ) plane. Hagedorn line γ=1 separates Phase A from Phase B. Click a dot to load that model into Recipe form.
The 8 recipes available
X-1 Custom training vs API — compares cost of training your own model vs paying for API access.
Try: "Should I train an 8B custom model or use GPT-4o for 50M tokens/month?"
Answer types: YES (custom) / NO (API) with break-even months.
X-2 Long Context Viability — predicts if a model serves a target context length reliably.
Try: "Will Meta-Llama-3-8B handle 32000 tokens for retrieval?"
Chains: γ_Padé → decomposition → d_horizon → NIAH ceiling → hallucination → KV memory.
Verdict: YES / DEGRADED / NO with mitigation if needed.
X-3 Budget pre-flight — given $ budget, what model is feasible to train?
Try: "I have $5000, what model can I train?"
Answer: GO / TINY-MODEL / MEMORY-LIMITED with concrete N (params) and D (tokens).
X-5 Hardware selection — which GPU should I use to serve at target throughput?
Try: "Cheapest hardware to serve Llama-3-8B at 10M tokens/day"
Answer: best GPU + $/Mtok + capacity vs target.
X-19 KV Compression decision — should I use soft decay, hard cutoff, or literature methods?
Try: "How to compress KV cache for Qwen2.5-7B at 32K?"
Answer: USE SOFT DECAY / USE D_f CUTOFF / USE LITERATURE METHODS / USE HARD T_train.
— v0.4 (sesión 29 findings) —
What's new in v0.4 (sesión 29 findings 2026-04-28): three diagnostic recipes derived from cross-model panel analysis (n=22 LLMs).
X-21 Imprint Purity Diagnostic — predicts γ on RANDOM tokens via ν=−1/(2π); how clean is the model's RoPE prediction?
Try: "How clean is the RoPE prediction on Llama-3-8B?"
Answer: predicted γ_random + purity diagnostic (CLEAN / OVER-IMPRINTED / UNDER-IMPRINTED).
Learned-imprint slope ν = −1/(2π): RoPE rotation period 2π drives a positional bias on weights, proportional to log(N_params). Even random tokens show this scaling. ν is DERIVED — not fitted (empirical err 0.3%).
X-22 Compute-Context Invariant — does γ × log(N²·D) lie in panel band 51.2 ± 16.8? Detects scaling/training anomalies.
Try: "Does Mistral-7B fit the compute-context invariant?"
Answer: K = γ·log(N²·D), z-score, IN-BAND or OUTLIER.
Chinchilla-attention invariant K: γ × log(N²·D) ≈ 51.2 ± 16.8 (CV=0.329). Connects compute scaling and attention exponent into a single dimensionless number.
X-23 IH-Phase Detector — pre- or post-induction-head? Cheap probe via sign(γ_text − γ_random).
Δγ as IH probe: sign(γ_text − γ_random) > 0 ⟺ post-induction-head. Cheaper than running an in-context-learning benchmark.
γ-cluster on famous constants (intriguing, n=4): CodeLlama-13b γ=0.382 ≈ 1−1/φ (golden conjugate, err 0.0003); pythia-1.4b γ=0.705 ≈ 1/√2; Llama-2-7b γ=0.287 ≈ 1−1/√2; Mistral-Nemo γ=0.428 ≈ log_10(e). Caveat: could be coincidence.
🆕 v0.4 — New diagnostics (sesion 31)
Four new diagnostic functions derived sesion 31 (2026-04-30) from cross-of-crosses formula games + Sócratic interrogation. Available in taf_browser.py §33.
Architectural Concentration — γ_text ≈ γ_Padé − 0.012·n_kv. Cross-panel correlational law (R²=0.30). Caveat: not per-model predictor.
PDI — Padé Deviation Index — PDI = d_horizon_obs/T_eval. Traffic light: green (≈1), orange (>>1), yellow (<<1), red (Phase B negative).
v0.6 (2026-05-06): three new diagnostics live in the TAF Card under 🔬 Diagnostics. All run in your browser; γ_observed comes from the Diagnose CLI on real weights.
TAF Card layout (new in v0.6)
After clicking 🚀 Generate full profile the card shows: a hero strip on top (architecture class + meta + 3 pills: aggregate verdict ✅/⚠/❌, γ headline, 🧲 Anti-Ising if Phase A) and four expandable sections: 📋 Recipes (open by default — verdict per dimension), 🔬 Diagnostics (key numbers, γ predicted vs observed, what-if explorer), ✓ Verification (Sage+Lean algebraic consistency, falsification F1-F23), 📂 Provenance & share (calibration audit + JSON download / share link / registry submit). Click any header to expand. Every variable has an inline ⓘ tooltip.
γ predicted vs observed
Enter the empirically-measured γ from your model and the tool computes η = θ_eff_obs / θ_eff_Padé and classifies into one of 5 regimes:
Normal (η ∈ [0.85, 1.15]) — model uses its full nominal context. Use case: validate a new release before adopting it.
Fraud (η < 0.01) — nominal θ inflated; model behaves as if θ ≪ advertised. Use case: detect YaRN/marketing inflation (CodeLlama / Mistral-Nemo pattern).
Compressed (η < 0.5) — context compressed; model attends shorter than nominal θ. Use case: spot RLHF/instruction-tuning compression (LLaMA-2 pattern).
Over-Padé (η > 1.5) — model attends farther than Padé predicts. Use case: identify Lerch-corrected regime or undertrained early checkpoints (pythia-1b pattern).
SWA random-corpus (γ_obs > 1.05 with random_corpus=Yes) — sliding-window attention signature. Use case: confirm Mistral / Gemma SWA on random tokens.
Cardy ΔH diagnostic
ΔH_Cardy = log(θ_eff_obs / θ_nominal). Entropy shift between observed effective θ and nominal θ. Strong negative = compression entropy; near zero = nominal match. Complements η for borderline cases.
Lean + Mathlib verification badges
TAF identities (Anti-Ising, D-SAGE-1 quadratic, Padé z-substitution, etc.) are formally machine-proven in Lean Mathlib4. Source: github.com/karlesmarin/lean-taf. Anyone can clone + lake build to re-verify. The 🧲 Anti-Ising pill in the hero strip is one such badge.
Variable glossary (also embedded in TAF Card)
Every variable in the TAF Card has an inline ⓘ tooltip. The complete list: γ, γ_Padé, γ_decomposed, γ_observed, θ, θ_eff_obs, θ_eff_Padé, η, ΔH_Cardy, χ, d_horizon, L_NIAH, KV memory, regime. Hover any ⓘ for the definition + paper section.
Adding new models (3 ways)
Preset list: 11 popular models curated. Just select from dropdown.
HF Hub fetch: paste any model id (e.g. Qwen/Qwen2.5-32B-Instruct),
click 📥 Fetch. Browser downloads config.json directly from HuggingFace, fills the form. Works for any public model.
Manual: fill the form fields directly with values from the model card.
🆕 v0.7 — Anti-bullshit pack (4 new modes)
v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference — pure metadata + math.
🪟 Context Unmasker
Detects when max_position_embeddings is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. Use case: before paying GPU for 32k context, verify the model actually attends that far.
📜 Chat-template Sniffer
Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting --apply_chat_template silently halves multi-turn accuracy. Use case: before reporting a benchmark score, confirm you applied the template correctly.
🎯 Arena-Elo CI Reconstructor
Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a "statistical ties" panel listing pairs whose CIs overlap. Try the Load sample button. Use case: before declaring "model A beats model B", verify their CIs don't overlap.
🧪 Contamination Prior
Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. Use case: decide which scores to trust when comparing two models.
The audit chain
Every result shows the full Computation Chain — each formula step with its inputs,
output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
to the underlying paper for derivation.
The plain-English answer
After the deterministic chain runs, an in-browser LLM (Qwen2.5-0.5B, ~350MB cached after first load)
synthesizes a plain-English summary. The numbers above are always correct (deterministic Python);
the synthesis is LLM-generated — verify against the chain if in doubt.
Common parameters explained
θ (rope_theta): RoPE base frequency. Higher = more long-range capacity. Typical: 10000 (early), 500000 (Llama-3), 1000000 (Qwen2.5).
T_train: max context the model was trained on. From max_position_embeddings.
T_eval: your target inference context length. The key knob.
n_kv_heads < n_attention_heads: model uses GQA (Grouped Query Attention). Reduces KV memory but pushes γ toward Hagedorn.
has_SWA: model uses Sliding Window Attention (Mistral, gemma-2).
n_params: total parameter count. Threshold ~400M for induction-head emergence.
What to look for in verdicts
YES / GO — proceed with confidence; numbers support the choice.
DEGRADED / TINY-MODEL — works but with caveats; read the action.
NO / MEMORY-LIMITED — don't proceed as-is; mitigation provided.
Privacy
Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model
runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.
Custom train vs API: which is cheaper for your traffic?
Long context: will it handle 32k / 128k tokens reliably?
Budget: with $X, what model can you train from scratch?
Hardware: which GPU to serve N tokens/day?
KV cache: how to compress without breaking quality?
Imprint purity: how clean is the model's positional encoding?
Compute-context: does the model fit the empirical band?
IH-phase: pre- or post-induction-head?
🔬 Diagnostics
γ predicted vs observed — auto-classifies the model into 5 regimes (normal · fraud / inflated context · compressed · over-Padé · sliding-window)
Cardy ΔH — entropy shift between observed and nominal context
Falsification dashboard — checks 23 specific predictions (F1–F23)
Algebraic consistency — 8 mathematical identities the model must satisfy
✓ Formally verified math
37 theorems machine-proven in Lean 4 + Mathlib4
Click any badge → opens the source line on GitHub
Verify yourself: lake build (≈5 s after cache fetch)
📤 Export & share
JSON · Markdown · LaTeX (paper-ready)
Reproducible share link (state encoded in URL)
Submit to community registry on GitHub
🆕 v0.7 anti-bullshit pack
🪟 Unmask — config.json claims 32k? See if it actually attends that far
📜 Chat-template — exact CLI flag so lm-eval doesn't silently halve your accuracy
🎯 Arena CI — recover the confidence intervals Chatbot Arena hides
🧪 Contamination — rate 20+ benchmarks for contamination probability
Architectures supported (click to expand)
✓ RoPE-MHA Multi-Head Attention: each token position attends through several parallel heads at once.✓ RoPE-GQA Grouped Query Attention: queries share fewer keys/values than heads (saves memory but pushes γ toward Hagedorn).✓ ALiBi Attention with Linear Biases: position info is a learned slope added to attention scores, no rotation.✓ AbsPE Absolute Position Embeddings: each position has a fixed learned vector added to the token embedding.✓ SWA Sliding Window Attention: each token only attends within a fixed local window (Mistral, gemma-2 use this).✓ SSM (Mamba) State Space Model: a sequence layer that maintains internal state instead of attention (Mamba, Jamba use this).✓ Any HuggingFace public model
⏳ Loading Python runtime...
🎯 Mode7 modes available. Most users want 📇 Profile (one-click full diagnosis). 📇 Profile: paste a model id → 5-recipe TAF Card. 🆚 Compare: 2-3 models side-by-side on one recipe. 🔍 Inspect: paste raw config.json to debug parameters. 💬 Ask: free-form question, browser LLM picks the recipe. 📋 Recipe: manual selection with full form control. 🩺 Diagnose CLI: generate Python command to measure γ on real weights. 📊 Phase diagram: explore 23 panel models on (log θ, γ) plane.
Quickest start: paste any HuggingFace model id (e.g. meta-llama/Meta-Llama-3-8B),
click Profile. See all 5 recipes scored in seconds.
💡 Quick start: pick any preset → click Generate. Or paste a model id from HF Hub trending → 📥 Fetch → Generate.
📇 Profile a modelOne-click full diagnosis. Paste any HF model id (or pick preset).
Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget,
hardware) and produces a single TAF Card showing verdict per
dimension + key numbers + architecture classification.
Use case: "I'm evaluating Qwen2.5-32B for production —
what's its full viability profile?" → paste id → Profile → done.
For technicians: when you need a complete viability snapshot
of a candidate model. Outputs match paper §sec:gamma_decomposition format.
💡 Use case: you have a private model not on HF Hub, or a config you're designing. Paste the raw JSON below and get a full TAF profile.
🔍 Architecture InspectorPaste any config.json directly. Tool parses it and runs the full Profile.
Useful for: private models, in-development configs, models not yet on HuggingFace,
or comparing what your custom architecture would do.
Paste the raw config.json contents. The tool extracts the architectural
parameters and runs the full 5-recipe Profile.
💡 Try: paste 3 popular 7-8B models (Meta-Llama-3-8B, Mistral-7B-v0.1, Qwen/Qwen2.5-7B), pick recipe X-2, T_eval=16000. See which best handles long context.
🆚 Compare models side-by-sideSame recipe, multiple models. Pick 2-3 candidate models and
one recipe. See verdicts in a single comparison table.
Use case: "I need long-context retrieval at 16K — which is
best: Llama-3-8B, Mistral-7B, or Qwen-7B?" → pick 3 + X-2 + 16K → see winner.
For technicians: when choosing between 2-3 candidate models for
a specific deployment scenario. Compare their verdicts on the same recipe.
For X-2 / X-19 only. The context length all compared models will be
evaluated at. Other recipes use their own params.
Models to compare (add up to 3)
❓ Your question
🩺 Diagnose CLI Command BuilderMeasure γ_obs (not predict). The browser tool predicts γ from
config alone (Padé). To measure the actual decay on a real model
you need GPU + Python. This builder produces the exact CLI command you
run locally; the script is shipped in this repository at
cli/diagnose_model.py.
Output: γ_obs, R², phase, KV cache budget D_90, KL anomaly,
full thermodynamic profile (Z, U, S, F, C_V, χ). Saved as JSON.
Pick options below and copy-paste the generated command on your local
machine (Python + transformers + numpy). Total wall time ≈ 5 min in
--fast mode on CPU; full mode 20–60 min on GPU.
Generated command:
Next steps:
(1) git clone https://github.com/karlesmarin/tafagent
(2) cd tafagent && pip install torch transformers numpy
(3) Run the command above.
(4) Result JSON lands in ./diagnose_results/ — upload it
to the 📋 Pick recipe mode (or paste in 🔍 Inspect config) for full TAF analysis.
📊 Phase diagram (γ × θ)
Each dot is one model from the paper's empirical panel
(data/master_gamma_results.json). The x-axis is RoPE base θ
on log scale; y-axis is measured γ.
The Hagedorn line γ=1 separates Phase A (γ<1, global) from
Phase B (γ>1, local-collapsed).
Hover dots for details; click to populate the recipe form.
23 models in the panel; the Padé curve (line) is
γ_pred(θ) = (2θ−T√2)/(2θ+T√2) at T=2000.
🪟 Context Unmasker
Paste a HuggingFace model id (or raw config.json). The tool checks for
sliding-window attention, RoPE scaling (YaRN/linear/dynamic NTK), and
GQA — anything that makes max_position_embeddings larger
than the practical effective context. Mistral-7B-v0.1 is the canonical
example: declared 32k, attends within ~4-8k.
Are you about to spend money on a model that won't actually attend that far? Paste an id and find out in 1 second. No GPU, no inference — just config.json arithmetic.
Or paste raw config.json (private / in-dev models)
📜 Chat-template Sniffer
Paste an HF model id (or raw tokenizer_config.json). Detects the
chat-template family (Llama-3, ChatML, Mistral, Gemma, Phi-3,
Alpaca, DeepSeek, custom) and gives you the exact framework command
to use it correctly. lm-eval-harness silently halves accuracy if you
forget to apply it (issue #1841).
Did you forget --apply_chat_template? Most multi-turn evals fail by ~50% because the chat template wasn't applied. Paste a model id, get the exact CLI flag for your stack.
Or paste raw tokenizer_config.json (private models)
🎯 Arena-Elo CI Reconstructor
Chatbot Arena strips confidence intervals from the public leaderboard.
A 5-Elo gap can be statistically meaningless. Paste raw vote data
(model_a, model_b, winner) — the tool computes Bradley-Terry MLE +
bootstrap CIs and lists statistical ties (CI overlap).
Is GPT-4 actually better than Claude — or are they tied? Paste pairwise vote CSV (or click Load sample). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser.
🧪 Contamination Prior
Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated.
Should you trust your model's MMLU score? Enter the model's training cutoff date — the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.
📋 Recipe
🎯 Inputs
📊 Verdict
🔍 Computation Chain
Every number below is deterministic Python. Click a step to expand.
💬 Plain-English Answer
📇 TAF Card — full model profile
🆚 Comparison Table
📂 Import a shared TAF result
Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally.
Same view as if you'd run it yourself.
🌐 Recent community submissions
Live feed from the public registry. Click any submission to view full analysis.
Browse all →
Loading...
🔬 Paper predictions — falsification status
The TAF framework rests on falsifiable predictions (F1-F23). Each is empirically tested.
Here's the live status of every prediction in the paper.