Diagnose any transformer LLM in 30 seconds. Free. No GPU. No signup.
Predicts whether a model will work for your use case before you spend money or time. Everything runs in your browser — your inputs never leave this tab.
Built by an independent researcher. Open source. Not affiliated with any model vendor.
TAF Agent: User Manual
What does it do?
Predicts the practical viability of any transformer LLM
before you spend GPU/$. Answers questions like "will this model work at L=32K?" or
"should I train custom or use API?" using deterministic Python formulas (TAF: Thermodynamic Attention Framework).
How to use: 7 modes
Profile: paste model id → all recipes at once = TAF Card. Best starting point.
Compare: 2-3 models side-by-side on same recipe. Best when choosing between candidates.
Inspect config: paste raw config.json → tool parses + runs full Profile. For private models, in-development configs, or models not yet on HF Hub.
Ask plain English: free-form question, in-browser LLM picks the recipe. Best for casual exploration.
Recipe + form: manual selection, full parameter control. Best when you want exact control.
Phase diagram: scatter plot of 23 panel models on the (log θ, γ) plane. Hagedorn line γ=1 separates Phase A from Phase B. Click a dot to load that model into Recipe form.
The 8 recipes available
X-1 Custom training vs API: compares cost of training your own model vs paying for API access.
Try: "Should I train an 8B custom model or use GPT-4o for 50M tokens/month?"
Answer types: YES (custom) / NO (API) with break-even months.
X-2 Long Context Viability: predicts if a model serves a target context length reliably.
X-3 Budget pre-flight: given $ budget, what model is feasible to train?
Try: "I have $5000, what model can I train?"
Answer: GO / TINY-MODEL / MEMORY-LIMITED with concrete N (params) and D (tokens).
X-5 Hardware selection: which GPU should I use to serve at target throughput?
Try: "Cheapest hardware to serve Llama-3-8B at 10M tokens/day"
Answer: best GPU + $/Mtok + capacity vs target.
X-19 KV Compression decision: should I use soft decay, hard cutoff, or literature methods?
Try: "How to compress KV cache for Qwen2.5-7B at 32K?"
Answer: USE SOFT DECAY / USE D_f CUTOFF / USE LITERATURE METHODS / USE HARD T_train.
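The break-even idea behind X-1 can be illustrated with a minimal sketch. The function name, the cost model (one-off training cost plus a flat monthly serving cost), and every dollar figure below are assumptions for illustration; the tool's actual X-1 formula is not reproduced here.

```python
import math

def break_even_months(train_cost, custom_monthly_cost, api_price_per_mtok, tokens_per_month):
    """Months until a one-off training cost is repaid by monthly savings (None if never)."""
    api_monthly_cost = api_price_per_mtok * tokens_per_month / 1e6
    monthly_saving = api_monthly_cost - custom_monthly_cost
    if monthly_saving <= 0:
        return None  # the API is never more expensive, so custom never pays off
    return math.ceil(train_cost / monthly_saving)

# Hypothetical inputs: $2k training run, $300/month serving,
# $10 per million API tokens, 50M tokens/month.
months = break_even_months(2_000, 300, 10.0, 50_000_000)
verdict = "NO (API)" if months is None else f"YES (custom), break-even in {months} months"
print(verdict)
```

With these made-up inputs the API bill is $500/month, the saving is $200/month, and the training cost is repaid after 10 months.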
v0.4 (session 29 findings)
What's new in v0.4 (session 29 findings, 2026-04-28): three diagnostic recipes derived from cross-model panel analysis (n=22 LLMs).
X-21 Imprint Purity Diagnostic: predicts γ on RANDOM tokens via ν = −1/(2π); how clean is the model's RoPE prediction?
Try: "How clean is the RoPE prediction on Llama-3-8B?"
Answer: predicted γ_random + purity diagnostic (CLEAN / OVER-IMPRINTED / UNDER-IMPRINTED).
Learned-imprint slope ν = −1/(2π): RoPE rotation period 2π drives a positional bias on weights, proportional to log(N_params). Even random tokens show this scaling. ν is DERIVED, not fitted (empirical err 0.3%).
X-22 Compute-Context Invariant: does γ × log(N²·D) lie in panel band 51.2 ± 16.8? Detects scaling/training anomalies.
Try: "Does Mistral-7B fit the compute-context invariant?"
Answer: K = γ·log(N²·D), z-score, IN-BAND or OUTLIER.
Chinchilla-attention invariant K: γ × log(N²·D) ≈ 51.2 ± 16.8 (CV=0.329). Connects compute scaling and attention exponent into a single dimensionless number.
X-23 IH-Phase Detector: pre- or post-induction-head? Cheap probe via sign(γ_text − γ_random).
Δγ as IH probe: sign(γ_text − γ_random) > 0 ⟺ post-induction-head. Cheaper than running an in-context-learning benchmark.
γ-cluster on famous constants (intriguing, n=4): CodeLlama-13b γ=0.382 ≈ 1−1/φ (golden conjugate, err 0.0003); pythia-1.4b γ=0.705 ≈ 1/√2; Llama-2-7b γ=0.287 ≈ 1−1/√2; Mistral-Nemo γ=0.428 ≈ log_10(e). Caveat: could be coincidence.
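The X-22 and X-23 checks reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the natural logarithm is an assumption (the log base is not stated above), and treating |z| ≤ 1 as IN-BAND is an assumed reading of the "51.2 ± 16.8" band. The example inputs (γ, N, D) are made-up values, not measurements of any model.

```python
import math

K_MEAN, K_STD = 51.2, 16.8  # panel band quoted for X-22

def compute_context_invariant(gamma, n_params, n_tokens):
    """K = gamma * log(N^2 * D), with its z-score against the panel band."""
    k = gamma * math.log(n_params ** 2 * n_tokens)  # natural log: an assumption
    z = (k - K_MEAN) / K_STD
    verdict = "IN-BAND" if abs(z) <= 1.0 else "OUTLIER"  # 1-sigma cut: an assumption
    return k, z, verdict

def ih_phase(gamma_text, gamma_random):
    """X-23 probe: a positive gamma_text - gamma_random means post-induction-head."""
    return "post-induction-head" if gamma_text - gamma_random > 0 else "pre-induction-head"

# Illustrative inputs only: gamma = 0.7, N = 8e9 parameters, D = 1.5e13 tokens.
k, z, verdict = compute_context_invariant(0.7, 8e9, 1.5e13)
print(f"K = {k:.2f}, z = {z:+.2f}, {verdict}")
print(ih_phase(gamma_text=0.70, gamma_random=0.55))
```

Under these assumptions the example lands close to the panel mean (K is about 53.2), so it is reported as IN-BAND.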
v0.4: New diagnostics (session 31)
Four new diagnostic functions derived in session 31 (2026-04-30) from cross-of-crosses formula games + Socratic interrogation. Available in taf_browser.py §33.
v0.6 (2026-05-06): three new diagnostics live in the TAF Card under Diagnostics. All run in your browser; γ_observed comes from the Diagnose CLI on real weights.
TAF Card layout (new in v0.6)
After clicking Generate full profile the card shows: a hero strip on top (architecture class + meta + 3 pills: aggregate verdict, γ headline, and Anti-Ising if Phase A) and four expandable sections: Recipes (open by default; verdict per dimension), Diagnostics (key numbers, γ predicted vs observed, what-if explorer), Verification (Sage+Lean algebraic consistency, falsification F1-F23), and Provenance & share (calibration audit + JSON download / share link / registry submit). Click any header to expand. Every variable has an inline tooltip.
Preset list: 11 popular models curated. Just select from dropdown.
HF Hub fetch: paste any model id (e.g. Qwen/Qwen2.5-32B-Instruct),
click Fetch. Browser downloads config.json directly from HuggingFace and fills the form. Works for any public model.
Manual: fill the form fields directly with values from the model card.
v0.7: Anti-bullshit pack (4 new modes)
v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference, only metadata + math.
Context Unmasker
Detects when max_position_embeddings is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. Use case: before paying GPU for 32k context, verify the model actually attends that far.
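A rough sketch of this kind of config.json arithmetic follows. The config keys (max_position_embeddings, sliding_window, rope_scaling) are standard HuggingFace fields, but the function name, the ratio thresholds, and the rule of taking the sliding window as the effective context are assumptions for illustration; the tool's real rules (including its handling of small d_head + GQA) are not reproduced here.

```python
def unmask_context(config):
    """Compare the declared context with a crude estimate of the effective one."""
    declared = config["max_position_embeddings"]
    window = config.get("sliding_window")
    scaling = config.get("rope_scaling") or {}
    # Older configs use "type", newer ones "rope_type".
    scaling_type = scaling.get("rope_type") or scaling.get("type")

    if scaling_type == "yarn":
        return "YARN-EXTENDED", declared

    # Assumption: with sliding-window attention, the window bounds the effective context.
    effective = min(declared, window) if window else declared
    ratio = declared / effective
    if ratio >= 4:          # hypothetical threshold
        return "SEVERELY INFLATED", effective
    if ratio > 1:
        return "INFLATED", effective
    return "HONEST", effective

# Config fragment modelled on a 32k-declared model with a 4096-token sliding window.
print(unmask_context({"max_position_embeddings": 32768, "sliding_window": 4096}))
print(unmask_context({"max_position_embeddings": 8192}))
```

The verdict labels come from the list above; which label the tool actually assigns to a given model depends on its own thresholds.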
Chat-template Sniffer
Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting --apply_chat_template silently halves multi-turn accuracy. Use case: before reporting a benchmark score, confirm you applied the template correctly.
Arena-Elo CI Reconstructor
Chatbot Arena strips confidence intervals from its public leaderboard; a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a "statistical ties" panel listing pairs whose CIs overlap. Try the Load sample button. Use case: before declaring "model A beats model B", verify their CIs don't overlap.
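The pipeline described here (Bradley-Terry fit, bootstrap, CI-overlap ties) can be sketched in plain Python. This is one standard way to do it, not necessarily the tool's implementation: the MM update, the 0.01 pseudo-win that keeps strengths positive, the Elo anchor of 1000, and the assumption that `winner` holds the winning model's name (no ties) are all choices made for this example. The vote counts are synthetic.

```python
import math
import random
from collections import defaultdict

def bradley_terry(votes, iters=100):
    """Fit Bradley-Terry strengths with the MM update; votes are (a, b, winner) tuples."""
    wins, games, models = defaultdict(float), defaultdict(float), set()
    for a, b, winner in votes:
        models.update((a, b))
        wins[winner] += 1
        games[(a, b)] += 1
        games[(b, a)] += 1
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(games[(i, j)] / (p[i] + p[j]) for j in models if j != i)
            new[i] = (wins[i] + 0.01) / denom  # pseudo-win avoids log10(0) below
        total = sum(new.values())
        p = {m: v * len(models) / total for m, v in new.items()}
    return {m: 1000 + 400 * math.log10(v) for m, v in p.items()}

def bootstrap_ci(votes, n_boot=200, seed=0):
    """95% percentile intervals from resampling the votes with replacement."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = [rng.choice(votes) for _ in votes]
        for m, elo in bradley_terry(resample).items():
            samples[m].append(elo)
    ci = {}
    for m, xs in samples.items():
        xs.sort()
        ci[m] = (xs[int(0.025 * (len(xs) - 1))], xs[int(0.975 * (len(xs) - 1))])
    return ci

def statistical_ties(ci):
    """Pairs of models whose confidence intervals overlap."""
    names = sorted(ci)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if ci[a][0] <= ci[b][1] and ci[b][0] <= ci[a][1]]

# Synthetic votes: 50 per pair.
votes = ([("A", "B", "A")] * 30 + [("A", "B", "B")] * 20
         + [("A", "C", "A")] * 35 + [("A", "C", "C")] * 15
         + [("B", "C", "B")] * 26 + [("B", "C", "C")] * 24)
elo = bradley_terry(votes)
ci = bootstrap_ci(votes)
for m in sorted(elo, key=elo.get, reverse=True):
    print(f"{m}: {elo[m]:.0f}  95% CI [{ci[m][0]:.0f}, {ci[m][1]:.0f}]")
print("ties:", statistical_ties(ci))
```

Only Elo differences are meaningful here; the absolute level depends on the arbitrary anchor.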
Contamination Prior
Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. Use case: decide which scores to trust when comparing two models.
Quant-regime Classifier
Predicts γ-shift and ΔPPL for any (model × quant scheme: NF4, AWQ, GPTQ, GGUF Q4_K_M / Q5_K_M / Q8_0, int8, FP8, …). Architecture-aware: small d_head + aggressive GQA → more sensitive; calibrated schemes (AWQ) absorb shift better than uncalibrated (NF4). Recommends safer alternatives if a cliff is detected. Use case: before quantizing, predict whether your specific architecture × scheme combo will keep PPL acceptable, with a concrete switch-to suggestion otherwise.
Cross-framework Drift Bound
Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the "Load sample" button for the canonical chat-template bug. Use case: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.
Benchmark Saturation Detector
MMLU is saturated (top scores 88-94%), AIME 2025 saturated within months of release, and HumanEval is near-saturated. Pick any benchmark and the tool returns top-3 frontier scores, spread, mean, and a verdict (saturated / near-saturated / discriminative) plus a recommended replacement (e.g. MMLU → MMLU-Pro / GPQA / HLE). Live fetch from DemandSphere AI Frontier Tracker (CC BY-NC 4.0) when reachable; baked 2026-05-05 snapshot when not. Use case: before you cite '92% on MMLU' or design an eval, check whether the benchmark still discriminates anything.
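The top-3 / spread / mean summary can be sketched as follows. The function name and all three thresholds are hypothetical, chosen only so the example is concrete; the scores are made up and do not come from the tracker.

```python
from statistics import mean

def saturation_verdict(scores, saturated_mean=88.0, near_mean=80.0, max_spread=5.0):
    """Summarise the top-3 scores (in %) and classify the benchmark."""
    top3 = sorted(scores, reverse=True)[:3]
    spread = top3[0] - top3[-1]
    avg = mean(top3)
    if avg >= saturated_mean and spread <= max_spread:
        verdict = "saturated"
    elif avg >= near_mean:
        verdict = "near-saturated"
    else:
        verdict = "discriminative"
    return {"top3": top3, "spread": spread, "mean": avg, "verdict": verdict}

# Made-up frontier scores for two imaginary benchmarks.
print(saturation_verdict([93.1, 92.4, 91.8, 88.0])["verdict"])
print(saturation_verdict([62.0, 48.5, 41.0])["verdict"])
```

A tight cluster near the ceiling is what makes a benchmark uninformative, which is why the sketch looks at both the mean and the spread of the top scores.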
Solutions Hub
tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. Use case: 'I have problem X: does tafagent solve it, and if not, who does?'
The audit chain
Every result shows the full Computation Chain: each formula step with its inputs,
output, and interpretation. Click any step to expand. Cited section numbers (§26.1, §19.1, etc.) refer
to the underlying paper for derivation.
The plain-English answer
After the deterministic chain runs, an in-browser LLM (Qwen2.5-0.5B, ~350MB cached after first load)
synthesizes a plain-English summary. The numbers above are always correct (deterministic Python);
the synthesis is LLM-generated, so verify against the chain if in doubt.
Common parameters explained
θ (rope_theta): RoPE base frequency. Higher = more long-range capacity. Typical: 10000 (early), 500000 (Llama-3), 1000000 (Qwen2.5).
T_train: max context the model was trained on. From max_position_embeddings.
T_eval: your target inference context length. The key knob.
n_kv_heads < n_attention_heads: model uses GQA (Grouped Query Attention). Reduces KV memory but pushes γ toward Hagedorn.
has_SWA: model uses Sliding Window Attention (Mistral, gemma-2).
n_params: total parameter count. Threshold ~400M for induction-head emergence.
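Most of these parameters can be read straight from a HuggingFace config.json. The sketch below assumes the usual key names (rope_theta, max_position_embeddings, num_attention_heads, num_key_value_heads, sliding_window, hidden_size, head_dim); the function name and the fallback defaults are illustrative, and some architectures name these fields differently.

```python
def describe_attention(config):
    """Derive the form parameters listed above from a config.json-style dict."""
    n_heads = config["num_attention_heads"]
    # If the key is absent, assume plain multi-head attention (one KV head per query head).
    n_kv_heads = config.get("num_key_value_heads", n_heads)
    return {
        "theta": config.get("rope_theta", 10000.0),  # assumed default when absent
        "T_train": config.get("max_position_embeddings"),
        "uses_GQA": n_kv_heads < n_heads,
        "has_SWA": config.get("sliding_window") is not None,
        "d_head": config.get("head_dim", config["hidden_size"] // n_heads),
    }

# Fragment modelled on a Llama-3-8B-style config.
cfg = {
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "rope_theta": 500000.0,
    "max_position_embeddings": 8192,
}
print(describe_attention(cfg))
```

Here 8 KV heads against 32 query heads flags the model as GQA, and the missing sliding_window key leaves has_SWA false.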
What to look for in verdicts
YES / GO: proceed with confidence; numbers support the choice.
DEGRADED / TINY-MODEL: works but with caveats; read the action.
NO / MEMORY-LIMITED: don't proceed as-is; mitigation provided.
Privacy
Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model
runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.
Cardy ΔH: entropy shift between observed and nominal context
Falsification dashboard: checks 23 specific predictions (F1-F23)
Algebraic consistency: 8 mathematical identities the model must satisfy
Formally verified math
37 theorems machine-proven in Lean 4 + Mathlib4
Click any badge → opens the source line on GitHub
Verify yourself: lake build (≈5 s after cache fetch)
Export & share
JSON · Markdown · LaTeX (paper-ready)
Reproducible share link (state encoded in URL)
Submit to community registry on GitHub
v0.7 anti-bullshit pack
Unmask: config.json claims 32k? See if it actually attends that far
Chat-template: exact CLI flag so lm-eval doesn't silently halve your accuracy
Arena CI: recover the confidence intervals Chatbot Arena hides
Contamination: rate 20+ benchmarks for contamination probability
Quant: predict γ shift + ΔPPL for any (model × quant scheme) combo
Drift: bug or noise? Predict max admissible gap between two evals
NIAH→Reason: does your "128k context" actually reason there, or just retrieve?
Saturation: is your benchmark still useful, or are all frontier models tied at the top?
Solutions Hub: every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent; find.
Architectures supported (click to expand)
RoPE-MHA (Multi-Head Attention): each token position attends through several parallel heads at once.
RoPE-GQA (Grouped Query Attention): queries share fewer keys/values than heads (saves memory but pushes γ toward Hagedorn).
ALiBi (Attention with Linear Biases): position info is a fixed per-head linear bias added to attention scores, no rotation.
AbsPE (Absolute Position Embeddings): each position has a fixed learned vector added to the token embedding.
SWA (Sliding Window Attention): each token only attends within a fixed local window (Mistral, gemma-2 use this).
SSM (Mamba) (State Space Model): a sequence layer that maintains internal state instead of attention (Mamba, Jamba use this).
Any HuggingFace public model
What do you want to do?
Pick a task. Each one opens the right tool below. Or scroll down for the full list of 14 modes.
Diagnose a model
Start here when you have a specific model id and want a full diagnostic: Profile runs all 5 recipes at once. Unmask checks if max_position_embeddings is honest. NIAH→Reason predicts retrieval-vs-reasoning gap. Quant predicts whether quantizing will break it. Inspect lets you paste raw config.json for private/in-dev models.
Will this specific model work for my use case?
Trust a benchmark score
When you see a score and want to know if it's real. Contamination rates 20+ benchmarks for likelihood the model saw them during training. Drift tells you if a gap between two evals is numerical noise or a real bug (chat-template mismatch, KV-cache layout, etc.). Arena CI reconstructs the confidence intervals Chatbot Arena hides; many top-Elo "wins" are statistically tied.
Should I believe this number? Bug or noise?
Set up an eval correctly
Before you run lm-eval-harness or vLLM serve, get the right CLI flag. Chat-template Sniffer detects the template family (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) and emits the exact --apply_chat_template / --chat-template invocation. Solves issue #1841 in lm-eval-harness (silent ÷2 accuracy). Diagnose CLI generates the Python command to measure γ_obs on your local GPU.
Get the exact CLI flag for lm-eval / vLLM / transformers.
Side-by-side, or browse the empirical model landscape.
Manual / free-form
Recipe: pick a specific X-N recipe (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 KV compression, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) and fill the form by hand for full control. Ask: type a free-form question; an in-browser 0.5B LLM (Qwen2.5) picks the right recipe and runs it. Best for "what would happen if..." exploration.
Pick a specific recipe by hand, or ask in plain English.
Quickest start: paste any HuggingFace model id (e.g. meta-llama/Meta-Llama-3-8B),
click Profile. See all 5 recipes scored in seconds.
Quick start: pick any preset → click Generate. Or paste a model id from HF Hub trending → Fetch → Generate.
Profile a model
One-click full diagnosis. Paste any HF model id (or pick preset).
Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget,
hardware) and produces a single TAF Card showing verdict per
dimension + key numbers + architecture classification.
Use case: "I'm evaluating Qwen2.5-32B for production.
What's its full viability profile?" → paste id → Profile → done.
For technicians: when you need a complete viability snapshot
of a candidate model. Outputs match paper §sec:gamma_decomposition format.
Use case: you have a private model not on HF Hub, or a config you're designing. Paste the raw JSON below and get a full TAF profile.
Architecture Inspector
Paste any config.json directly. Tool parses it and runs the full Profile.
Useful for: private models, in-development configs, models not yet on HuggingFace,
or comparing what your custom architecture would do.
Paste the raw config.json contents. The tool extracts the architectural
parameters and runs the full 5-recipe Profile.
Try: paste 3 popular 7-8B models (Meta-Llama-3-8B, Mistral-7B-v0.1, Qwen/Qwen2.5-7B), pick recipe X-2, T_eval=16000. See which best handles long context.
Compare models side-by-side
Same recipe, multiple models. Pick 2-3 candidate models and
one recipe. See verdicts in a single comparison table.
Use case: "I need long-context retrieval at 16K. Which is
best: Llama-3-8B, Mistral-7B, or Qwen-7B?" → pick 3 + X-2 + 16K → see winner.
For technicians: when choosing between 2-3 candidate models for
a specific deployment scenario. Compare their verdicts on the same recipe.
For X-2 / X-19 only. The context length all compared models will be
evaluated at. Other recipes use their own params.
Output: γ_obs, R², phase, KV cache budget D_90, KL anomaly,
full thermodynamic profile (Z, U, S, F, C_V, χ). Saved as JSON.
Pick options below and copy-paste the generated command on your local
machine (Python + transformers + numpy). Total wall time ≈ 5 min in
--fast mode on CPU; full mode 20-60 min on GPU.
Generated command:
Next steps:
(1) git clone https://github.com/karlesmarin/tafagent
(2) cd tafagent && pip install torch transformers numpy
(3) Run the command above.
(4) Result JSON lands in ./diagnose_results/; upload it
to the Pick recipe mode (or paste in Inspect config) for full TAF analysis.
Phase diagram (γ × θ)
Each dot is one model from the paper's empirical panel
(data/master_gamma_results.json). The x-axis is RoPE base θ
on log scale; y-axis is measured γ.
The Hagedorn line γ=1 separates Phase A (γ<1, global) from
Phase B (γ>1, local-collapsed).
Hover dots for details; click to populate the recipe form.
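The phase split and the plot coordinates amount to a comparison against γ=1 and a logarithm of θ. A minimal sketch, assuming base-10 for the log axis (the base is not stated above) and using hypothetical function names and example values:

```python
import math

def phase(gamma):
    """Phase A below the Hagedorn line gamma = 1, Phase B above it."""
    if gamma < 1:
        return "A"
    if gamma > 1:
        return "B"
    return "Hagedorn line"

def diagram_point(theta, gamma):
    """(x, y, phase) for one model; x is log10 of the RoPE base (assumed base 10)."""
    return math.log10(theta), gamma, phase(gamma)

# Illustrative values, not taken from the panel data.
print(diagram_point(10_000, 0.7))
print(diagram_point(500_000, 1.3))
```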
Context Unmasker
Paste a HuggingFace model id (or raw config.json). The tool checks for
sliding-window attention, RoPE scaling (YaRN/linear/dynamic NTK), and
GQA: anything that makes max_position_embeddings larger
than the practical effective context. Mistral-7B-v0.1 is the canonical
example: declared 32k, attends within ~4-8k.
Are you about to spend money on a model that won't actually attend that far? Paste an id and find out in 1 second. No GPU, no inference, just config.json arithmetic.
Or paste raw config.json (private / in-dev models)
Chat-template Sniffer
Paste an HF model id (or raw tokenizer_config.json). Detects the
chat-template family (Llama-3, ChatML, Mistral, Gemma, Phi-3,
Alpaca, DeepSeek, custom) and gives you the exact framework command
to use it correctly. lm-eval-harness silently halves accuracy if you
forget to apply it (issue #1841).
Did you forget --apply_chat_template? Most multi-turn evals fail by ~50% because the chat template wasn't applied. Paste a model id, get the exact CLI flag for your stack.
Or paste raw tokenizer_config.json (private models)
Arena-Elo CI Reconstructor
Chatbot Arena strips confidence intervals from the public leaderboard.
A 5-Elo gap can be statistically meaningless. Paste raw vote data
(model_a, model_b, winner) → the tool computes Bradley-Terry MLE +
bootstrap CIs and lists statistical ties (CI overlap).
Is GPT-4 actually better than Claude, or are they tied? Paste pairwise vote CSV (or click Load sample). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser.
Contamination Prior
Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated.
Should you trust your model's MMLU score? Enter the model's training cutoff date → the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.
Quant-regime Classifier
Predicts γ-shift (and downstream ΔPPL) for a given (model × quant scheme).
Generic claims like "AWQ ~95% retention" are too vague; TAF uses
d_head, GQA ratio, SWA flag, and model size to give an architecture-specific
verdict. Solves: HF community widely reports unpredictable quant cliffs
(NF4 -2 PPL on Phi-3 but fine on Llama-3-8B).
Will quantizing your model break it? Paste an HF model id, pick a quant scheme → get predicted γ-shift, expected ΔPPL band, and a recommended alternative if it's a cliff. Browser-only, no GPU, no calibration set required.
Cross-framework Drift Bound
Same model, different scores on different setups. Is the gap noise or
a real bug? Enter two scores with their (framework, dtype, batch,
chat-template) → tool predicts the maximum allowable drift from
numerical noise alone. If observed gap exceeds it → real bug, usually
chat-template mismatch (lm-eval issue #1841) or KV-cache layout.
Your model gives 67.2 on lm-eval-hf and 65.1 on vLLM-served. Bug or noise? Enter both scores with (framework, dtype, batch, chat-template applied?). Tool predicts the noise band and flags real bugs. arXiv 2506.09501 documents this as a major eval reproducibility problem.
NIAH → Reasoning Gap
NIAH (Needle in a Haystack) tests retrieval: "find this fact in long text". Multi-hop reasoning tests inference: "combine facts X+Y at the start with fact Z at the end". RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.
Your model claims 128k context. Will it actually reason at 64k, or just retrieve? Paste an HF model id and a target eval context → tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a "safe context" where reasoning stays ≥65%.
Benchmark Saturation Detector
MMLU is saturated (88-94% across all frontier models). Reporting "92% on MMLU" is now meaningless. This tool tells you which benchmarks still discriminate frontier models, which are saturated, and what to use instead. Data: DemandSphere AI Frontier Tracker (CC BY-NC 4.0), refreshed 2026-05.
Is your benchmark still useful? Pick a benchmark to see top-3 frontier scores, spread, and a verdict (saturated / near-saturated / discriminative) plus recommended replacements.
Data: DemandSphere AI Frontier Model Tracker (CC BY-NC 4.0) · HF Open LLM Leaderboard v3 (open-weight historical) · last fetch 2026-05-05.
Solutions Hub
Map of every documented LLM-eval pain we know about: which tafagent mode addresses it (if any), and the best-of-breed external tools the community already trusts. Goal: full coverage. If a canonical tool exists elsewhere, we link rather than rebuild.
Don't reinvent; find. 30+ pains mapped to tafagent modes + curated external tools. Browse by category, search by keyword, or see the gaps where new modes would help most.
Recipe
Inputs
Verdict
Computation Chain
Every number below is deterministic Python. Click a step to expand.
Plain-English Answer
TAF Card: full model profile
Comparison Table
Import a shared TAF result
Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally.
Same view as if you'd run it yourself.
Recent community submissions
Live feed from the public registry. Click any submission to view full analysis.
Browse all →
Paper predictions: falsification status
The TAF framework rests on falsifiable predictions (F1-F23). Each is empirically tested.
Here's the live status of every prediction in the paper.