Diagnose any transformer LLM in 30 seconds. Free. No GPU. No signup.
Predicts whether a model will work for your use case before you spend money or time. Everything runs in your browser — your inputs never leave this tab.
Built by an independent researcher. Open source. Not affiliated with any model vendor.
📘 TAF Agent — User Manual
What does it do?
Predicts practical viability of any transformer LLM
before you spend GPU/$. Answers questions like "will this model work at L=32K?" or
"should I train custom or use API?" using deterministic Python formulas (TAF — Thermodynamic Attention Framework).
How to use — 7 modes
📇 Profile: paste model id → all recipes at once = TAF Card. Best starting point.
🆚 Compare: 2-3 models side-by-side on same recipe. Best when choosing between candidates.
🔍 Inspect config: paste raw config.json → tool parses + runs full Profile. For private models, in-development configs, or models not yet on HF Hub.
💬 Ask plain English: free-form question, in-browser LLM picks the recipe. Best for casual exploration.
📋 Recipe + form: manual selection, full parameter control. Best when you want exact control.
🩺 Diagnose CLI: generate Python command to measure γ on your local machine (transformers + numpy). Fast ≈5 min CPU; full ≈20–60 min GPU. Output JSON re-uploadable via Inspect.
📊 Phase diagram: scatter plot of 23 panel models on (log θ, γ) plane. Hagedorn line γ=1 separates Phase A from Phase B. Click a dot to load that model into Recipe form.
The 8 recipes available
X-1 Custom training vs API — compares cost of training your own model vs paying for API access.
Try: "Should I train an 8B custom model or use GPT-4o for 50M tokens/month?"
Answer types: YES (custom) / NO (API) with break-even months.
X-2 Long Context Viability — predicts if a model serves a target context length reliably.
Try: "Will Meta-Llama-3-8B handle 32000 tokens for retrieval?"
Chains: γ_Padé → decomposition → d_horizon → NIAH ceiling → hallucination → KV memory.
Verdict: YES / DEGRADED / NO with mitigation if needed.
X-3 Budget pre-flight — given $ budget, what model is feasible to train?
Try: "I have $5000, what model can I train?"
Answer: GO / TINY-MODEL / MEMORY-LIMITED with concrete N (params) and D (tokens).
X-5 Hardware selection — which GPU should I use to serve at target throughput?
Try: "Cheapest hardware to serve Llama-3-8B at 10M tokens/day"
Answer: best GPU + $/Mtok + capacity vs target.
X-19 KV Compression decision — should I use soft decay, hard cutoff, or literature methods?
Try: "How to compress KV cache for Qwen2.5-7B at 32K?"
Answer: USE SOFT DECAY / USE D_f CUTOFF / USE LITERATURE METHODS / USE HARD T_train.
— v0.4 (session 29 findings) —
What's new in v0.4 (session 29 findings, 2026-04-28): three diagnostic recipes derived from cross-model panel analysis (n=22 LLMs).
X-21 Imprint Purity Diagnostic — predicts γ on RANDOM tokens via ν=−1/(2π); how clean is the model's RoPE prediction?
Try: "How clean is the RoPE prediction on Llama-3-8B?"
Answer: predicted γ_random + purity diagnostic (CLEAN / OVER-IMPRINTED / UNDER-IMPRINTED).
Learned-imprint slope ν = −1/(2π): RoPE rotation period 2π drives a positional bias on weights, proportional to log(N_params). Even random tokens show this scaling. ν is DERIVED — not fitted (empirical err 0.3%).
X-22 Compute-Context Invariant — does γ × log(N²·D) lie in panel band 51.2 ± 16.8? Detects scaling/training anomalies.
Try: "Does Mistral-7B fit the compute-context invariant?"
Answer: K = γ·log(N²·D), z-score, IN-BAND or OUTLIER.
Chinchilla-attention invariant K: γ × log(N²·D) ≈ 51.2 ± 16.8 (CV=0.329). Connects compute scaling and attention exponent into a single dimensionless number.
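The X-22 check can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the natural logarithm, the |z| ≤ 1 cutoff for IN-BAND, and the function name compute_context_invariant are all assumptions, and the example inputs are made-up round numbers rather than measurements.

```python
import math

# Panel band quoted in the text: K ≈ 51.2 ± 16.8
K_MEAN = 51.2
K_STD = 16.8


def compute_context_invariant(gamma, n_params, n_tokens, z_max=1.0):
    """Return (K, z, label) for K = gamma * log(N^2 * D).

    Assumptions (not confirmed by the source): natural log, and
    IN-BAND meaning |z| <= z_max with z_max = 1.
    """
    k = gamma * math.log(n_params ** 2 * n_tokens)
    z = (k - K_MEAN) / K_STD
    label = "IN-BAND" if abs(z) <= z_max else "OUTLIER"
    return k, z, label


# Illustrative inputs only: gamma = 0.7, N = 7e9 params, D = 1e12 tokens.
k, z, label = compute_context_invariant(gamma=0.7, n_params=7e9, n_tokens=1e12)
print(f"K={k:.1f}  z={z:+.2f}  {label}")
```

If the tool uses a different log base or band width, only the constants and the math.log call change.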
X-23 IH-Phase Detector — pre- or post-induction-head? Cheap probe via sign(γ_text − γ_random).
Δγ as IH probe: sign(γ_text − γ_random) > 0 ⟺ post-induction-head. Cheaper than running an in-context-learning benchmark.
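As a sketch, the probe reduces to a sign test on two measured exponents. The function name ih_phase and the labels POST-IH / PRE-IH are hypothetical; treating Δγ = 0 as pre-induction-head is also an assumption, since the text only defines the strictly positive case.

```python
def ih_phase(gamma_text, gamma_random):
    """Classify induction-head phase from the sign of gamma_text - gamma_random."""
    delta = gamma_text - gamma_random
    # The text states: delta > 0 <=> post-induction-head.
    label = "POST-IH" if delta > 0 else "PRE-IH"
    return delta, label


# Illustrative values, not measurements.
print(ih_phase(gamma_text=0.70, gamma_random=0.55))
```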
γ-cluster on famous constants (intriguing, n=4): CodeLlama-13b γ=0.382 ≈ 1−1/φ (golden conjugate, err 0.0003); pythia-1.4b γ=0.705 ≈ 1/√2; Llama-2-7b γ=0.287 ≈ 1−1/√2; Mistral-Nemo γ=0.428 ≈ log_10(e). Caveat: could be coincidence.
🆕 v0.4 — New diagnostics (session 31)
Four new diagnostic functions derived in session 31 (2026-04-30) from cross-of-crosses formula games + Socratic interrogation. Available in taf_browser.py §33.
Architectural Concentration — γ_text ≈ γ_Padé − 0.012·n_kv. Cross-panel correlational law (R²=0.30). Caveat: not per-model predictor.
PDI — Padé Deviation Index — PDI = d_horizon_obs/T_eval. Traffic light: green (≈1), orange (>>1), yellow (<<1), red (Phase B negative).
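Both formulas above are one-liners; a hedged Python sketch follows. The function names are hypothetical, and the traffic-light thresholds are deliberately left out because the text gives them only qualitatively (≈1, >>1, <<1).

```python
def gamma_text_from_pade(gamma_pade, n_kv_heads, slope=0.012):
    """Panel-level estimate gamma_text ≈ gamma_Padé − 0.012 · n_kv.

    Per the caveat in the text (R² = 0.30), this is a cross-panel trend,
    not a per-model predictor.
    """
    return gamma_pade - slope * n_kv_heads


def pdi(d_horizon_obs, t_eval):
    """Padé Deviation Index: observed horizon divided by evaluation context."""
    return d_horizon_obs / t_eval


# Illustrative inputs only.
print(gamma_text_from_pade(gamma_pade=0.90, n_kv_heads=8))
print(pdi(d_horizon_obs=1800, t_eval=2000))
```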
v0.6 (2026-05-06): three new diagnostics live in the TAF Card under 🔬 Diagnostics. All run in your browser; γ_observed comes from the Diagnose CLI on real weights.
TAF Card layout (new in v0.6)
After clicking 🚀 Generate full profile the card shows: a hero strip on top (architecture class + meta + 3 pills: aggregate verdict ✅/⚠/❌, γ headline, 🧲 Anti-Ising if Phase A) and four expandable sections: 📋 Recipes (open by default — verdict per dimension), 🔬 Diagnostics (key numbers, γ predicted vs observed, what-if explorer), ✓ Verification (Sage+Lean algebraic consistency, falsification F1-F23), 📂 Provenance & share (calibration audit + JSON download / share link / registry submit). Click any header to expand. Every variable has an inline ⓘ tooltip.
γ predicted vs observed
Enter the empirically-measured γ from your model and the tool computes η = θ_eff_obs / θ_eff_Padé and classifies into one of 5 regimes:
Normal (η ∈ [0.85, 1.15]) — model uses its full nominal context. Use case: validate a new release before adopting it.
Fraud (η < 0.01) — nominal θ inflated; model behaves as if θ ≪ advertised. Use case: detect YaRN/marketing inflation (CodeLlama / Mistral-Nemo pattern).
Compressed (0.01 ≤ η < 0.5) — context compressed; model attends shorter than nominal θ. Use case: spot RLHF/instruction-tuning compression (LLaMA-2 pattern).
Over-Padé (η > 1.5) — model attends farther than Padé predicts. Use case: identify Lerch-corrected regime or undertrained early checkpoints (pythia-1b pattern).
SWA random-corpus (γ_obs > 1.05 with random_corpus=Yes) — sliding-window attention signature. Use case: confirm Mistral / Gemma SWA on random tokens.
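The five regimes can be expressed as a small decision function. This is a sketch under stated assumptions: the thresholds come from the list above, but the order of the checks, the function name classify_regime, and the "Unclassified" label for η values that fall between the listed bands (for example 0.5 to 0.85) are guesses about behaviour the text does not specify.

```python
def classify_regime(eta, gamma_obs=None, random_corpus=False):
    """Map eta = theta_eff_obs / theta_eff_Padé to one of the listed regimes."""
    # SWA signature is defined on gamma_obs, so it is checked first (assumption).
    if random_corpus and gamma_obs is not None and gamma_obs > 1.05:
        return "SWA random-corpus"
    if eta < 0.01:
        return "Fraud"
    if eta < 0.5:
        return "Compressed"
    if 0.85 <= eta <= 1.15:
        return "Normal"
    if eta > 1.5:
        return "Over-Padé"
    # Gaps between the published bands are not assigned a regime in the text.
    return "Unclassified"


for eta in (0.005, 0.3, 1.0, 2.0):
    print(eta, classify_regime(eta))
```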
Cardy ΔH diagnostic
ΔH_Cardy = log(θ_eff_obs / θ_nominal). Entropy shift between observed effective θ and nominal θ. Strong negative = compression entropy; near zero = nominal match. Complements η for borderline cases.
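A direct transcription of the formula, assuming the natural logarithm (the text does not state the base); the function name delta_h_cardy is hypothetical.

```python
import math


def delta_h_cardy(theta_eff_obs, theta_nominal):
    """Entropy shift log(theta_eff_obs / theta_nominal); natural log assumed."""
    return math.log(theta_eff_obs / theta_nominal)


# Illustrative: an effective theta at half the nominal value gives a negative shift.
print(delta_h_cardy(theta_eff_obs=5_000, theta_nominal=10_000))
```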
Lean + Mathlib verification badges
TAF identities (Anti-Ising, D-SAGE-1 quadratic, Padé z-substitution, etc.) are formally machine-proven in Lean 4 + Mathlib4. Source: github.com/karlesmarin/lean-taf. Anyone can clone the repository and run lake build to re-verify. The 🧲 Anti-Ising pill in the hero strip is one such badge.
Variable glossary (also embedded in TAF Card)
Every variable in the TAF Card has an inline ⓘ tooltip. The complete list: γ, γ_Padé, γ_decomposed, γ_observed, θ, θ_eff_obs, θ_eff_Padé, η, ΔH_Cardy, χ, d_horizon, L_NIAH, KV memory, regime. Hover any ⓘ for the definition + paper section.
Adding new models (3 ways)
Preset list: 11 popular models curated. Just select from dropdown.
HF Hub fetch: paste any model id (e.g. Qwen/Qwen2.5-32B-Instruct),
click 📥 Fetch. Browser downloads config.json directly from HuggingFace, fills the form. Works for any public model.
Manual: fill the form fields directly with values from the model card.
🆕 v0.7 — Anti-bullshit pack (7 new modes)
v0.7 (2026-05-06): seven new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference — pure metadata + math.
🪟 Context Unmasker
Detects when max_position_embeddings is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. Use case: before paying GPU for 32k context, verify the model actually attends that far.
📜 Chat-template Sniffer
Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting --apply_chat_template silently halves multi-turn accuracy. Use case: before reporting a benchmark score, confirm you applied the template correctly.
🎯 Arena-Elo CI Reconstructor
Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a "statistical ties" panel listing pairs whose CIs overlap. Try the Load sample button. Use case: before declaring "model A beats model B", verify their CIs don't overlap.
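The pipeline described above (Bradley-Terry fit, bootstrap, CI overlap) can be sketched with the Python standard library. This is not the tool's implementation: the MM update used for the fit, the dropping of tie votes, the Elo-like scale 1000 + 400·log10(p), and all function names are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def bradley_terry(votes, models, iters=200):
    """Fit Bradley-Terry strengths with MM updates; return Elo-like scores."""
    wins = defaultdict(float)
    games = defaultdict(float)  # games[(i, j)]: decided comparisons between i and j
    for a, b, winner in votes:
        if winner not in ("model_a", "model_b"):
            continue  # assumption: ties and malformed rows are dropped
        wins[a if winner == "model_a" else b] += 1
        games[(a, b)] += 1
        games[(b, a)] += 1
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(games[(i, j)] / (p[i] + p[j])
                        for j in models if games.get((i, j), 0) > 0)
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {m: v * len(models) / total for m, v in new_p.items()}
    # Clamp so a model with zero wins in a resample does not hit log10(0).
    return {m: 1000 + 400 * math.log10(max(v, 1e-9)) for m, v in p.items()}


def bootstrap_ci(votes, n_boot=200, seed=0):
    """Return {model: (point, lo, hi)} with a 95% percentile bootstrap interval."""
    rng = random.Random(seed)
    models = sorted({m for a, b, _ in votes for m in (a, b)})
    point = bradley_terry(votes, models)
    samples = {m: [] for m in models}
    for _ in range(n_boot):
        fit = bradley_terry(rng.choices(votes, k=len(votes)), models)
        for m in models:
            samples[m].append(fit[m])
    out = {}
    for m in models:
        s = sorted(samples[m])
        out[m] = (point[m], s[int(0.025 * n_boot)], s[int(0.975 * n_boot) - 1])
    return out


def statistical_ties(ci):
    """List model pairs whose confidence intervals overlap."""
    names = sorted(ci, key=lambda m: -ci[m][0])
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if ci[a][1] <= ci[b][2] and ci[b][1] <= ci[a][2]]


# Synthetic votes for illustration: A beats B in 70 of 100 comparisons.
votes = [("A", "B", "model_a")] * 70 + [("A", "B", "model_b")] * 30
ci = bootstrap_ci(votes)
for name, (elo, lo, hi) in ci.items():
    print(f"{name}: {elo:.0f}  [{lo:.0f}, {hi:.0f}]")
print("ties:", statistical_ties(ci))
```

Only Elo differences are meaningful here; the 1000 offset is arbitrary.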
🧪 Contamination Prior
Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. Use case: decide which scores to trust when comparing two models.
⚖️ Quant-regime Classifier
Predicts γ-shift and ΔPPL for any (model × quant scheme: NF4, AWQ, GPTQ, GGUF Q4_K_M / Q5_K_M / Q8_0, int8, FP8, …). Architecture-aware: small d_head + aggressive GQA → more sensitive; calibrated schemes (AWQ) absorb shift better than uncalibrated (NF4). Recommends safer alternatives if a cliff is detected. Use case: before quantizing, predict whether your specific architecture × scheme combo will keep PPL acceptable, with a concrete switch-to suggestion otherwise.
🔀 Cross-framework Drift Bound
Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the "Load sample" button for the canonical chat-template bug. Use case: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.
🔍 NIAH → Reasoning Gap
RULER paper (NVIDIA 2024) shows that long-context models often pass NIAH (needle retrieval) but fail multi-hop reasoning at the same context. Tool predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure: small d_head, GQA, SWA), reports the gap, and finds your model's "safe reasoning context" where reasoning stays ≥65%. Sweep mode shows the curve across 1k/4k/16k/64k/T_train. Use case: before deploying at the claimed context, find out whether the model will actually reason there or just retrieve.
The audit chain
Every result shows the full Computation Chain — each formula step with its inputs,
output, and interpretation. Click any step to expand. Cited section numbers (§26.1, §19.1, etc.) refer
to the underlying paper for derivation.
The plain-English answer
After the deterministic chain runs, an in-browser LLM (Qwen2.5-0.5B, ~350MB cached after first load)
synthesizes a plain-English summary. The numbers above are always correct (deterministic Python);
the synthesis is LLM-generated — verify against the chain if in doubt.
Common parameters explained
θ (rope_theta): RoPE base frequency. Higher = more long-range capacity. Typical: 10000 (early), 500000 (Llama-3), 1000000 (Qwen2.5).
T_train: max context the model was trained on. From max_position_embeddings.
T_eval: your target inference context length. The key knob.
n_kv_heads < n_attention_heads: model uses GQA (Grouped Query Attention). Reduces KV memory but pushes γ toward Hagedorn.
has_SWA: model uses Sliding Window Attention (Mistral, gemma-2).
n_params: total parameter count. Threshold ~400M for induction-head emergence.
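To make the GQA memory effect concrete, here is the standard KV-cache size calculation: two tensors (K and V) per layer, each of shape n_kv_heads × d_head × sequence length. Whether the tool's KV memory step uses exactly this expression is an assumption; the function name kv_cache_bytes is hypothetical. The example uses the published Llama-3-8B shape (32 layers, 8 KV heads, 32 query heads, d_head 128) at fp16.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    """KV-cache size in bytes: K and V tensors for every layer and position."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem


gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=32_000)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=32_000)
print(f"GQA (8 KV heads):  {gqa / 2**30:.2f} GiB")
print(f"MHA (32 KV heads): {mha / 2**30:.2f} GiB")
```

The ratio between the two equals n_attention_heads / n_kv_heads, which is why GQA is attractive for long contexts.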
What to look for in verdicts
YES / GO — proceed with confidence; numbers support the choice.
DEGRADED / TINY-MODEL — works but with caveats; read the action.
NO / MEMORY-LIMITED — don't proceed as-is; mitigation provided.
Privacy
Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model
runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.
Custom train vs API: which is cheaper for your traffic?
Long context: will it handle 32k / 128k tokens reliably?
Budget: with $X, what model can you train from scratch?
Hardware: which GPU to serve N tokens/day?
KV cache: how to compress without breaking quality?
Imprint purity: how clean is the model's positional encoding?
Compute-context: does the model fit the empirical band?
IH-phase: pre- or post-induction-head?
🔬 Diagnostics
γ predicted vs observed — auto-classifies the model into 5 regimes (normal · fraud / inflated context · compressed · over-Padé · sliding-window)
Cardy ΔH — entropy shift between observed and nominal context
Falsification dashboard — checks 23 specific predictions (F1–F23)
Algebraic consistency — 8 mathematical identities the model must satisfy
✓ Formally verified math
37 theorems machine-proven in Lean 4 + Mathlib4
Click any badge → opens the source line on GitHub
Verify yourself: lake build (≈5 s after cache fetch)
📤 Export & share
JSON · Markdown · LaTeX (paper-ready)
Reproducible share link (state encoded in URL)
Submit to community registry on GitHub
🆕 v0.7 anti-bullshit pack
🪟 Unmask — config.json claims 32k? See if it actually attends that far
📜 Chat-template — exact CLI flag so lm-eval doesn't silently halve your accuracy
🎯 Arena CI — recover the confidence intervals Chatbot Arena hides
🧪 Contamination — rate 20+ benchmarks for contamination probability
⚖️ Quant — predict γ shift + ΔPPL for any (model × quant scheme) combo
🔀 Drift — bug or noise? Predict max admissible gap between two evals
🔍 NIAH→Reason — does your "128k context" actually reason there, or just retrieve?
Architectures supported (click to expand)
✓ RoPE-MHA Multi-Head Attention: each token position attends through several parallel heads at once.
✓ RoPE-GQA Grouped Query Attention: queries share fewer keys/values than heads (saves memory but pushes γ toward Hagedorn).
✓ ALiBi Attention with Linear Biases: position info is a fixed per-head linear penalty (slope × distance) added to attention scores, no rotation.
✓ AbsPE Absolute Position Embeddings: each position has a fixed learned vector added to the token embedding.
✓ SWA Sliding Window Attention: each token only attends within a fixed local window (Mistral, gemma-2 use this).
✓ SSM (Mamba) State Space Model: a sequence layer that maintains internal state instead of attention (Mamba, Jamba use this).
✓ Any HuggingFace public model
🎯 What do you want to do?
Pick a task. Each one opens the right tool below. Or scroll down for the full list of 14 modes.
🔬 Diagnose a model
Start here when you have a specific model id and want a full diagnostic: Profile runs all 5 recipes at once. Unmask checks if max_position_embeddings is honest. NIAH→Reason predicts retrieval-vs-reasoning gap. Quant predicts whether quantizing will break it. Inspect lets you paste raw config.json for private/in-dev models.
Will this specific model work for my use case?
✓ Trust a benchmark score
When you see a score and want to know if it's real. Contamination rates 20+ benchmarks for likelihood the model saw them during training. Drift tells you if a gap between two evals is numerical noise or a real bug (chat-template mismatch, KV-cache layout, etc.). Arena CI reconstructs the confidence intervals Chatbot Arena hides — many top-Elo "wins" are statistically tied.
Should I believe this number? Bug or noise?
⚙️ Set up an eval correctly
Before you run lm-eval-harness or vLLM serve, get the right CLI flag. Chat-template Sniffer detects the template family (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) and emits the exact --apply_chat_template / --chat-template invocation. Solves issue #1841 in lm-eval-harness (silent ÷2 accuracy). Diagnose CLI generates the Python command to measure γ_obs on your local GPU.
Get the exact CLI flag for lm-eval / vLLM / transformers.
🆚 Compare models
Compare: pick 2-3 candidate models + one recipe, see verdicts in a side-by-side table (e.g. Llama-3-8B vs Mistral-7B at 32k context). Phase diagram: scatter of 23 empirical models on the (log θ, γ) plane, with the Padé curve overlaid. Hover dots for details, click to load that model into the Recipe form.
Side-by-side, or browse the empirical model landscape.
📋 Manual / free-form
Recipe: pick a specific X-N recipe (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 KV compression, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) and fill the form by hand for full control. Ask: type a free-form question; an in-browser 0.5B LLM (Qwen2.5) picks the right recipe and runs it. Best for "what would happen if..." exploration.
Pick a specific recipe by hand, or ask in plain English.
🎯 Mode
7 modes available. Most users want 📇 Profile (one-click full diagnosis).
📇 Profile: paste a model id → 5-recipe TAF Card.
🆚 Compare: 2-3 models side-by-side on one recipe.
🔍 Inspect: paste raw config.json to debug parameters.
💬 Ask: free-form question, browser LLM picks the recipe.
📋 Recipe: manual selection with full form control.
🩺 Diagnose CLI: generate Python command to measure γ on real weights.
📊 Phase diagram: explore 23 panel models on (log θ, γ) plane.
Quickest start: paste any HuggingFace model id (e.g. meta-llama/Meta-Llama-3-8B),
click Profile. See all 5 recipes scored in seconds.
💡 Quick start: pick any preset → click Generate. Or paste a model id from HF Hub trending → 📥 Fetch → Generate.
📇 Profile a model
One-click full diagnosis. Paste any HF model id (or pick preset).
Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget,
hardware) and produces a single TAF Card showing verdict per
dimension + key numbers + architecture classification.
Use case: "I'm evaluating Qwen2.5-32B for production —
what's its full viability profile?" → paste id → Profile → done.
For technicians: when you need a complete viability snapshot
of a candidate model. Outputs match paper §sec:gamma_decomposition format.
💡 Use case: you have a private model not on HF Hub, or a config you're designing. Paste the raw JSON below and get a full TAF profile.
🔍 Architecture Inspector
Paste any config.json directly. Tool parses it and runs the full Profile.
Useful for: private models, in-development configs, models not yet on HuggingFace,
or comparing what your custom architecture would do.
Paste the raw config.json contents. The tool extracts the architectural
parameters and runs the full 5-recipe Profile.
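The extraction step can be illustrated as follows. This is a sketch, not the Inspector's parser: the function name extract_arch_params is hypothetical, and the key names are the ones used by Llama/Mistral-style HuggingFace configs, so other architectures may store the same quantities under different keys.

```python
import json


def extract_arch_params(config_text):
    """Pull the architecture fields used in this manual out of a config.json string."""
    cfg = json.loads(config_text)
    n_heads = cfg.get("num_attention_heads")
    n_kv = cfg.get("num_key_value_heads", n_heads)  # absent key means plain MHA
    hidden = cfg.get("hidden_size")
    d_head = cfg.get("head_dim") or (hidden // n_heads if hidden and n_heads else None)
    return {
        "theta": cfg.get("rope_theta"),
        "T_train": cfg.get("max_position_embeddings"),
        "n_attention_heads": n_heads,
        "n_kv_heads": n_kv,
        "d_head": d_head,
        "uses_GQA": n_heads is not None and n_kv is not None and n_kv < n_heads,
        "has_SWA": cfg.get("sliding_window") is not None,
    }


# Illustrative config fragment with a Llama-3-8B-like shape.
EXAMPLE_CONFIG = """
{
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "max_position_embeddings": 8192,
  "rope_theta": 500000.0
}
"""
print(extract_arch_params(EXAMPLE_CONFIG))
```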
💡 Try: paste 3 popular 7-8B models (Meta-Llama-3-8B, Mistral-7B-v0.1, Qwen/Qwen2.5-7B), pick recipe X-2, T_eval=16000. See which best handles long context.
🆚 Compare models side-by-side
Same recipe, multiple models. Pick 2-3 candidate models and
one recipe. See verdicts in a single comparison table.
Use case: "I need long-context retrieval at 16K — which is
best: Llama-3-8B, Mistral-7B, or Qwen-7B?" → pick 3 + X-2 + 16K → see winner.
For technicians: when choosing between 2-3 candidate models for
a specific deployment scenario. Compare their verdicts on the same recipe.
For X-2 / X-19 only. The context length all compared models will be
evaluated at. Other recipes use their own params.
Models to compare (add up to 3)
❓ Your question
🩺 Diagnose CLI Command Builder
Measure γ_obs (not predict). The browser tool predicts γ from
config alone (Padé). To measure the actual decay on a real model
you need Python on your own machine (a GPU for full mode; --fast mode runs on CPU).
This builder produces the exact CLI command you run locally; the script is
shipped in this repository at cli/diagnose_model.py.
Output: γ_obs, R², phase, KV cache budget D_90, KL anomaly,
full thermodynamic profile (Z, U, S, F, C_V, χ). Saved as JSON.
Pick options below and copy-paste the generated command on your local
machine (Python + transformers + numpy). Total wall time ≈ 5 min in
--fast mode on CPU; full mode 20–60 min on GPU.
Generated command:
Next steps:
(1) git clone https://github.com/karlesmarin/tafagent
(2) cd tafagent && pip install torch transformers numpy
(3) Run the command above.
(4) Result JSON lands in ./diagnose_results/ — upload it
to the 📋 Pick recipe mode (or paste in 🔍 Inspect config) for full TAF analysis.
📊 Phase diagram (γ × θ)
Each dot is one model from the paper's empirical panel
(data/master_gamma_results.json). The x-axis is RoPE base θ
on log scale; y-axis is measured γ.
The Hagedorn line γ=1 separates Phase A (γ<1, global) from
Phase B (γ>1, local-collapsed).
Hover dots for details; click to populate the recipe form.
23 models in the panel; the Padé curve (line) is
γ_pred(θ) = (2θ−T√2)/(2θ+T√2) at T=2000.
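The curve can be reproduced directly from the formula above. In this sketch the function name gamma_pade is hypothetical; T defaults to 2000 as stated in the text, and the sample θ values are the typical ones listed under "Common parameters explained".

```python
import math


def gamma_pade(theta, T=2000):
    """Padé prediction gamma_pred(theta) = (2θ − T√2) / (2θ + T√2)."""
    a = T * math.sqrt(2)
    return (2 * theta - a) / (2 * theta + a)


for theta in (10_000, 500_000, 1_000_000):
    print(f"theta={theta:>9,}  gamma_pred={gamma_pade(theta):.4f}")
```

For any positive θ the numerator is smaller than the denominator, so the predicted curve stays below the Hagedorn line γ=1 and approaches it as θ grows; it crosses zero at θ = T/√2.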
🪟 Context Unmasker
Paste a HuggingFace model id (or raw config.json). The tool checks for
sliding-window attention, RoPE scaling (YaRN/linear/dynamic NTK), and
GQA — anything that makes max_position_embeddings larger
than the practical effective context. Mistral-7B-v0.1 is the canonical
example: declared 32k, attends within ~4-8k.
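The three checks named above can be sketched as plain config arithmetic. This is an illustration only: the function name context_signals is hypothetical, the key names follow Llama/Mistral-style HuggingFace configs, and no verdict (HONEST / INFLATED / ...) is produced because the text does not give the thresholds.

```python
def context_signals(cfg):
    """Return human-readable reasons why max_position_embeddings may overstate context."""
    signals = []
    declared = cfg.get("max_position_embeddings")
    window = cfg.get("sliding_window")
    if window and declared and window < declared:
        signals.append(f"SWA: window {window} < declared {declared}")
    scaling = cfg.get("rope_scaling")
    if scaling:
        # Older configs use "type", newer ones "rope_type".
        kind = scaling.get("rope_type") or scaling.get("type")
        signals.append(f"RoPE scaling: {kind} (factor {scaling.get('factor')})")
    n_heads = cfg.get("num_attention_heads")
    n_kv = cfg.get("num_key_value_heads", n_heads)
    if n_heads and n_kv and n_kv < n_heads:
        signals.append(f"GQA: {n_kv} KV heads for {n_heads} query heads")
    return signals


# Fields as published in the Mistral-7B-v0.1 config.json.
mistral_like = {
    "max_position_embeddings": 32768,
    "sliding_window": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}
for line in context_signals(mistral_like):
    print(line)
```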
Are you about to spend money on a model that won't actually attend that far? Paste an id and find out in 1 second. No GPU, no inference — just config.json arithmetic.
Or paste raw config.json (private / in-dev models)
📜 Chat-template Sniffer
Paste an HF model id (or raw tokenizer_config.json). Detects the
chat-template family (Llama-3, ChatML, Mistral, Gemma, Phi-3,
Alpaca, DeepSeek, custom) and gives you the exact framework command
to use it correctly. lm-eval-harness silently halves accuracy if you
forget to apply it (issue #1841).
Did you forget --apply_chat_template? Without the chat template, multi-turn accuracy can drop by roughly half (lm-eval-harness issue #1841). Paste a model id, get the exact CLI flag for your stack.
Or paste raw tokenizer_config.json (private models)
🎯 Arena-Elo CI Reconstructor
Chatbot Arena strips confidence intervals from the public leaderboard.
A 5-Elo gap can be statistically meaningless. Paste raw vote data
(model_a, model_b, winner) — the tool computes Bradley-Terry MLE +
bootstrap CIs and lists statistical ties (CI overlap).
Is GPT-4 actually better than Claude — or are they tied? Paste pairwise vote CSV (or click Load sample). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser.
🧪 Contamination Prior
Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated.
Should you trust your model's MMLU score? Enter the model's training cutoff date — the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.
⚖️ Quant-regime Classifier
Predicts γ-shift (and downstream ΔPPL) for a given (model × quant scheme).
Generic claims like "AWQ ~95% retention" are too vague — TAF uses
d_head, GQA ratio, SWA flag, and model size to give an architecture-specific
verdict. Solves: HF community widely reports unpredictable quant cliffs
(NF4 -2 PPL on Phi-3 but fine on Llama-3-8B).
Will quantizing your model break it? Paste an HF model id, pick a quant scheme — get predicted γ-shift, expected ΔPPL band, and a recommended alternative if it's a cliff. Browser-only, no GPU, no calibration set required.
🔀 Cross-framework Drift Bound
Same model, different scores on different setups. Is the gap noise or
a real bug? Enter two scores with their (framework, dtype, batch,
chat-template) — tool predicts the maximum allowable drift from
numerical noise alone. If observed gap exceeds it → real bug, usually
chat-template mismatch (lm-eval issue #1841) or KV-cache layout.
Your model gives 67.2 on lm-eval-hf and 65.1 on vLLM-served. Bug or noise? Enter both scores with (framework, dtype, batch, chat-template applied?). Tool predicts the noise band and flags real bugs. arXiv:2506.09501 documents this as a major eval reproducibility problem.
🔍 NIAH → Reasoning Gap
NIAH (Needle in a Haystack) tests retrieval: "find this fact in long text". Multi-hop reasoning tests inference: "combine facts X+Y at the start with fact Z at the end". RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.
Your model claims 128k context. Will it actually reason at 64k, or just retrieve? Paste an HF model id and a target eval context — tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a "safe context" where reasoning stays ≥65%.
📋 Recipe
🎯 Inputs
📊 Verdict
🔍 Computation Chain
Every number below is deterministic Python. Click a step to expand.
💬 Plain-English Answer
📇 TAF Card — full model profile
🆚 Comparison Table
📂 Import a shared TAF result
Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally.
Same view as if you'd run it yourself.
🌐 Recent community submissions
Live feed from the public registry. Click any submission to view full analysis.
Browse all →
🔬 Paper predictions — falsification status
The TAF framework rests on falsifiable predictions (F1-F23). Each is empirically tested.
Here's the live status of every prediction in the paper.