TAF Agent — Test ANY Transformer LLM in Your Browser

🔬 TAF Agent

Diagnose any transformer LLM in 30 seconds. Free. No GPU. No signup.

Predicts whether a model will work for your use case before you spend money or time. Everything runs in your browser — your inputs never leave this tab.

Built by an independent researcher. Open source. Not affiliated with any model vendor.

⏳ Loading Python runtime...

🎯 What do you want to do?

Pick a task. Each one opens the right tool below. Or scroll down for the full list of 14 modes.

🔬 Diagnose a model Start here when you have a specific model id and want a full diagnostic: Profile runs all 5 recipes at once. Unmask checks if max_position_embeddings is honest. NIAH→Reason predicts retrieval-vs-reasoning gap. Quant predicts whether quantizing will break it. Inspect lets you paste raw config.json for private/in-dev models.

Will this specific model work for my use case?

✓ Trust a benchmark score When you see a score and want to know if it's real. Contamination rates 20+ benchmarks for likelihood the model saw them during training. Drift tells you if a gap between two evals is numerical noise or a real bug (chat-template mismatch, KV-cache layout, etc.). Arena CI reconstructs the confidence intervals Chatbot Arena hides — many top-Elo "wins" are statistically tied.

Should I believe this number? Bug or noise?

⚙️ Set up an eval correctly Before you run lm-eval-harness or vLLM serve, get the right CLI flag. Chat-template Sniffer detects the template family (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) and emits the exact `--apply_chat_template` / `--chat-template` invocation. Solves issue #1841 in lm-eval-harness (silent ÷2 accuracy). Diagnose CLI generates the Python command to measure γ_obs on your local GPU.

Get the exact CLI flag for lm-eval / vLLM / transformers.

🆚 Compare models Compare: pick 2-3 candidate models + one recipe, see verdicts in a side-by-side table (e.g. Llama-3-8B vs Mistral-7B at 32k context). Phase diagram: scatter of 23 empirical models on the (log θ, γ) plane, with the Padé curve overlaid. Hover dots for details, click to load that model into the Recipe form.

Side-by-side, or browse the empirical model landscape.

📋 Manual / free-form Recipe: pick a specific X-N recipe (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 KV compression, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) and fill the form by hand for full control. Ask: type a free-form question; an in-browser 0.5B LLM (Qwen2.5) picks the right recipe and runs it. Best for "what would happen if..." exploration.

Pick a specific recipe by hand, or ask in plain English.

🎯 Mode 7 modes available. Most users want 📇 Profile (one-click full diagnosis).
📇 Profile: paste a model id → 5-recipe TAF Card.
🆚 Compare: 2-3 models side-by-side on one recipe.
🔍 Inspect: paste raw config.json to debug parameters.
💬 Ask: free-form question, browser LLM picks the recipe.
📋 Recipe: manual selection with full form control.
🩺 Diagnose CLI: generate Python command to measure γ on real weights.
📊 Phase diagram: explore 23 panel models on (log θ, γ) plane.

Quickest start: paste any HuggingFace model id (e.g. meta-llama/Meta-Llama-3-8B), click Profile. See all 5 recipes scored in seconds.

📇 Profile a model One-click full diagnosis. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single TAF Card showing verdict per dimension + key numbers + architecture classification.

Use case: "I'm evaluating Qwen2.5-32B for production — what's its full viability profile?" → paste id → Profile → done.

For technicians: when you need a complete viability snapshot of a candidate model. Outputs match paper §sec:gamma_decomposition format.

Preset:

HF model id:

θ (rope_theta) RoPE base frequency from config.rope_theta.

T_train Max training context. From max_position_embeddings.

T_eval (your target)

n_attention_heads Number of attention heads per layer. From num_attention_heads.

n_kv_heads

head_dim Per-head dimension. Typical 64, 96, 128. From head_dim or hidden_size / num_attention_heads.

n_layers Number of transformer blocks. From num_hidden_layers.

n_params (e.g. 8e9)

Has SWA? Sliding Window Attention. true for Mistral, gemma-2, phi-3. Calibration audit (v0.5.3) disabled the historical δ_SWA correction (n=1 fit).

📂 Import a shared TAF result

Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.

🌐 Recent community submissions

Live feed from the public registry. Click any submission to view full analysis. Browse all →

🔬 Paper predictions — falsification status

The TAF framework rests on falsifiable predictions (F1-F23). Each is empirically tested. Here's the live status of every prediction in the paper.

🔬 TAF Agent

🎯 What do you want to do?

🔍 Architecture Inspector Paste any config.json directly. Tool parses it and runs the full Profile. Useful for: private models, in-development configs, models not yet on HuggingFace, or comparing what your custom architecture would do.

🆚 Compare models side-by-side Same recipe, multiple models. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.

Use case: "I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?" → pick 3 + X-2 + 16K → see winner.

Models to compare (add up to 3)

❓ Your question

Generated command:

🎯 Arena-Elo CI Reconstructor Chatbot Arena strips confidence intervals from the public leaderboard. A 5-Elo gap can be statistically meaningless. Paste raw vote data (model_a, model_b, winner) — the tool computes Bradley-Terry MLE + bootstrap CIs and lists statistical ties (CI overlap).

🧭 Solutions Hub Map of every documented LLM-eval pain we know about: which tafagent mode addresses it (if any), and the best-of-breed external tools the community already trusts. Goal: full coverage. If a canonical tool exists elsewhere, we link rather than rebuild.

📋 Recipe

🎯 Inputs

📊 Verdict

🔍 Computation Chain

💬 Plain-English Answer

📇 TAF Card — full model profile

🆚 Comparison Table

📂 Import a shared TAF result

🌐 Recent community submissions

🔬 Paper predictions — falsification status

🎯 What do you want to do?

🔍 Architecture Inspector Paste any config.json directly. Tool parses it and runs the full Profile. Useful for: private models, in-development configs, models not yet on HuggingFace, or comparing what your custom architecture would do.

🆚 Compare models side-by-side Same recipe, multiple models. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table. Use case: "I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?" → pick 3 + X-2 + 16K → see winner.

Models to compare (add up to 3)

❓ Your question

Generated command:

🎯 Arena-Elo CI Reconstructor Chatbot Arena strips confidence intervals from the public leaderboard. A 5-Elo gap can be statistically meaningless. Paste raw vote data (model_a, model_b, winner) — the tool computes Bradley-Terry MLE + bootstrap CIs and lists statistical ties (CI overlap).

🧭 Solutions Hub Map of every documented LLM-eval pain we know about: which tafagent mode addresses it (if any), and the best-of-breed external tools the community already trusts. Goal: full coverage. If a canonical tool exists elsewhere, we link rather than rebuild.

📋 Recipe

🎯 Inputs

📊 Verdict

🔍 Computation Chain

💬 Plain-English Answer

📇 TAF Card — full model profile

🆚 Comparison Table

📂 Import a shared TAF result

🌐 Recent community submissions

🔬 Paper predictions — falsification status

🆚 Compare models side-by-side Same recipe, multiple models. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.

Use case: "I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?" → pick 3 + X-2 + 16K → see winner.