TAF Agent — Test ANY Transformer LLM in Your Browser

🔬 TAF Agent

Diagnose any transformer LLM in 30 seconds. Free. No GPU. No signup.

Predicts whether a model will work for your use case before you spend money or time. Everything runs in your browser — your inputs never leave this tab.

Built by an independent researcher. Open source. Not affiliated with any model vendor.


🎯 Mode: 7 modes available. Most users want 📇 Profile (one-click full diagnosis).
📇 Profile: paste a model id → 5-recipe TAF Card.
🆚 Compare: 2-3 models side-by-side on one recipe.
🔍 Inspect: paste raw config.json to debug parameters.
💬 Ask: free-form question, browser LLM picks the recipe.
📋 Recipe: manual selection with full form control.
🩺 Diagnose CLI: generate a Python command to measure γ on real weights.
📊 Phase diagram: explore the 23 panel models on the (log θ, γ) plane.

Quickest start: paste any HuggingFace model id (e.g. meta-llama/Meta-Llama-3-8B), click Profile. See all 5 recipes scored in seconds.

📇 Profile a model

One-click full diagnosis. Paste any HF model id (or pick a preset). The tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single TAF Card showing a verdict per dimension, the key numbers, and an architecture classification.

Use case: "I'm evaluating Qwen2.5-32B for production — what's its full viability profile?" → paste id → Profile → done.

For technicians: use this mode when you need a complete viability snapshot of a candidate model. Outputs follow the format of the paper's γ-decomposition section.

Preset:

HF model id:

θ (rope_theta): RoPE base frequency. From config.rope_theta.

T_train: Max training context. From max_position_embeddings.

T_eval (your target)

n_attention_heads: Number of attention heads per layer. From num_attention_heads.

n_kv_heads

head_dim: Per-head dimension. Typical values: 64, 96, 128. From head_dim, or hidden_size / num_attention_heads if head_dim is absent.

n_layers: Number of transformer blocks. From num_hidden_layers.

n_params (e.g. 8e9)

Has SWA?: Sliding Window Attention. Set to true for Mistral, gemma-2, phi-3. The calibration audit (v0.5.3) disabled the historical δ_SWA correction, which was fit on a single data point (n=1).

📂 Import a shared TAF result

Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.

🌐 Recent community submissions

Live feed from the public registry. Click any submission to view full analysis. Browse all →

🔬 Paper predictions — falsification status

The TAF framework rests on falsifiable predictions (F1-F23). Each is empirically tested. Here's the live status of every prediction in the paper.


🔍 Architecture Inspector

Paste any config.json directly. The tool parses it and runs the full Profile. Useful for private models, in-development configs, models not yet on HuggingFace, or checking how your custom architecture would behave.
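The mapping from a pasted config.json to the input fields listed earlier can be sketched as follows. This is a minimal illustration, not the tool's actual parser: the function name extract_fields and the returned key names are hypothetical, and the handling of num_key_value_heads and sliding_window (standard Hugging Face config keys) is an assumption about how n_kv_heads and Has SWA? might be filled.

```python
import json

def extract_fields(config_text):
    """Map a Hugging Face style config.json onto the Profile input fields.

    Hypothetical helper for illustration; the real tool may differ.
    """
    cfg = json.loads(config_text)
    n_heads = cfg["num_attention_heads"]
    # head_dim: use the explicit key if present, else hidden_size / num_attention_heads.
    head_dim = cfg.get("head_dim") or cfg["hidden_size"] // n_heads
    return {
        "theta": cfg.get("rope_theta"),
        "T_train": cfg.get("max_position_embeddings"),
        "n_attention_heads": n_heads,
        # Assumption: a missing or null num_key_value_heads means plain multi-head attention.
        "n_kv_heads": cfg.get("num_key_value_heads") or n_heads,
        "head_dim": head_dim,
        "n_layers": cfg.get("num_hidden_layers"),
        # Assumption: a non-null sliding_window key signals Sliding Window Attention.
        "has_swa": cfg.get("sliding_window") is not None,
    }

# Illustrative config in the style of an 8B Llama-family model.
example = json.dumps({
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "num_hidden_layers": 32,
    "rope_theta": 500000.0,
    "max_position_embeddings": 8192,
})
fields = extract_fields(example)
print(fields)
```

T_eval and n_params are not derived here: T_eval is your own target, and n_params is not a field of config.json, so both would still be entered by hand.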

🆚 Compare models side-by-side

Same recipe, multiple models. Pick 2-3 candidate models and one recipe. See the verdicts in a single comparison table.

Use case: "I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?" → pick 3 + X-2 + 16K → see winner.

Models to compare (add up to 3)

❓ Your question

Generated command:

🎯 Arena-Elo CI Reconstructor

Chatbot Arena strips confidence intervals from the public leaderboard, so a 5-Elo gap can be statistically meaningless. Paste raw vote data (model_a, model_b, winner); the tool computes the Bradley-Terry MLE with bootstrap CIs and lists statistical ties (pairs of models whose CIs overlap).
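A rough sketch of that computation in plain Python is shown below. It is not the tool's own code. It assumes votes are (model_a, model_b, winner) tuples, skips rows whose winner is neither model (for example ties), fits Bradley-Terry strengths with the standard minorization-maximization update, and adds a small virtual prior (a hypothetical choice) so that a model with zero wins in a bootstrap resample still gets a finite score. The function names and the 400·log10 Elo-style scaling are illustrative.

```python
import math
import random

def bradley_terry(votes, models, iters=200, prior=0.1):
    """Bradley-Terry fit via MM updates; returns Elo-like scores with mean 0."""
    # Virtual prior: each model gets `prior` wins against every other model.
    wins = {m: prior * (len(models) - 1) for m in models}
    games = {(i, j): 2 * prior for i in models for j in models if i != j}
    for a, b, winner in votes:
        if a == b or winner not in (a, b):
            continue  # assumption: ties and malformed rows are skipped
        wins[winner] += 1
        games[(a, b)] += 1
        games[(b, a)] += 1
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {
            i: wins[i] / sum(games[(i, j)] / (p[i] + p[j]) for j in models if j != i)
            for i in models
        }
        # Fix the scale: normalize so the geometric mean of strengths is 1.
        log_mean = sum(math.log(v) for v in new_p.values()) / len(models)
        p = {m: v / math.exp(log_mean) for m, v in new_p.items()}
    return {m: 400 * math.log10(p[m]) for m in models}

def bootstrap_ci(votes, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CIs: resample votes with replacement and refit."""
    rng = random.Random(seed)
    models = sorted({m for a, b, _ in votes for m in (a, b)})
    samples = {m: [] for m in models}
    for _ in range(n_boot):
        scores = bradley_terry(rng.choices(votes, k=len(votes)), models)
        for m in models:
            samples[m].append(scores[m])
    ci = {}
    for m in models:
        s = sorted(samples[m])
        lo = s[int(alpha / 2 * n_boot)]
        hi = s[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
        ci[m] = (lo, hi)
    return ci

def statistical_ties(ci):
    """Pairs of models whose confidence intervals overlap."""
    names = sorted(ci)
    return [(a, b) for k, a in enumerate(names) for b in names[k + 1:]
            if ci[a][0] <= ci[b][1] and ci[b][0] <= ci[a][1]]

# Made-up votes: A clearly beats B, while B and C are nearly even.
votes = ([("A", "B", "A")] * 90 + [("A", "B", "B")] * 10
         + [("B", "C", "B")] * 52 + [("B", "C", "C")] * 48)
models = sorted({m for a, b, _ in votes for m in (a, b)})
scores = bradley_terry(votes, models)
ci = bootstrap_ci(votes)
ties = statistical_ties(ci)
print(scores, ci, ties)
```

Interval overlap is a conservative tie criterion: two models can have overlapping intervals and still differ significantly in a direct paired comparison, so the tie list is a flag for caution rather than a formal test.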

📋 Recipe

🎯 Inputs

📊 Verdict

🔍 Computation Chain

💬 Plain-English Answer

📇 TAF Card — full model profile

🆚 Comparison Table
