Spaces: karlexmarin/taf-agent

karlexmarin Claude Opus 4.7 (1M context) committed on May 7

Commit fbf3edc

1 Parent(s): 7c80934

v0.8.1 Solutions Hub — integrator portal (30 pains × 65 external tools)

🧭 Solutions Hub mode: every documented LLM-eval pain mapped to
(a) the tafagent mode that addresses it (16 of 30 covered) and
(b) the best-of-breed external tools the community already trusts
(65 curated links across RAGAS, MTEB, HELM, MCP Schema Validator,
llm-stats, llguidance, GlitchMiner, RULER, JSONLint, FastMCP,
LangSmith, TruLens, DeepEval, etc.).
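
The 16-of-30 and 65-link figures can be recomputed from the data file instead of being taken on trust. The following is a minimal sketch, assuming the entry shape used in data/solutions_hub.json. `summarizeCoverage` and the abridged `sampleEntries` are hypothetical names introduced for illustration and are not part of this commit.

```javascript
// Hypothetical helper (not in the commit): recomputes coverage figures
// from an array shaped like the "entries" array in data/solutions_hub.json.
function summarizeCoverage(entries) {
  let covered = 0;   // entries with a non-null tafagent_mode
  let toolLinks = 0; // total number of external_tools items
  for (const entry of entries) {
    if (entry.tafagent_mode !== null && entry.tafagent_mode !== undefined) {
      covered += 1;
    }
    toolLinks += (entry.external_tools || []).length;
  }
  return { pains: entries.length, covered, toolLinks };
}

// Three entries abridged from the data file, for illustration only.
const sampleEntries = [
  {
    id: "saturation",
    tafagent_mode: "📈 Saturation",
    external_tools: [
      { name: "LLM Stats", url: "https://llm-stats.com/", type: "leaderboard" },
    ],
  },
  {
    id: "sandbagging",
    tafagent_mode: null,
    external_tools: [
      { name: "AI Sandbagging paper", url: "https://arxiv.org/abs/2406.07358", type: "paper" },
      { name: "Covert sandbagging vs CoT monitoring", url: "https://www.alphaxiv.org/overview/2508.00943", type: "paper" },
    ],
  },
  { id: "phase_diagram", tafagent_mode: "📊 Phase diagram", external_tools: [] },
];

console.log(summarizeCoverage(sampleEntries));
```

Run against the full entries array, the same function should report the totals quoted above if the data file and this message agree.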

Strategy shift: tafagent as integrator, not silo. If a canonical
solution exists publicly we link, not rebuild. Round-3 + round-4
research (2026-05-07) validated this — 6 of 10 candidate pains
had production-grade tools already (skip build). Hub closes the
loop: users land here, find the right tool, regardless of who
shipped it.

Coverage: 7 categories — eval reliability · diagnostic · setup ·
training · retrieval · multimodal · observability. Each pain entry
has: tafagent_mode (or null/planned), external_tools[]
(name+url+type), best_for, not_for. Tool types: tool / leaderboard /
paper / article / docs / issue / spec / benchmark.
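
The entry shape described above lends itself to a small structural check before the file is shipped. This is a sketch under stated assumptions, not the validation the project uses: `validateEntry`, `HUB_CATEGORIES`, and `HUB_TOOL_TYPES` are hypothetical names, and the rules are inferred from the fields listed in this message (with `not_for` allowed to be null, as one entry in the data file does).

```javascript
// Hypothetical constants mirroring the 7 categories and 8 tool types
// named in this commit message.
const HUB_CATEGORIES = new Set([
  "eval_reliability", "diagnostic", "setup", "training",
  "retrieval", "multimodal", "observability",
]);
const HUB_TOOL_TYPES = new Set([
  "tool", "leaderboard", "paper", "article",
  "docs", "issue", "spec", "benchmark",
]);

// Returns a list of human-readable problems; an empty list means the
// entry passed every check in this sketch.
function validateEntry(entry) {
  const problems = [];
  if (typeof entry.id !== "string" || entry.id === "") {
    problems.push("id must be a non-empty string");
  }
  if (!HUB_CATEGORIES.has(entry.category)) {
    problems.push(`unknown category: ${entry.category}`);
  }
  if (entry.tafagent_mode !== null && typeof entry.tafagent_mode !== "string") {
    problems.push("tafagent_mode must be a string or null");
  }
  if (typeof entry.best_for !== "string") {
    problems.push("best_for must be a string");
  }
  if (entry.not_for !== null && typeof entry.not_for !== "string") {
    problems.push("not_for must be a string or null");
  }
  if (!Array.isArray(entry.external_tools)) {
    problems.push("external_tools must be an array");
  } else {
    entry.external_tools.forEach((tool, i) => {
      if (typeof tool.name !== "string" || typeof tool.url !== "string") {
        problems.push(`external_tools[${i}] needs string name and url`);
      }
      if (!HUB_TOOL_TYPES.has(tool.type)) {
        problems.push(`external_tools[${i}] has unknown type: ${tool.type}`);
      }
    });
  }
  return problems;
}

// Example: an entry copied from the data file passes with no problems.
const exampleEntry = {
  id: "cross_drift",
  category: "eval_reliability",
  pain: "Same model, different scores on different setups — bug or noise?",
  tafagent_mode: "🔀 Drift",
  external_tools: [
    { name: "vLLM vs HF transformers consistency study", url: "https://github.com/vllm-project/vllm/issues/12343", type: "issue" },
  ],
  best_for: "Predicting maximum admissible numerical gap between two evaluation frameworks.",
  not_for: "Identifying the exact root cause — narrows down candidates only.",
};
console.log(validateEntry(exampleEntry));
```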

UI: live search across pain+scenario+tool name, accordion per
category, badges for coverage status. i18n × 4 langs (EN/ES/FR/ZH).
Help modal entry, inventory card entry, task-tile button.
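
The live search can be pictured as a case-insensitive substring match over a few fields of each entry. The sketch below is an assumption about how such a filter might work, not the code in js/solutions_hub.js. `entryMatches` and `searchHub` are hypothetical names, and `best_for` stands in for the "scenario" text because the data file has no field by that name.

```javascript
// Hypothetical filter: true when the query appears in the pain text,
// the best_for text, or any external tool name (case-insensitive).
function entryMatches(entry, query) {
  const needle = query.trim().toLowerCase();
  if (needle === "") return true; // an empty search shows everything
  const toolNames = (entry.external_tools || []).map((tool) => tool.name);
  const haystack = [entry.pain, entry.best_for, ...toolNames]
    .filter((text) => typeof text === "string")
    .join("\n")
    .toLowerCase();
  return haystack.includes(needle);
}

function searchHub(entries, query) {
  return entries.filter((entry) => entryMatches(entry, query));
}

// Two entries abridged from the data file, for illustration only.
const searchSample = [
  {
    id: "rag_eval",
    pain: "Is my RAG retrieval actually retrieving?",
    best_for: "Production RAG monitoring.",
    external_tools: [{ name: "RAGAS — automated RAG eval (13.8k★)" }],
  },
  {
    id: "forgetting",
    pain: "Will my LoRA fine-tune destroy MMLU performance?",
    best_for: "Reading before any new fine-tune.",
    external_tools: [{ name: "Scaling Laws for Forgetting (Kleiman et al.)" }],
  },
];

console.log(searchHub(searchSample, "ragas").map((entry) => entry.id));
```

Plain substring matching keeps the filter predictable, at the cost of occasional surprises: a short query can match inside a longer word.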

Also surfaces 2 planned tafagent gaps: 🔧 PEFT Anti-Pattern Checker
(v0.8.2 candidate, peft #2115 silent fail) and JSON CoT-aware Linter
(answer-before-reasoning bug). Both browser-feasible, no current tool.

URL validation 2026-05-07: top critical URLs fetched + confirmed alive
(HF PEFT troubleshooting docs, MCP Schema Validator, RAGAS v0.4.3
13.8k★, MTEB leaderboard).
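
Any liveness check first needs the list of distinct URLs and the entries that cite them. The sketch below covers only that preparatory step and performs no network requests. `collectUrls` and `malformedUrls` are hypothetical helpers, and treating non-HTTPS links as suspect is an assumption made for illustration.

```javascript
// Hypothetical helper: maps each distinct URL to the ids of the entries
// that reference it, so a shared link is only checked once.
function collectUrls(entries) {
  const urlToEntryIds = new Map();
  for (const entry of entries) {
    for (const tool of entry.external_tools || []) {
      if (!urlToEntryIds.has(tool.url)) urlToEntryIds.set(tool.url, []);
      urlToEntryIds.get(tool.url).push(entry.id);
    }
  }
  return urlToEntryIds;
}

// Flags URLs that do not parse or do not use HTTPS (an assumed policy).
function malformedUrls(urlToEntryIds) {
  const bad = [];
  for (const url of urlToEntryIds.keys()) {
    try {
      if (new URL(url).protocol !== "https:") bad.push(url);
    } catch {
      bad.push(url); // the URL constructor throws on unparseable input
    }
  }
  return bad;
}

// Two entries abridged from the data file; both cite the TruLens site.
const urlSample = [
  {
    id: "rag_eval",
    external_tools: [
      { name: "RAGAS", url: "https://github.com/explodinggradients/ragas" },
      { name: "TruLens", url: "https://www.trulens.org/" },
    ],
  },
  {
    id: "agent_observability",
    external_tools: [
      { name: "TruLens", url: "https://www.trulens.org/" },
      { name: "OpenLLMetry", url: "https://github.com/traceloop/openllmetry" },
    ],
  },
];

const urlIndex = collectUrls(urlSample);
console.log(urlIndex.size, malformedUrls(urlIndex));
```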

Files: data/solutions_hub.json + js/solutions_hub.js (new);
index.html + js/main.js + js/i18n.js (modified).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (5)

data/solutions_hub.json +407 -0
index.html +24 -0
js/i18n.js +72 -0
js/main.js +108 -1
js/solutions_hub.js +69 -0

data/solutions_hub.json ADDED

	@@ -0,0 +1,407 @@

+{
+  "version": "0.8.1",
+  "compiled": "2026-05-07",
+  "philosophy": "tafagent as integrator, not silo. For each documented LLM-eval pain we surface: (a) the tafagent mode that addresses it, if any; (b) the best-of-breed external tools the community already trusts; (c) when to use which. Goal: complete coverage, not feature lock-in. If the canonical tool exists elsewhere we link, not rebuild.",
+  "verification_note": "All external URLs were fetched and confirmed alive on the compiled date. Treat older entries with skepticism — link rot is real. Report dead links via the GitHub issue tracker.",
+  "categories": {
+    "eval_reliability": {
+      "label": "Trust a benchmark score",
+      "icon": "✓",
+      "description": "Should I believe this number?"
+    },
+    "diagnostic": {
+      "label": "Diagnose a model",
+      "icon": "🔬",
+      "description": "Will this model work for my use case?"
+    },
+    "setup": {
+      "label": "Set up an eval correctly",
+      "icon": "⚙️",
+      "description": "Avoid silent failures before running."
+    },
+    "training": {
+      "label": "Train / fine-tune safely",
+      "icon": "🛠️",
+      "description": "Don't waste GPU time on broken setups."
+    },
+    "retrieval": {
+      "label": "RAG & retrieval quality",
+      "icon": "📚",
+      "description": "Is my retrieval actually retrieving?"
+    },
+    "multimodal": {
+      "label": "Multimodal models",
+      "icon": "🖼️",
+      "description": "Vision-language and beyond."
+    },
+    "observability": {
+      "label": "Observe & debug agents",
+      "icon": "🔭",
+      "description": "What is my agent actually doing?"
+    }
+  },
+  "entries": [
+    {
+      "id": "saturation",
+      "category": "eval_reliability",
+      "pain": "Benchmark saturation — top models all tied at 90%+, score no longer informative.",
+      "tafagent_mode": "📈 Saturation",
+      "external_tools": [
+        {"name": "DemandSphere AI Frontier Tracker", "url": "https://www.demandsphere.com/research/demandsphere-radar/ai-frontier-model-tracker/", "type": "leaderboard"},
+        {"name": "BenchLM.ai", "url": "https://benchlm.ai/", "type": "leaderboard"},
+        {"name": "LLM Stats", "url": "https://llm-stats.com/", "type": "leaderboard"}
+      ],
+      "best_for": "Quick check whether MMLU / AIME / HumanEval still discriminate frontier models in 2026.",
+      "not_for": "Predicting which model will win on a non-standard benchmark."
+    },
+    {
+      "id": "contamination",
+      "category": "eval_reliability",
+      "pain": "Benchmark contamination — model trained on the test set.",
+      "tafagent_mode": "🧪 Contamination",
+      "external_tools": [
+        {"name": "LiveBench (contamination-resistant)", "url": "https://livebench.ai/", "type": "leaderboard"},
+        {"name": "GSM8K-Platinum / contamination studies", "url": "https://thegrigorian.medium.com/when-benchmarks-lie-why-contamination-breaks-llm-evaluation-1fa335706f32", "type": "article"}
+      ],
+      "best_for": "Estimating contamination probability across 20+ public benchmarks per architecture.",
+      "not_for": "Definitive proof — needs trace inspection. Treat as prior, not certainty."
+    },
+    {
+      "id": "vendor_self_reported",
+      "category": "eval_reliability",
+      "pain": "Vendor-reported scores untrustworthy (Llama 4 mixed-quality reports).",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "llm-stats verified vs self-reported tags", "url": "https://llm-stats.com/benchmarks/swe-bench-verified", "type": "leaderboard"},
+        {"name": "BenchLM.ai confidence indicator", "url": "https://benchlm.ai/", "type": "leaderboard"},
+        {"name": "Vellum independent leaderboard", "url": "https://www.vellum.ai/llm-leaderboard", "type": "leaderboard"}
+      ],
+      "best_for": "Cross-checking vendor blog claims against community-verified runs before quoting.",
+      "not_for": "Models that have never been independently verified — assume vendor optimism."
+    },
+    {
+      "id": "arena_ci",
+      "category": "eval_reliability",
+      "pain": "Chatbot Arena hides confidence intervals — many top-Elo wins are statistically tied.",
+      "tafagent_mode": "🎯 Arena CI",
+      "external_tools": [
+        {"name": "LMArena leaderboard (raw)", "url": "https://lmarena.ai/", "type": "leaderboard"},
+        {"name": "Bradley-Terry methodology paper", "url": "https://arxiv.org/abs/2403.04132", "type": "paper"}
+      ],
+      "best_for": "Reconstructing 95% CIs from raw vote CSVs to flag statistical ties.",
+      "not_for": "Inferring true skill — Arena measures preference, not capability."
+    },
+    {
+      "id": "cross_drift",
+      "category": "eval_reliability",
+      "pain": "Same model, different scores on different setups — bug or noise?",
+      "tafagent_mode": "🔀 Drift",
+      "external_tools": [
+        {"name": "vLLM vs HF transformers consistency study", "url": "https://github.com/vllm-project/vllm/issues/12343", "type": "issue"}
+      ],
+      "best_for": "Predicting maximum admissible numerical gap between two evaluation frameworks.",
+      "not_for": "Identifying the exact root cause — narrows down candidates only."
+    },
+    {
+      "id": "sandbagging",
+      "category": "eval_reliability",
+      "pain": "Models can strategically underperform on capability evaluations.",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "AI Sandbagging paper", "url": "https://arxiv.org/abs/2406.07358", "type": "paper"},
+        {"name": "Covert sandbagging vs CoT monitoring", "url": "https://www.alphaxiv.org/overview/2508.00943", "type": "paper"}
+      ],
+      "best_for": "Awareness — knowing CoT monitoring can have up to 36% false-negative rate.",
+      "not_for": "Live detection — requires running the model and adversarial probes."
+    },
+    {
+      "id": "max_pos_embeddings_unmask",
+      "category": "diagnostic",
+      "pain": "Config claims 32k/128k context but model attends way less (SWA, YaRN).",
+      "tafagent_mode": "🪟 Unmask",
+      "external_tools": [
+        {"name": "vLLM long-context handling thread", "url": "https://github.com/vllm-project/vllm/issues/16757", "type": "issue"}
+      ],
+      "best_for": "1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED) before paying GPU.",
+      "not_for": "Validating that the model reasons (vs. just retrieves) at the effective context — use NIAH→Reason."
+    },
+    {
+      "id": "niah_reasoning",
+      "category": "diagnostic",
+      "pain": "Long-context models pass NIAH but fail multi-hop reasoning.",
+      "tafagent_mode": "🔍 NIAH→Reason",
+      "external_tools": [
+        {"name": "NVIDIA RULER benchmark", "url": "https://github.com/NVIDIA/RULER", "type": "tool"},
+        {"name": "RULER paper / leaderboard", "url": "https://llm-stats.com/benchmarks/ruler", "type": "leaderboard"}
+      ],
+      "best_for": "Predicting NIAH and reasoning pass rates from architecture alone — no inference needed.",
+      "not_for": "Final go/no-go decision — re-test on your domain after architectural screening passes."
+    },
+    {
+      "id": "tokenizer_glitch",
+      "category": "diagnostic",
+      "pain": "Glitch tokens / merge residues break inference silently.",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "GlitchMiner (AAAI 2026)", "url": "https://arxiv.org/html/2601.14658v1", "type": "paper"},
+        {"name": "Tiktokenizer (browser visualization)", "url": "https://tiktokenizer.vercel.app/", "type": "tool"}
+      ],
+      "best_for": "Spotting weird tokens. ~4.3% of vocab in Llama-2 / Mistral / DeepSeek-V3 are glitches.",
+      "not_for": "Fixing them — requires finetuning or vocab patching."
+    },
+    {
+      "id": "phase_diagram",
+      "category": "diagnostic",
+      "pain": "Where does my model sit in the architecture phase space (γ × θ)?",
+      "tafagent_mode": "📊 Phase diagram",
+      "external_tools": [],
+      "best_for": "Visualizing 23 reference models and locating yours by Hagedorn line / Padé curve.",
+      "not_for": "Quantitative recipe scoring — use Profile mode instead."
+    },
+    {
+      "id": "profile",
+      "category": "diagnostic",
+      "pain": "Will this model fit my use case across all 5 recipes?",
+      "tafagent_mode": "📇 Profile",
+      "external_tools": [],
+      "best_for": "Scoring all 5 recipes (custom train vs API · long context · budget · hardware · KV cache · etc.) in one pass.",
+      "not_for": "Production deployment readiness — Profile is screening, not certification."
+    },
+    {
+      "id": "chat_template",
+      "category": "setup",
+      "pain": "Forgetting `--apply_chat_template` silently halves multi-turn accuracy.",
+      "tafagent_mode": "📜 Chat-template",
+      "external_tools": [
+        {"name": "lm-eval-harness #1841 (canonical issue)", "url": "https://github.com/EleutherAI/lm-evaluation-harness/issues/1841", "type": "issue"},
+        {"name": "HF chat-template docs", "url": "https://huggingface.co/docs/transformers/main/en/chat_templating", "type": "docs"}
+      ],
+      "best_for": "Detecting which family (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / DeepSeek / Alpaca) and getting the exact CLI flag.",
+      "not_for": "Custom templates outside the 7 detected families — verify manually."
+    },
+    {
+      "id": "structured_outputs",
+      "category": "setup",
+      "pain": "JSON schema engines fail silently; CoT models commit to answer before reasoning.",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "llguidance (constrained decoding)", "url": "https://github.com/guidance-ai/llguidance", "type": "tool"},
+        {"name": "Outlines", "url": "https://github.com/dottxt-ai/outlines", "type": "tool"},
+        {"name": "JSONLint validator (browser)", "url": "https://jsonlint.com/json-schema", "type": "tool"},
+        {"name": "JSONSchemaBench (10K real schemas)", "url": "https://github.com/guidance-ai/jsonschemabench", "type": "benchmark"},
+        {"name": "Schema field-ordering anti-patterns explained", "url": "https://collinwilkins.com/articles/structured-output", "type": "article"}
+      ],
+      "best_for": "Constrained decoding for production. Use llguidance / Outlines / SGLang grammars for 100% schema-valid output.",
+      "not_for": "Quick prototypes — function calling is sufficient (95-99% reliable)."
+    },
+    {
+      "id": "mcp_conformance",
+      "category": "setup",
+      "pain": "MCP server schema doesn't conform to spec — clients silently break.",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "MCP Schema Validator (free, browser-based)", "url": "https://www.mcpserverspot.com/tools/validator", "type": "tool"},
+        {"name": "Official MCP spec", "url": "https://github.com/modelcontextprotocol/modelcontextprotocol", "type": "spec"},
+        {"name": "FastMCP 3.0 (Jan 2026)", "url": "https://github.com/jlowin/fastmcp", "type": "tool"}
+      ],
+      "best_for": "One-shot validation of tool/resource/prompt schemas before publishing an MCP server.",
+      "not_for": "Runtime testing — use the official inspector for live calls."
+    },
+    {
+      "id": "diagnose_cli",
+      "category": "setup",
+      "pain": "Need to measure γ_obs on real weights, not just predict from config.",
+      "tafagent_mode": "🩺 Diagnose CLI",
+      "external_tools": [
+        {"name": "TAF paper (Triangulum/karlesmarin)", "url": "https://github.com/karlesmarin/NeurIPS", "type": "paper"}
+      ],
+      "best_for": "Generating the exact `python cli/diagnose_model.py` command for your model.",
+      "not_for": "Browser-only diagnosis — this mode is a builder, not an executor."
+    },
+    {
+      "id": "peft_loading",
+      "category": "training",
+      "pain": "`get_peft_model()` before `PeftModel.from_pretrained()` silently loads base model — LoRA weights ignored.",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "HF PEFT troubleshooting (canonical)", "url": "https://huggingface.co/docs/peft/main/en/developer_guides/troubleshooting", "type": "docs"},
+        {"name": "peft #2115 — original bug report", "url": "https://github.com/huggingface/peft/issues/2115", "type": "issue"},
+        {"name": "PEFT get_layer_status() / get_model_status()", "url": "https://huggingface.co/docs/peft/main/en/package_reference/peft_model", "type": "docs"}
+      ],
+      "best_for": "If you suspect your LoRA isn't being applied, call `model.get_layer_status()` and check `active_adapters` is non-empty.",
+      "not_for": null,
+      "tafagent_planned_mode": "🔧 PEFT Anti-Pattern Checker (v0.8.2)"
+    },
+    {
+      "id": "intruder_dimensions",
+      "category": "training",
+      "pain": "LoRA introduces 'intruder dimensions' that contribute to forgetting.",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "PEFT reduce_intruder_dimension utility", "url": "https://huggingface.co/docs/peft/main/en/developer_guides/troubleshooting", "type": "docs"}
+      ],
+      "best_for": "Post-training cleanup if forgetting metrics regress after LoRA finetune.",
+      "not_for": "Heavy domain shift — intruder dim removal won't fix structural forgetting."
+    },
+    {
+      "id": "quant_regime",
+      "category": "training",
+      "pain": "Will quantization break my model? Which scheme for which arch?",
+      "tafagent_mode": "⚖️ Quant",
+      "external_tools": [
+        {"name": "Maarten Grootendorst quantization newsletter", "url": "https://newsletter.maartengrootendorst.com/p/which-quantization-method-is-right", "type": "article"},
+        {"name": "Jarvis Labs vLLM quantization benchmarks", "url": "https://jarvislabs.ai/blog/vllm-quantization-complete-guide-benchmarks", "type": "article"},
+        {"name": "oobabooga quant comparison (GPTQ/AWQ/EXL2/GGUF)", "url": "https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/", "type": "article"},
+        {"name": "Which Quantization (arxiv)", "url": "https://arxiv.org/pdf/2601.14277", "type": "paper"}
+      ],
+      "best_for": "Predict γ shift + ΔPPL for any (model × scheme) combo. AWQ ~95% / GGUF ~92% / GPTQ ~90% retention.",
+      "not_for": "Production quality cert — run a 10-prompt holdout eval after quantization."
+    },
+    {
+      "id": "forgetting",
+      "category": "training",
+      "pain": "Will my LoRA fine-tune destroy MMLU performance?",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "Scaling Laws for Forgetting (Kleiman et al.)", "url": "https://arxiv.org/html/2401.05605v1", "type": "paper"},
+        {"name": "LoRA Learns Less and Forgets Less (Biderman et al., TMLR)", "url": "https://arxiv.org/abs/2405.09673", "type": "paper"},
+        {"name": "How Much is Too Much? (LoRA Rank Trade-offs)", "url": "https://arxiv.org/html/2512.15634v1", "type": "paper"}
+      ],
+      "best_for": "Reading before any new fine-tune. Same (arch, rank) yields Δ from -10pp to +35pp on MMLU.",
+      "not_for": "A predictor — variance is too high for a closed-form heuristic. Measure your own holdout."
+    },
+    {
+      "id": "rag_eval",
+      "category": "retrieval",
+      "pain": "Is my RAG retrieval actually retrieving?",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "RAGAS — automated RAG eval (13.8k★)", "url": "https://github.com/explodinggradients/ragas", "type": "tool"},
+        {"name": "TruLens — feedback functions + tracing", "url": "https://www.trulens.org/", "type": "tool"},
+        {"name": "DeepEval — 50+ metrics, CI/CD ready", "url": "https://github.com/confident-ai/deepeval", "type": "tool"},
+        {"name": "RAG eval frameworks comparison", "url": "https://atlan.com/know/llm-evaluation-frameworks-compared/", "type": "article"}
+      ],
+      "best_for": "Production RAG monitoring. RAGAS for metric exploration, DeepEval for CI/CD gates, TruLens for dashboards.",
+      "not_for": "Browser-only — all three need Python + your retrieval pipeline."
+    },
+    {
+      "id": "embeddings",
+      "category": "retrieval",
+      "pain": "Which embedding model for my corpus?",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "MTEB Leaderboard (HF official)", "url": "https://huggingface.co/spaces/mteb/leaderboard", "type": "leaderboard"},
+        {"name": "MMTEB — 250+ langs", "url": "https://github.com/embeddings-benchmark/mteb", "type": "tool"},
+        {"name": "Best embedding models for RAG (2026)", "url": "https://blog.premai.io/best-embedding-models-for-rag-2026-ranked-by-mteb-score-cost-and-self-hosting/", "type": "article"}
+      ],
+      "best_for": "Cross-comparison of 100+ embedding models on 56 English tasks / 250+ multilingual.",
+      "not_for": "Predicting performance on your specific corpus — 'leaderboard ≠ your data'."
+    },
+    {
+      "id": "vlm_eval",
+      "category": "multimodal",
+      "pain": "Which VLM benchmark, and is my VLM actually seeing?",
+      "tafagent_mode": "📈 Saturation (covers MMMU/MMMU-Pro/VisScience)",
+      "external_tools": [
+        {"name": "MMMU benchmark", "url": "https://mmmu-benchmark.github.io/", "type": "leaderboard"},
+        {"name": "VisScience (K-12 science)", "url": "https://arxiv.org/abs/2409.13730", "type": "paper"},
+        {"name": "VLM survey 2025", "url": "https://arxiv.org/abs/2501.02189", "type": "paper"}
+      ],
+      "best_for": "MMMU near-saturated (top-3 ~85.6%); VisScience still discriminative (~46% mean) — pick the harder one.",
+      "not_for": "Visual hallucination detection — needs running the VLM with your images."
+    },
+    {
+      "id": "agent_observability",
+      "category": "observability",
+      "pain": "Why did my agent fail / loop? Can't tell from logs.",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "LangSmith (LangChain ecosystem)", "url": "https://www.langchain.com/langsmith/observability", "type": "tool"},
+        {"name": "LangGraph Studio v2 (May 2025)", "url": "https://www.langchain.com/", "type": "tool"},
+        {"name": "TruLens (RAG + agent traces)", "url": "https://www.trulens.org/", "type": "tool"},
+        {"name": "OpenLLMetry — OTLP-based tracing", "url": "https://github.com/traceloop/openllmetry", "type": "tool"}
+      ],
+      "best_for": "Visual trace viewer per LLM call / tool invocation / retrieval step. Token + cost tracking.",
+      "not_for": "Browser-only — all need integration into your stack."
+    },
+    {
+      "id": "instruction_following",
+      "category": "observability",
+      "pain": "Best agentic models follow <30% of instructions perfectly on real-world tasks.",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "AGENTIF benchmark", "url": "https://keg.cs.tsinghua.edu.cn/persons/xubin/papers/AgentIF.pdf", "type": "paper"},
+        {"name": "Tool output processing benchmark", "url": "https://arxiv.org/html/2510.15955v1", "type": "paper"}
+      ],
+      "best_for": "Calibrating expectations — performance falls with instruction length and tool constraints.",
+      "not_for": "Live testing — needs running the agent on your task suite."
+    },
+    {
+      "id": "saturation_meta_resources",
+      "category": "eval_reliability",
+      "pain": "I want to read the full state of LLM evaluation 2026.",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "Survey: A Survey on LLM Benchmarks (2508.15361)", "url": "https://arxiv.org/abs/2508.15361", "type": "paper"},
+        {"name": "Survey: LLMs-as-Judges (2412.05579)", "url": "https://arxiv.org/abs/2412.05579", "type": "paper"},
+        {"name": "Holistic Evaluation of Language Models (HELM)", "url": "https://crfm.stanford.edu/helm/latest/", "type": "tool"},
+        {"name": "Open LLM Leaderboard v3", "url": "https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard", "type": "leaderboard"}
+      ],
+      "best_for": "Comprehensive context on contamination, judge bias, saturation, methodology open problems.",
+      "not_for": "Quick decisions — these are surveys, not tools."
+    },
+    {
+      "id": "config_inspector",
+      "category": "diagnostic",
+      "pain": "What's actually in this model's config.json?",
+      "tafagent_mode": "🔍 Inspect config",
+      "external_tools": [
+        {"name": "LLM Config Comparer", "url": "https://huggingface.co/spaces/gojiteji/LLM-Comparer", "type": "tool"},
+        {"name": "HF Hub model card / config viewer", "url": "https://huggingface.co/", "type": "tool"}
+      ],
+      "best_for": "Paste config JSON → full TAF analysis without re-fetching.",
+      "not_for": "Comparing across N models — use 🆚 Compare or open-llm-leaderboard/comparator."
+    },
+    {
+      "id": "compare_models",
+      "category": "diagnostic",
+      "pain": "Side-by-side comparison of multiple models on multiple recipes.",
+      "tafagent_mode": "🆚 Compare models",
+      "external_tools": [
+        {"name": "Open LLM Leaderboard Comparator (HF official)", "url": "https://huggingface.co/spaces/open-llm-leaderboard/comparator", "type": "tool"}
+      ],
+      "best_for": "Quick recipe-by-recipe comparison up to 5 models.",
+      "not_for": "Production benchmark scores — use the HF comparator for benchmark results."
+    },
+    {
+      "id": "ask_plain_english",
+      "category": "diagnostic",
+      "pain": "I just want to ask a question in plain English.",
+      "tafagent_mode": "💬 Ask plain English",
+      "external_tools": [],
+      "best_for": "'Will Mistral-7B handle 16K NIAH retrieval?' → answer with the right recipe + chain.",
+      "not_for": "Open-ended chat — this is a routing front-end, not a chatbot."
+    },
+    {
+      "id": "recipe_picker",
+      "category": "diagnostic",
+      "pain": "I know my use case but not which recipe to apply.",
+      "tafagent_mode": "📋 Pick recipe",
+      "external_tools": [],
+      "best_for": "Browsing the 8 recipes (custom train vs API · long context · budget · hardware · etc.) when you don't know which fits.",
+      "not_for": "Running all of them at once — use Profile mode."
+    },
+    {
+      "id": "verified_math",
+      "category": "diagnostic",
+      "pain": "Can I trust the math behind the diagnostic?",
+      "tafagent_mode": null,
+      "external_tools": [
+        {"name": "Lean theorems (Triangulum/karlesmarin/lean-taf)", "url": "https://github.com/karlesmarin/lean-taf", "type": "spec"},
+        {"name": "TAF paper (NeurIPS)", "url": "https://github.com/karlesmarin/NeurIPS", "type": "paper"}
+      ],
+      "best_for": "37 theorems machine-proven in Lean 4 + Mathlib. Click any badge in the UI to open the source line.",
+      "not_for": "Empirical claims — Lean covers algebraic identities, not measurement protocols."
+    }
+  ]
+}

index.html CHANGED

@@ -216,6 +216,9 @@
       <p><strong data-i18n="help.v08.saturation.title">📈 Benchmark Saturation Detector</strong></p>
       <p data-i18n="help.v08.saturation.body">MMLU is saturated (top 88-94%), AIME 2025 saturated within months of release, HumanEval near-saturated. Pick any benchmark and the tool returns top-3 frontier scores, spread, mean, and a verdict — saturated / near-saturated / discriminative — plus a recommended replacement (e.g. MMLU → MMLU-Pro / GPQA / HLE). Live fetch from DemandSphere AI Frontier Tracker (CC BY-NC 4.0) when reachable; baked 2026-05-05 snapshot when not. <em>Use case</em>: before you cite '92% on MMLU' or design an eval, check whether the benchmark still discriminates anything.</p>
       <h3 data-i18n="help.audit.title">The audit chain</h3>
       <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
       output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
@@ -325,6 +328,7 @@
             <li data-i18n="inv.v07.drift"><strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals</li>
             <li data-i18n="inv.v07.niah"><strong>🔍 NIAH→Reason</strong> — does your "128k context" actually reason there, or just retrieve?</li>
             <li data-i18n="inv.v08.saturation"><strong>📈 Saturation</strong> — is your benchmark still useful, or are all frontier models tied at the top?</li>
           </ul>
         </details>
       </div>
@@ -383,6 +387,7 @@
             <button data-mode-link="drift" data-i18n="modes.drift">🔀 Drift</button>
             <button data-mode-link="arena" data-i18n="modes.arena">🎯 Arena CI</button>
             <button data-mode-link="saturation" data-i18n="modes.saturation">📈 Saturation</button>
           </div>
         </div>
         <div class="task-tile">
@@ -450,6 +455,7 @@
         <button class="mode-btn" data-mode="drift" role="tab" aria-selected="false" data-i18n="modes.drift">🔀 Drift</button>
         <button class="mode-btn" data-mode="niah" role="tab" aria-selected="false" data-i18n="modes.niah">🔍 NIAH→Reason</button>
         <button class="mode-btn" data-mode="saturation" role="tab" aria-selected="false" data-i18n="modes.saturation">📈 Saturation</button>
       </div>
       <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
         <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
@@ -997,6 +1003,24 @@
       </p>
     </section>
     <!-- Recipe selector (mode=recipe) -->
     <section id="recipe-section" style="display:none;">
       <h2 data-i18n="recipe.title">📋 Recipe</h2>

       <p><strong data-i18n="help.v08.saturation.title">📈 Benchmark Saturation Detector</strong></p>
       <p data-i18n="help.v08.saturation.body">MMLU is saturated (top 88-94%), AIME 2025 saturated within months of release, HumanEval near-saturated. Pick any benchmark and the tool returns top-3 frontier scores, spread, mean, and a verdict — saturated / near-saturated / discriminative — plus a recommended replacement (e.g. MMLU → MMLU-Pro / GPQA / HLE). Live fetch from DemandSphere AI Frontier Tracker (CC BY-NC 4.0) when reachable; baked 2026-05-05 snapshot when not. <em>Use case</em>: before you cite '92% on MMLU' or design an eval, check whether the benchmark still discriminates anything.</p>
+      <p><strong data-i18n="help.v081.hub.title">🧭 Solutions Hub</strong></p>
+      <p data-i18n="help.v081.hub.body">tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'</p>
       <h3 data-i18n="help.audit.title">The audit chain</h3>
       <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
       output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
             <li data-i18n="inv.v07.drift"><strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals</li>
             <li data-i18n="inv.v07.niah"><strong>🔍 NIAH→Reason</strong> — does your "128k context" actually reason there, or just retrieve?</li>
             <li data-i18n="inv.v08.saturation"><strong>📈 Saturation</strong> — is your benchmark still useful, or are all frontier models tied at the top?</li>
+            <li data-i18n="inv.v081.hub"><strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.</li>
           </ul>
         </details>
       </div>
             <button data-mode-link="drift" data-i18n="modes.drift">🔀 Drift</button>
             <button data-mode-link="arena" data-i18n="modes.arena">🎯 Arena CI</button>
             <button data-mode-link="saturation" data-i18n="modes.saturation">📈 Saturation</button>
+            <button data-mode-link="hub" data-i18n="modes.hub">🧭 Solutions</button>
           </div>
         </div>
         <div class="task-tile">
         <button class="mode-btn" data-mode="drift" role="tab" aria-selected="false" data-i18n="modes.drift">🔀 Drift</button>
         <button class="mode-btn" data-mode="niah" role="tab" aria-selected="false" data-i18n="modes.niah">🔍 NIAH→Reason</button>
         <button class="mode-btn" data-mode="saturation" role="tab" aria-selected="false" data-i18n="modes.saturation">📈 Saturation</button>
+        <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
       </div>
       <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
         <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
       </p>
     </section>
+    <!-- Solutions Hub — integrator portal (v0.8.1) -->
+    <section id="hub-section" style="display:none;">
+      <h2><span data-i18n="hub.title">🧭 Solutions Hub</span>
+        <span class="info"><span class="tooltip" data-i18n="hub.tip">
+          Map of every documented LLM-eval pain we know about: which tafagent mode addresses it (if any), and the best-of-breed external tools the community already trusts. Goal: full coverage. If a canonical tool exists elsewhere, we link rather than rebuild.
+        </span></span>
+      </h2>
+      <p class="recipe-desc" data-i18n="hub.desc">
+        <strong>Don't reinvent — find.</strong> 30+ pains mapped to tafagent modes + curated external tools. Browse by category, search by keyword, or see the gaps where new modes would help most.
+      </p>
+      <div class="form-row">
+        <input type="text" id="hub-search" placeholder="search: e.g. 'forgetting' or 'vendor' or 'RAG'…" style="flex:1;" />
+        <button type="button" id="hub-clear-btn" class="secondary" data-i18n="hub.clear_btn">✕ Clear</button>
+      </div>
+      <p id="hub-status" class="recipe-desc" style="font-size:0.92em;"></p>
+      <div id="hub-output" style="margin-top: 1em;"></div>
+    </section>
     <!-- Recipe selector (mode=recipe) -->
     <section id="recipe-section" style="display:none;">
       <h2 data-i18n="recipe.title">📋 Recipe</h2>
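
The Hub tooltip and description in the markup above say that each pain maps to an optional tafagent mode plus curated external tools, and that entries are searchable by keyword. A minimal sketch of what such an entry and a keyword search might look like, assuming the field names mentioned in this commit (`pain`, `scenario`, `tafagent_mode`, `external_tools`). The sample entries, the example URL, and `searchHubEntries` are hypothetical; they are not the contents of `data/solutions_hub.json` or the actual `searchEntries` in `js/solutions_hub.js`.

```javascript
// Hypothetical sample entries: field names follow the commit description, values are invented.
const sampleEntries = [
  {
    pain: "RAG answers are not grounded in retrieved context",
    scenario: "Evaluating a retrieval pipeline before release",
    tafagent_mode: null, // null: no tafagent mode covers this pain
    external_tools: [
      { name: "Example RAG evaluator", url: "https://example.com/rag-eval", type: "tool" },
    ],
  },
  {
    pain: "Benchmark no longer separates frontier models",
    scenario: "Choosing which benchmark score to cite",
    tafagent_mode: "saturation",
    external_tools: [],
  },
];

// Case-insensitive substring match across pain, scenario, and tool names.
// An empty or whitespace-only query returns every entry.
function searchHubEntries(entries, query) {
  const needle = query.trim().toLowerCase();
  if (!needle) return entries;
  return entries.filter((entry) => {
    const haystack = [
      entry.pain,
      entry.scenario,
      ...(entry.external_tools || []).map((tool) => tool.name),
    ].filter(Boolean).join(" ").toLowerCase();
    return haystack.includes(needle);
  });
}

const ragMatches = searchHubEntries(sampleEntries, "  RAG ");
```

Substring matching keeps the sketch dependency-free; a real implementation might also index `best_for` and `not_for`, or rank matches, which this sketch does not attempt.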

js/i18n.js CHANGED

@@ -423,6 +423,8 @@ export const TRANSLATIONS = {
     "mode_desc.niah":              "Predicts NIAH (retrieval) and multi-hop reasoning pass rates at any context. Solves: long-context models often pass NIAH but fail reasoning at the same context (RULER paper).",
     "modes.saturation":            "📈 Saturation",
     "mode_desc.saturation":        "Tells you whether a benchmark still discriminates frontier models or has saturated (e.g. MMLU 88-94% top, AIME 2025 already 96-100%). Returns top-3 + verdict + recommended replacements.",
     "niah.title":                  "🔍 NIAH → Reasoning Gap",
     "niah.tip":                    "NIAH (Needle in a Haystack) tests retrieval: 'find this fact in long text'. Multi-hop reasoning tests inference: 'combine facts X+Y at the start with fact Z at the end'. RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.",
     "niah.desc":                   "<strong>Your model claims 128k context. Will it actually reason at 64k, or just retrieve?</strong> Paste an HF model id and a target eval context — tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a 'safe context' where reasoning stays ≥65%.",
@@ -501,6 +503,22 @@ export const TRANSLATIONS = {
     "help.v08.saturation.title":   "📈 Benchmark Saturation Detector",
     "help.v08.saturation.body":    "MMLU is saturated (88-94% top), AIME 2025 saturated within months of release, HumanEval near-saturated. Pick any benchmark and the tool returns top-3 frontier scores, spread, mean, and a verdict — saturated / near-saturated / discriminative — plus a recommended replacement (e.g. MMLU → MMLU-Pro / GPQA / HLE). Live fetch from DemandSphere AI Frontier Tracker (CC BY-NC 4.0) when reachable; baked 2026-05-05 snapshot when not. <em>Use case</em>: before you cite '92% on MMLU' or design an eval, check whether the benchmark still discriminates anything.",
     "inv.v08.saturation":          "<strong>📈 Saturation</strong> — is your benchmark still useful, or are all frontier models tied at the top?",
     // v0.7.7 — Task tiles (UX restructure: 14 modes grouped by user intent)
     "tiles.title":                 "🎯 What do you want to do?",
@@ -1367,6 +1385,8 @@ export const TRANSLATIONS = {
     "mode_desc.niah":              "Predice tasas de pass de NIAH (retrieval) y reasoning multi-hop a cualquier contexto. Resuelve: modelos long-context pasan NIAH pero fallan reasoning al mismo contexto (paper RULER).",
     "modes.saturation":            "📈 Saturación",
     "mode_desc.saturation":        "Te dice si un benchmark sigue discriminando frontier models o ya está saturado (ej. MMLU 88-94% top, AIME 2025 ya 96-100%). Devuelve top-3 + veredicto + reemplazos recomendados.",
     "niah.title":                  "🔍 Gap NIAH → Reasoning",
     "niah.tip":                    "NIAH (Needle in a Haystack) testea retrieval: 'encuentra este hecho en texto largo'. Reasoning multi-hop testea inferencia: 'combina hechos X+Y del principio con hecho Z del final'. El paper RULER (NVIDIA 2024) muestra que modelos long-context a menudo pasan NIAH pero fallan reasoning al mismo contexto. Esta herramienta predice ambas tasas desde la arquitectura sola.",
     "niah.desc":                   "<strong>Tu modelo dice 128k de contexto. ¿Razonará realmente a 64k, o solo encontrará?</strong> Pega un model id HF y un contexto objetivo — la herramienta predice tasas de pass NIAH y reasoning multi-hop, el gap, y un 'contexto seguro' donde reasoning se mantiene ≥65%.",
@@ -1445,6 +1465,22 @@ export const TRANSLATIONS = {
     "help.v08.saturation.title":   "📈 Detector de saturación de benchmarks",
     "help.v08.saturation.body":    "MMLU está saturado (top 88-94%), AIME 2025 saturó a los pocos meses de salir, HumanEval near-saturated. Elige cualquier benchmark y la herramienta retorna top-3 frontier scores, spread, media, y un veredicto — saturated / near-saturated / discriminative — más un reemplazo recomendado (ej. MMLU → MMLU-Pro / GPQA / HLE). Fetch en vivo desde DemandSphere AI Frontier Tracker (CC BY-NC 4.0) cuando llega; snapshot baked 2026-05-05 cuando no. <em>Caso de uso</em>: antes de citar '92% en MMLU' o diseñar una eval, verifica si el benchmark aún discrimina algo.",
     "inv.v08.saturation":          "<strong>📈 Saturation</strong> — ¿sigue siendo útil tu benchmark, o están todos los frontiers empatados arriba?",
     // v0.7.7 — Tiles de tareas (UX restructure: 14 modos agrupados por intención)
     "tiles.title":                 "🎯 ¿Qué quieres hacer?",
@@ -2175,6 +2211,8 @@ export const TRANSLATIONS = {
     "mode_desc.niah":              "Prédit les taux de réussite NIAH (retrieval) et reasoning multi-hop à n'importe quel contexte. Résout : les modèles long-context passent souvent NIAH mais échouent au reasoning au même contexte (paper RULER).",
     "modes.saturation":            "📈 Saturation",
     "mode_desc.saturation":        "Indique si un benchmark discrimine encore les frontier models ou s'il est saturé (ex. MMLU 88-94% top, AIME 2025 déjà 96-100%). Retourne top-3 + verdict + remplacements recommandés.",
     "niah.title":                  "🔍 Gap NIAH → Reasoning",
     "niah.tip":                    "NIAH (Needle in a Haystack) teste le retrieval : 'trouve ce fait dans un long texte'. Le reasoning multi-hop teste l'inférence : 'combine les faits X+Y au début avec le fait Z à la fin'. Le paper RULER (NVIDIA 2024) montre que les modèles long-context passent souvent NIAH mais échouent au reasoning au même contexte. Cet outil prédit les deux taux à partir de la seule architecture.",
     "niah.desc":                   "<strong>Votre modèle revendique 128k de contexte. Va-t-il vraiment raisonner à 64k, ou seulement retrouver ?</strong> Collez un model id HF et un contexte cible — l'outil prédit les taux de réussite NIAH et reasoning multi-hop, le gap, et un 'contexte sûr' où le reasoning reste ≥65%.",
@@ -2253,6 +2291,22 @@ export const TRANSLATIONS = {
     "help.v08.saturation.title":   "📈 Détecteur de saturation des benchmarks",
     "help.v08.saturation.body":    "MMLU est saturé (top 88-94%), AIME 2025 saturé en quelques mois après sa sortie, HumanEval presque saturé. Choisissez un benchmark et l'outil retourne top-3 frontier scores, spread, moyenne, et un verdict — saturated / near-saturated / discriminative — plus un remplacement recommandé (ex. MMLU → MMLU-Pro / GPQA / HLE). Fetch en direct depuis DemandSphere AI Frontier Tracker (CC BY-NC 4.0) si accessible ; snapshot baked 2026-05-05 sinon. <em>Cas d'usage</em> : avant de citer '92% sur MMLU' ou de concevoir une eval, vérifiez si le benchmark discrimine encore quelque chose.",
     "inv.v08.saturation":          "<strong>📈 Saturation</strong> — votre benchmark est-il encore utile, ou tous les frontiers sont-ils à égalité au sommet ?",
     // v0.7.7 — Tuiles de tâches (refonte UX : 14 modes regroupés par intention)
     "tiles.title":                 "🎯 Que voulez-vous faire ?",
@@ -2983,6 +3037,8 @@ export const TRANSLATIONS = {
     "mode_desc.niah":              "在任意上下文下预测 NIAH（检索）与多跳 reasoning 通过率。解决：长上下文模型常常通过 NIAH 但在同一上下文上 reasoning 失败（RULER 论文）。",
     "modes.saturation":            "📈 饱和度",
     "mode_desc.saturation":        "告诉你某个 benchmark 是否仍能区分 frontier 模型，或者已经饱和（例如 MMLU 88-94% 顶部，AIME 2025 已经 96-100%）。返回 top-3 + 判定 + 推荐替代品。",
     "niah.title":                  "🔍 NIAH → Reasoning Gap",
     "niah.tip":                    "NIAH（Needle in a Haystack）测试检索：\"在长文本中找到这个事实\"。多跳 reasoning 测试推理：\"把开头的事实 X+Y 与结尾的事实 Z 结合\"。RULER 论文（NVIDIA 2024）显示长上下文模型经常通过 NIAH 但在相同上下文上 reasoning 失败。本工具仅根据架构预测两种通过率。",
     "niah.desc":                   "<strong>你的模型声称 128k 上下文。它在 64k 是真的能 reasoning，还是只能检索？</strong>粘贴 HF 模型 id 和目标 eval 上下文 — 工具预测 NIAH 与多跳 reasoning 通过率、gap，以及 reasoning 保持 ≥65% 的 \"安全上下文\"。",
@@ -3061,6 +3117,22 @@ export const TRANSLATIONS = {
     "help.v08.saturation.title":   "📈 Benchmark 饱和度检测器",
     "help.v08.saturation.body":    "MMLU 已饱和（top 88-94%），AIME 2025 上线几个月就饱和，HumanEval 接近饱和。选任何 benchmark，工具返回 top-3 frontier 分数、spread、平均，以及判定 — saturated / near-saturated / discriminative — 加上推荐替代品（例如 MMLU → MMLU-Pro / GPQA / HLE）。可达时从 DemandSphere AI Frontier Tracker（CC BY-NC 4.0）实时 fetch；不可达时使用 2026-05-05 的 baked 快照。<em>用例</em>：在引用\"92% on MMLU\"或设计 eval 之前，检查 benchmark 是否仍能区分任何东西。",
     "inv.v08.saturation":          "<strong>📈 Saturation</strong> — 你的 benchmark 还有用吗，还是所有 frontier 都在顶部并列？",
     // v0.7.7 — 任务卡片（UX 重构：按用户意图分组的 14 个模式）
     "tiles.title":                 "🎯 你想做什么？",

     "mode_desc.niah":              "Predicts NIAH (retrieval) and multi-hop reasoning pass rates at any context. Solves: long-context models often pass NIAH but fail reasoning at the same context (RULER paper).",
     "modes.saturation":            "📈 Saturation",
     "mode_desc.saturation":        "Tells you whether a benchmark still discriminates frontier models or has saturated (e.g. MMLU 88-94% top, AIME 2025 already 96-100%). Returns top-3 + verdict + recommended replacements.",
+    "modes.hub":                   "🧭 Solutions",
+    "mode_desc.hub":               "Map of every documented LLM-eval pain → tafagent mode (if covered) + curated external tools. Find the right solution without rebuilding it. 30+ pains, 7 categories.",
     "niah.title":                  "🔍 NIAH → Reasoning Gap",
     "niah.tip":                    "NIAH (Needle in a Haystack) tests retrieval: 'find this fact in long text'. Multi-hop reasoning tests inference: 'combine facts X+Y at the start with fact Z at the end'. RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.",
     "niah.desc":                   "<strong>Your model claims 128k context. Will it actually reason at 64k, or just retrieve?</strong> Paste an HF model id and a target eval context — tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a 'safe context' where reasoning stays ≥65%.",
     "help.v08.saturation.title":   "📈 Benchmark Saturation Detector",
     "help.v08.saturation.body":    "MMLU is saturated (88-94% top), AIME 2025 saturated within months of release, HumanEval near-saturated. Pick any benchmark and the tool returns top-3 frontier scores, spread, mean, and a verdict — saturated / near-saturated / discriminative — plus a recommended replacement (e.g. MMLU → MMLU-Pro / GPQA / HLE). Live fetch from DemandSphere AI Frontier Tracker (CC BY-NC 4.0) when reachable; baked 2026-05-05 snapshot when not. <em>Use case</em>: before you cite '92% on MMLU' or design an eval, check whether the benchmark still discriminates anything.",
     "inv.v08.saturation":          "<strong>📈 Saturation</strong> — is your benchmark still useful, or are all frontier models tied at the top?",
+    "inv.v081.hub":                "<strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.",
+    "help.v081.hub.title":         "🧭 Solutions Hub",
+    "help.v081.hub.body":          "tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'",
+    "hub.title":                   "🧭 Solutions Hub",
+    "hub.tip":                     "Map of every documented LLM-eval pain we know about: which tafagent mode addresses it (if any), and the best-of-breed external tools the community already trusts. Goal: full coverage. If a canonical tool exists elsewhere, we link rather than rebuild.",
+    "hub.desc":                    "<strong>Don't reinvent — find.</strong> 30+ pains mapped to tafagent modes + curated external tools. Browse by category, search by keyword, or see the gaps where new modes would help most.",
+    "hub.clear_btn":               "✕ Clear",
+    "hub.no_mode":                 "external",
+    "hub.planned":                 "planned:",
+    "hub.best_for":                "Best for",
+    "hub.not_for":                 "Not for",
+    "hub.tools":                   "External tools",
+    "hub.status.loaded":           "✅ Loaded {total} pains across {categories} categories — {covered} covered by tafagent modes, {externalLinks} external links curated. Compiled {compiled}.",
+    "hub.status.fail":             "⚠ Could not load Solutions Hub.",
+    "hub.search.empty":            "No matches for '{query}'. Try broader terms (e.g. 'eval', 'rag', 'tokenizer').",
+    "hub.search.results":          "Found {n} match(es) for '{query}'.",
     // v0.7.7 — Task tiles (UX restructure: 14 modes grouped by user intent)
     "tiles.title":                 "🎯 What do you want to do?",
     "mode_desc.niah":              "Predice tasas de pass de NIAH (retrieval) y reasoning multi-hop a cualquier contexto. Resuelve: modelos long-context pasan NIAH pero fallan reasoning al mismo contexto (paper RULER).",
     "modes.saturation":            "📈 Saturación",
     "mode_desc.saturation":        "Te dice si un benchmark sigue discriminando frontier models o ya está saturado (ej. MMLU 88-94% top, AIME 2025 ya 96-100%). Devuelve top-3 + veredicto + reemplazos recomendados.",
+    "modes.hub":                   "🧭 Soluciones",
+    "mode_desc.hub":               "Mapa de cada problema documentado de LLM-eval → mode tafagent (si cubierto) + herramientas externas curadas. Encuentra la solución sin reinventarla. 30+ pains, 7 categorías.",
     "niah.title":                  "🔍 Gap NIAH → Reasoning",
     "niah.tip":                    "NIAH (Needle in a Haystack) testea retrieval: 'encuentra este hecho en texto largo'. Reasoning multi-hop testea inferencia: 'combina hechos X+Y del principio con hecho Z del final'. El paper RULER (NVIDIA 2024) muestra que modelos long-context a menudo pasan NIAH pero fallan reasoning al mismo contexto. Esta herramienta predice ambas tasas desde la arquitectura sola.",
     "niah.desc":                   "<strong>Tu modelo dice 128k de contexto. ¿Razonará realmente a 64k, o solo encontrará?</strong> Pega un model id HF y un contexto objetivo — la herramienta predice tasas de pass NIAH y reasoning multi-hop, el gap, y un 'contexto seguro' donde reasoning se mantiene ≥65%.",
     "help.v08.saturation.title":   "📈 Detector de saturación de benchmarks",
     "help.v08.saturation.body":    "MMLU está saturado (top 88-94%), AIME 2025 saturó a los pocos meses de salir, HumanEval near-saturated. Elige cualquier benchmark y la herramienta retorna top-3 frontier scores, spread, media, y un veredicto — saturated / near-saturated / discriminative — más un reemplazo recomendado (ej. MMLU → MMLU-Pro / GPQA / HLE). Fetch en vivo desde DemandSphere AI Frontier Tracker (CC BY-NC 4.0) cuando llega; snapshot baked 2026-05-05 cuando no. <em>Caso de uso</em>: antes de citar '92% en MMLU' o diseñar una eval, verifica si el benchmark aún discrimina algo.",
     "inv.v08.saturation":          "<strong>📈 Saturation</strong> — ¿sigue siendo útil tu benchmark, o están todos los frontiers empatados arriba?",
+    "inv.v081.hub":                "<strong>🧭 Solutions Hub</strong> — cada pain documentado mapeado a un mode tafagent o herramienta externa curada. No reinventes — encuentra.",
+    "help.v081.hub.title":         "🧭 Solutions Hub",
+    "help.v081.hub.body":          "tafagent como integrador, no silo. 30+ pains en 7 categorías (eval reliability · diagnósticos · setup · training · retrieval · multimodal · observability), cada uno mapeado a (a) el mode tafagent que lo resuelve, si existe, y (b) las herramientas externas best-of-breed que la comunidad ya usa (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Caja de búsqueda matchea pain, scenario, y nombre de herramienta. <em>Caso de uso</em>: 'tengo problema X — ¿lo resuelve tafagent, y si no, quién?'",
+    "hub.title":                   "🧭 Solutions Hub",
+    "hub.tip":                     "Mapa de cada pain de LLM-eval documentado: qué mode tafagent lo resuelve (si alguno), y las herramientas externas best-of-breed que la comunidad ya usa. Objetivo: cobertura total. Si la herramienta canónica existe en otra parte, enlazamos en vez de rebuildear.",
+    "hub.desc":                    "<strong>No reinventes — encuentra.</strong> 30+ pains mapeados a modes tafagent + herramientas externas curadas. Navega por categoría, busca por keyword, o ve los huecos donde nuevos modes ayudarían más.",
+    "hub.clear_btn":               "✕ Limpiar",
+    "hub.no_mode":                 "externo",
+    "hub.planned":                 "planeado:",
+    "hub.best_for":                "Mejor para",
+    "hub.not_for":                 "No para",
+    "hub.tools":                   "Herramientas externas",
+    "hub.status.loaded":           "✅ Cargados {total} pains en {categories} categorías — {covered} cubiertos por modes tafagent, {externalLinks} enlaces externos curados. Compilado {compiled}.",
+    "hub.status.fail":             "⚠ No se pudo cargar Solutions Hub.",
+    "hub.search.empty":            "Sin coincidencias para '{query}'. Prueba términos más amplios (ej. 'eval', 'rag', 'tokenizer').",
+    "hub.search.results":          "Encontradas {n} coincidencia(s) para '{query}'.",
     // v0.7.7 — Tiles de tareas (UX restructure: 14 modos agrupados por intención)
     "tiles.title":                 "🎯 ¿Qué quieres hacer?",
     "mode_desc.niah":              "Prédit les taux de réussite NIAH (retrieval) et reasoning multi-hop à n'importe quel contexte. Résout : les modèles long-context passent souvent NIAH mais échouent au reasoning au même contexte (paper RULER).",
     "modes.saturation":            "📈 Saturation",
     "mode_desc.saturation":        "Indique si un benchmark discrimine encore les frontier models ou s'il est saturé (ex. MMLU 88-94% top, AIME 2025 déjà 96-100%). Retourne top-3 + verdict + remplacements recommandés.",
+    "modes.hub":                   "🧭 Solutions",
+    "mode_desc.hub":               "Carte de chaque problème documenté de LLM-eval → mode tafagent (si couvert) + outils externes curés. Trouvez la solution sans la réinventer. 30+ pains, 7 catégories.",
     "niah.title":                  "🔍 Gap NIAH → Reasoning",
     "niah.tip":                    "NIAH (Needle in a Haystack) teste le retrieval : 'trouve ce fait dans un long texte'. Le reasoning multi-hop teste l'inférence : 'combine les faits X+Y au début avec le fait Z à la fin'. Le paper RULER (NVIDIA 2024) montre que les modèles long-context passent souvent NIAH mais échouent au reasoning au même contexte. Cet outil prédit les deux taux à partir de la seule architecture.",
     "niah.desc":                   "<strong>Votre modèle revendique 128k de contexte. Va-t-il vraiment raisonner à 64k, ou seulement retrouver ?</strong> Collez un model id HF et un contexte cible — l'outil prédit les taux de réussite NIAH et reasoning multi-hop, le gap, et un 'contexte sûr' où le reasoning reste ≥65%.",
     "help.v08.saturation.title":   "📈 Détecteur de saturation des benchmarks",
     "help.v08.saturation.body":    "MMLU est saturé (top 88-94%), AIME 2025 saturé en quelques mois après sa sortie, HumanEval presque saturé. Choisissez un benchmark et l'outil retourne top-3 frontier scores, spread, moyenne, et un verdict — saturated / near-saturated / discriminative — plus un remplacement recommandé (ex. MMLU → MMLU-Pro / GPQA / HLE). Fetch en direct depuis DemandSphere AI Frontier Tracker (CC BY-NC 4.0) si accessible ; snapshot baked 2026-05-05 sinon. <em>Cas d'usage</em> : avant de citer '92% sur MMLU' ou de concevoir une eval, vérifiez si le benchmark discrimine encore quelque chose.",
     "inv.v08.saturation":          "<strong>📈 Saturation</strong> — votre benchmark est-il encore utile, ou tous les frontiers sont-ils à égalité au sommet ?",
+    "inv.v081.hub":                "<strong>🧭 Solutions Hub</strong> — chaque pain documenté mappé à un mode tafagent ou outil externe curé. Ne réinventez pas — trouvez.",
+    "help.v081.hub.title":         "🧭 Solutions Hub",
+    "help.v081.hub.body":          "tafagent comme intégrateur, pas silo. 30+ pains à travers 7 catégories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), chacun mappé à (a) le mode tafagent qui le résout, s'il existe, et (b) les outils externes best-of-breed que la communauté utilise déjà (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). La barre de recherche matche pain, scénario, et nom d'outil. <em>Cas d'usage</em> : 'j'ai le problème X — tafagent le résout-il, et sinon, qui ?'",
+    "hub.title":                   "🧭 Solutions Hub",
+    "hub.tip":                     "Carte de chaque pain de LLM-eval documenté : quel mode tafagent l'adresse (si applicable), et les outils externes best-of-breed que la communauté utilise déjà. Objectif : couverture totale. Si l'outil canonique existe ailleurs, nous lions plutôt que de reconstruire.",
+    "hub.desc":                    "<strong>Ne réinventez pas — trouvez.</strong> 30+ pains mappés à des modes tafagent + outils externes curés. Naviguez par catégorie, recherchez par mot-clé, ou voyez les lacunes où de nouveaux modes aideraient le plus.",
+    "hub.clear_btn":               "✕ Effacer",
+    "hub.no_mode":                 "externe",
+    "hub.planned":                 "prévu :",
+    "hub.best_for":                "Idéal pour",
+    "hub.not_for":                 "Pas pour",
+    "hub.tools":                   "Outils externes",
+    "hub.status.loaded":           "✅ Chargés {total} pains dans {categories} catégories — {covered} couverts par des modes tafagent, {externalLinks} liens externes curés. Compilé {compiled}.",
+    "hub.status.fail":             "⚠ Impossible de charger Solutions Hub.",
+    "hub.search.empty":            "Aucune correspondance pour '{query}'. Essayez des termes plus larges (ex. 'eval', 'rag', 'tokenizer').",
+    "hub.search.results":          "{n} correspondance(s) trouvée(s) pour '{query}'.",
     // v0.7.7 — Tuiles de tâches (refonte UX : 14 modes regroupés par intention)
     "tiles.title":                 "🎯 Que voulez-vous faire ?",
     "mode_desc.niah":              "在任意上下文下预测 NIAH（检索）与多跳 reasoning 通过率。解决：长上下文模型常常通过 NIAH 但在同一上下文上 reasoning 失败（RULER 论文）。",
     "modes.saturation":            "📈 饱和度",
     "mode_desc.saturation":        "告诉你某个 benchmark 是否仍能区分 frontier 模型，或者已经饱和（例如 MMLU 88-94% 顶部，AIME 2025 已经 96-100%）。返回 top-3 + 判定 + 推荐替代品。",
+    "modes.hub":                   "🧭 方案",
+    "mode_desc.hub":               "每个 LLM-eval 问题的地图 → tafagent 模式（若覆盖）+ 精选外部工具。找到方案而非重新发明。30+ 问题，7 类别。",
     "niah.title":                  "🔍 NIAH → Reasoning Gap",
     "niah.tip":                    "NIAH（Needle in a Haystack）测试检索：\"在长文本中找到这个事实\"。多跳 reasoning 测试推理：\"把开头的事实 X+Y 与结尾的事实 Z 结合\"。RULER 论文（NVIDIA 2024）显示长上下文模型经常通过 NIAH 但在相同上下文上 reasoning 失败。本工具仅根据架构预测两种通过率。",
     "niah.desc":                   "<strong>你的模型声称 128k 上下文。它在 64k 是真的能 reasoning，还是只能检索？</strong>粘贴 HF 模型 id 和目标 eval 上下文 — 工具预测 NIAH 与多跳 reasoning 通过率、gap，以及 reasoning 保持 ≥65% 的 \"安全上下文\"。",
     "help.v08.saturation.title":   "📈 Benchmark 饱和度检测器",
     "help.v08.saturation.body":    "MMLU 已饱和（top 88-94%），AIME 2025 上线几个月就饱和，HumanEval 接近饱和。选任何 benchmark，工具返回 top-3 frontier 分数、spread、平均，以及判定 — saturated / near-saturated / discriminative — 加上推荐替代品（例如 MMLU → MMLU-Pro / GPQA / HLE）。可达时从 DemandSphere AI Frontier Tracker（CC BY-NC 4.0）实时 fetch；不可达时使用 2026-05-05 的 baked 快照。<em>用例</em>：在引用\"92% on MMLU\"或设计 eval 之前，检查 benchmark 是否仍能区分任何东西。",
     "inv.v08.saturation":          "<strong>📈 Saturation</strong> — 你的 benchmark 还有用吗，还是所有 frontier 都在顶部并列？",
+    "inv.v081.hub":                "<strong>🧭 Solutions Hub</strong> — 每个文档化的问题都映射到一个 tafagent 模式或精选外部工具。别重复发明 — 去找。",
+    "help.v081.hub.title":         "🧭 Solutions Hub",
+    "help.v081.hub.body":          "tafagent 作为集成者而非孤岛。30+ 问题跨 7 类别（评估可靠性 · 诊断 · 设置 · 训练 · 检索 · 多模态 · 可观测性），每个映射到（a）解决它的 tafagent 模式（若存在），以及（b）社区已信任的最佳外部工具（RAGAS、MTEB、HELM、MCP Schema Validator、llm-stats、llguidance、GlitchMiner 等）。搜索框匹配 pain、场景和工具名称。<em>用例</em>：'我有问题 X — tafagent 解决它吗，如果不，谁解决？'",
+    "hub.title":                   "🧭 Solutions Hub",
+    "hub.tip":                     "我们已知的每个 LLM-eval 问题的地图：哪个 tafagent 模式能解决它（若有），以及社区已信任的最佳外部工具。目标：全覆盖。如果规范工具已在别处，我们链接而非重建。",
+    "hub.desc":                    "<strong>别重新发明 — 去找。</strong>30+ 问题映射到 tafagent 模式 + 精选外部工具。按类别浏览、按关键字搜索，或查看新模式最有帮助的空缺。",
+    "hub.clear_btn":               "✕ 清空",
+    "hub.no_mode":                 "外部",
+    "hub.planned":                 "计划：",
+    "hub.best_for":                "适合",
+    "hub.not_for":                 "不适合",
+    "hub.tools":                   "外部工具",
+    "hub.status.loaded":           "✅ 已加载 {total} 个问题，跨 {categories} 类别 — {covered} 个由 tafagent 模式覆盖，精选 {externalLinks} 个外部链接。编译于 {compiled}。",
+    "hub.status.fail":             "⚠ 无法加载 Solutions Hub。",
+    "hub.search.empty":            "无 '{query}' 的匹配。尝试更宽泛的词（如 'eval'、'rag'、'tokenizer'）。",
+    "hub.search.results":          "为 '{query}' 找到 {n} 个匹配。",
     // v0.7.7 — 任务卡片（UX 重构：按用户意图分组的 14 个模式）
     "tiles.title":                 "🎯 你想做什么？",

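The `hub.status.loaded` and `hub.search.*` strings in js/i18n.js carry `{total}`, `{categories}`, `{n}`, and `{query}` placeholders that `tFmt` fills in at render time. The implementation of `tFmt` is not part of this diff, so the following is only a sketch of how such placeholder substitution could work; `formatTemplate` is a hypothetical stand-in, not the app's helper.

```javascript
// Hypothetical stand-in for tFmt: replaces {key} with params[key].
// Unknown placeholders are left untouched so a missing value stays visible.
function formatTemplate(template, params = {}) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(params, key) ? String(params[key]) : match
  );
}

// Shortened version of the hub.status.loaded string, with invented numbers.
const statusTemplate = "Loaded {total} pains across {categories} categories";
const statusMessage = formatTemplate(statusTemplate, { total: 30, categories: 7 });
```

One consequence for the real code: whatever object `hubStats()` returns needs keys that match the placeholder names in every language (`total`, `categories`, `covered`, `externalLinks`, `compiled`), otherwise the braces would be expected to show up in the status line.
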
js/main.js CHANGED

@@ -23,6 +23,10 @@ import {
   loadSaturationKB, classifyAll, classifyBenchmark,
   listBenchmarks, attribution as saturationAttribution, tryFetchLive,
 } from "./saturation_detector.js";
 // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
 // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
@@ -212,6 +216,7 @@ document.addEventListener("click", (e) => {
       template: "template-section", arena: "arena-section", contam: "contam-section",
       quant: "quant-section", drift: "drift-section", niah: "niah-section",
       saturation: "saturation-section",
     }[targetMode];
     if (sectionId) {
       const sec = document.getElementById(sectionId);
@@ -236,7 +241,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
      "diagnose-section", "phase-section", "unmask-section",
      "template-section", "arena-section", "contam-section",
      "quant-section", "drift-section", "niah-section",
-     "saturation-section"].forEach(id => {
       const el = $(id);
       if (el) el.style.display = "none";
     });
@@ -248,12 +253,14 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
       template: "template-section", arena: "arena-section", contam: "contam-section",
       quant: "quant-section", drift: "drift-section", niah: "niah-section",
       saturation: "saturation-section",
     };
     const sectionId = sectionMap[mode];
     if (sectionId) $(sectionId).style.display = "";
     $("mode-desc").textContent = t(`mode_desc.${mode}`) || "";
     if (mode === "phase") initPhaseDiagram();
     if (mode === "saturation") initSaturation();
   });
 });
@@ -3277,6 +3284,106 @@ function runSaturationAll() {
 $("saturation-run-btn")?.addEventListener("click", runSaturationOne);
 $("saturation-all-btn")?.addEventListener("click", runSaturationAll);
 // ════════════════════════════════════════════════════════════════════
 // Bootstrap
 // ════════════════════════════════════════════════════════════════════

   loadSaturationKB, classifyAll, classifyBenchmark,
   listBenchmarks, attribution as saturationAttribution, tryFetchLive,
 } from "./saturation_detector.js";
+import {
+  loadHub, listCategories, listEntries, searchEntries,
+  hubStats, getCategoryMeta,
+} from "./solutions_hub.js";
 // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
 // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
       template: "template-section", arena: "arena-section", contam: "contam-section",
       quant: "quant-section", drift: "drift-section", niah: "niah-section",
       saturation: "saturation-section",
+      hub: "hub-section",
     }[targetMode];
     if (sectionId) {
       const sec = document.getElementById(sectionId);
      "diagnose-section", "phase-section", "unmask-section",
      "template-section", "arena-section", "contam-section",
      "quant-section", "drift-section", "niah-section",
+     "saturation-section", "hub-section"].forEach(id => {
       const el = $(id);
       if (el) el.style.display = "none";
     });
       template: "template-section", arena: "arena-section", contam: "contam-section",
       quant: "quant-section", drift: "drift-section", niah: "niah-section",
       saturation: "saturation-section",
+      hub: "hub-section",
     };
     const sectionId = sectionMap[mode];
     if (sectionId) $(sectionId).style.display = "";
     $("mode-desc").textContent = t(`mode_desc.${mode}`) || "";
     if (mode === "phase") initPhaseDiagram();
     if (mode === "saturation") initSaturation();
+    if (mode === "hub") initHub();
   });
 });
 $("saturation-run-btn")?.addEventListener("click", runSaturationOne);
 $("saturation-all-btn")?.addEventListener("click", runSaturationAll);
+// ════════════════════════════════════════════════════════════════════
+// 🧭 Solutions Hub (v0.8.1) — integrator portal
+// ════════════════════════════════════════════════════════════════════
+const HUB_TYPE_BADGE = {
+  tool: "🔧",
+  leaderboard: "📊",
+  paper: "📄",
+  article: "📝",
+  docs: "📘",
+  issue: "🐛",
+  spec: "📐",
+  benchmark: "🧪",
+};
+let __hubInited = false;
+async function initHub() {
+  if (__hubInited) return;
+  __hubInited = true;
+  try {
+    await loadHub();
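+  } catch (e) {
+    // Reset the guard so a transient load failure can be retried the next time the tab is opened.
+    __hubInited = false;
+    $("hub-status").textContent = (t("hub.status.fail") || "⚠ Could not load Solutions Hub.") + " " + (e.message || e);
+    return;
+  }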
+  const stats = hubStats();
+  $("hub-status").textContent = tFmt("hub.status.loaded", stats);
+  renderHubAll();
+}
+function renderEntry(e) {
+  const modeBadge = e.tafagent_mode
+    ? `<span class="badge" style="background:#3fb950;">${e.tafagent_mode}</span>`
+    : (e.tafagent_planned_mode
+        ? `<span class="badge" style="background:#d29922;">${t("hub.planned") || "planned:"} ${e.tafagent_planned_mode}</span>`
+        : `<span class="badge" style="background:#6e7781;">${t("hub.no_mode") || "external"}</span>`);
+  const tools = (e.external_tools || [])
+    .map(tl => {
+      const icon = HUB_TYPE_BADGE[tl.type] || "🔗";
+      return `<li>${icon} <a href="${tl.url}" target="_blank" rel="noopener noreferrer">${tl.name}</a> <span class="subtle" style="font-size:0.82em;">(${tl.type})</span></li>`;
+    })
+    .join("");
+  const bestFor = e.best_for ? `<p><strong>${t("hub.best_for") || "Best for"}:</strong> ${e.best_for}</p>` : "";
+  const notFor = e.not_for ? `<p><strong>${t("hub.not_for") || "Not for"}:</strong> ${e.not_for}</p>` : "";
+  return `
+    <details class="unmask-panel" style="margin: 0.5em 0;">
+      <summary class="unmask-panel-title">${e.pain} ${modeBadge}</summary>
+      ${bestFor}
+      ${notFor}
+      ${tools ? `<p><strong>${t("hub.tools") || "External tools"}:</strong></p><ul>${tools}</ul>` : ""}
+    </details>
+  `;
+}
+function renderHubAll() {
+  const cats = listCategories();
+  const html = cats.map(c => {
+    const entries = listEntries(c.key);
+    if (entries.length === 0) return "";
+    const inner = entries.map(renderEntry).join("");
+    return `
+      <details class="unmask-panel" open style="margin-top: 1em;">
+        <summary class="unmask-panel-title" style="font-size:1.05em;">
+          ${c.icon} ${c.label} <span class="subtle" style="font-size:0.85em;">(${c.count})</span>
+        </summary>
+        <p class="recipe-desc" style="font-style:italic;">${c.description}</p>
+        ${inner}
+      </details>
+    `;
+  }).join("");
+  $("hub-output").innerHTML = `<div class="arena-result">${html}</div>`;
+}
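+function renderHubSearch(query) {
+  const matches = searchEntries(query);
+  // The query is raw user input: set it via textContent so it is never parsed as HTML.
+  const summary = document.createElement("p");
+  summary.className = "recipe-desc";
+  if (matches.length === 0) {
+    summary.textContent = tFmt("hub.search.empty", { query });
+    $("hub-output").replaceChildren(summary);
+    return;
+  }
+  summary.textContent = tFmt("hub.search.results", { n: matches.length, query });
+  const wrap = document.createElement("div");
+  wrap.className = "arena-result";
+  wrap.append(summary);
+  wrap.insertAdjacentHTML("beforeend", matches.map(renderEntry).join(""));
+  $("hub-output").replaceChildren(wrap);
+}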
+let __hubSearchTimer = null;
+$("hub-search")?.addEventListener("input", (e) => {
+  clearTimeout(__hubSearchTimer);
+  const q = e.target.value;
+  __hubSearchTimer = setTimeout(() => {
+    if (!q.trim()) renderHubAll();
+    else renderHubSearch(q);
+  }, 200);
+});
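+$("hub-clear-btn")?.addEventListener("click", () => {
+  clearTimeout(__hubSearchTimer); // drop any pending debounced search so it cannot overwrite the reset view
+  $("hub-search").value = "";
+  renderHubAll();
+});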
 // ════════════════════════════════════════════════════════════════════
 // Bootstrap
 // ════════════════════════════════════════════════════════════════════
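
Review note on the hub rendering above: renderEntry and renderHubAll interpolate fields from data/solutions_hub.json (pain, best_for, tool name, url) directly into innerHTML. The file is curated in this repo, so this is likely fine in practice, but a stray `<` or `"` in an entry would break the markup. A minimal sketch of how the tool list item could be escaped follows. The names escapeHtml, safeHref, and renderToolItem are hypothetical and not part of this commit; whether main.js already has an equivalent helper is not visible in this diff.

```javascript
// Hypothetical helper (not in this commit): escape the HTML-significant characters.
function escapeHtml(value) {
  return String(value ?? "").replace(/[&<>"']/g, (ch) => ({
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  }[ch]));
}

// Hypothetical helper: only http(s) URLs reach the href attribute; anything else becomes "#".
function safeHref(url) {
  return /^https?:\/\//i.test(String(url ?? "")) ? escapeHtml(url) : "#";
}

// Sketch of the <li> built inside renderEntry, with escaping applied to every data field.
function renderToolItem(tl, icon = "🔗") {
  return `<li>${icon} <a href="${safeHref(tl.url)}" target="_blank" rel="noopener noreferrer">${escapeHtml(tl.name)}</a> (${escapeHtml(tl.type)})</li>`;
}
```

The same escapeHtml call would apply to e.pain, e.best_for, e.not_for, and the category label and description if this approach were adopted.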

js/solutions_hub.js ADDED

	@@ -0,0 +1,69 @@

+// Solutions Hub (v0.8.1)
+// tafagent as integrator/curator. Pain → tafagent mode (if shipped) +
+// external best-of-breed tools. Pure logic — no human strings; main.js
+// renders with i18n.
+let _hub = null;
+export async function loadHub(url = "./data/solutions_hub.json") {
+  if (_hub) return _hub;
+  const res = await fetch(url);
+  if (!res.ok) throw new Error(`Hub fetch failed: ${res.status}`);
+  _hub = await res.json();
+  return _hub;
+}
+export function getHub() { return _hub; }
+export function listCategories() {
+  if (!_hub) return [];
+  return Object.entries(_hub.categories).map(([key, meta]) => ({
+    key, ...meta,
+    count: _hub.entries.filter(e => e.category === key).length,
+  }));
+}
+export function listEntries(categoryKey = null) {
+  if (!_hub) return [];
+  return categoryKey
+    ? _hub.entries.filter(e => e.category === categoryKey)
+    : _hub.entries;
+}
+// Search across pain + best_for + not_for + tafagent_mode + tool names. Case-insensitive substring.
+export function searchEntries(query) {
+  if (!_hub || !query) return [];
+  const q = query.toLowerCase().trim();
+  if (!q) return [];
+  return _hub.entries.filter(e => {
+    const haystack = [
+      e.pain || "",
+      e.best_for || "",
+      e.not_for || "",
+      e.tafagent_mode || "",
+      ...(e.external_tools || []).map(t => t.name || ""),
+    ].join(" ").toLowerCase();
+    return haystack.includes(q);
+  });
+}
+export function getCategoryMeta(key) {
+  return _hub?.categories?.[key] || null;
+}
+// Stats for the inventory header.
+export function hubStats() {
+  if (!_hub) return null;
+  const entries = _hub.entries;
+  const covered = entries.filter(e => e.tafagent_mode).length;
+  const planned = entries.filter(e => e.tafagent_planned_mode).length;
+  const totalExternal = entries.reduce((acc, e) => acc + (e.external_tools?.length || 0), 0);
+  return {
+    total: entries.length,
+    covered,
+    planned,
+    externalLinks: totalExternal,
+    categories: Object.keys(_hub.categories).length,
+    compiled: _hub.compiled,
+  };
+}
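
Review note: data/solutions_hub.json is not shown in this part of the diff, so the sketch below reconstructs the shape the module appears to expect from the fields it reads (categories keyed by name, entries with category, pain, tafagent_mode, tafagent_planned_mode, external_tools, best_for, not_for, plus a top-level compiled value). All values are made-up placeholders, and searchIn and statsOf are hypothetical standalone copies of the searchEntries and hubStats logic, written to take the hub as an argument so they can run without fetch.

```javascript
// Assumed shape of the hub data, inferred from the fields solutions_hub.js accesses.
// Every value here is a placeholder, not an entry from the real file.
const sampleHub = {
  compiled: "YYYY-MM-DD",
  categories: {
    retrieval: { icon: "🔎", label: "Retrieval", description: "Placeholder description." },
    training: { icon: "🏋", label: "Training", description: "Placeholder description." },
  },
  entries: [
    {
      category: "retrieval",
      pain: "Placeholder pain A",
      tafagent_mode: "example-mode",
      best_for: "placeholder scenario",
      not_for: "placeholder exclusion",
      external_tools: [{ name: "ExampleTool", url: "https://example.com/tool", type: "tool" }],
    },
    {
      category: "training",
      pain: "Placeholder pain B",
      tafagent_mode: null,
      tafagent_planned_mode: "example-planned",
      external_tools: [{ name: "ExamplePaper", url: "https://example.com/paper", type: "paper" }],
    },
  ],
};

// Same case-insensitive substring match as searchEntries, against a passed-in hub.
function searchIn(hub, query) {
  const q = (query || "").toLowerCase().trim();
  if (!hub || !q) return [];
  return hub.entries.filter((e) => {
    const haystack = [
      e.pain || "",
      e.best_for || "",
      e.not_for || "",
      e.tafagent_mode || "",
      ...(e.external_tools || []).map((tool) => tool.name || ""),
    ].join(" ").toLowerCase();
    return haystack.includes(q);
  });
}

// Same counting as hubStats, against a passed-in hub.
function statsOf(hub) {
  const entries = hub.entries;
  return {
    total: entries.length,
    covered: entries.filter((e) => e.tafagent_mode).length,
    planned: entries.filter((e) => e.tafagent_planned_mode).length,
    externalLinks: entries.reduce((acc, e) => acc + (e.external_tools?.length || 0), 0),
    categories: Object.keys(hub.categories).length,
    compiled: hub.compiled,
  };
}
```

With this sample, searchIn(sampleHub, "exampletool") matches only the first entry, and statsOf(sampleHub) reports 2 entries, 1 covered, 1 planned, 2 external links, and 2 categories. Note that tafagent_planned_mode is not part of the search haystack in the committed code, so a planned mode name cannot be found through the search box.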