@ginigen-ai on Hugging Face: "🧠 Does your LLM know when it's about to be wrong? Most leaderboards measure…"

posted an update 1 day ago

🧠 Does your LLM know when it's about to be wrong?

Most leaderboards measure accuracy. We measure metacognition — whether a model catches its own errors. Benchmark + leaderboard + adapters, all open. 🎉

The surprise: even a K-AI #1 model (JGOS-31B-Citizen), the strongest on multiple-choice traps (trap_rate 0.005, about 2 misses in 400), is blind to its own free-form mistakes (self-confidence AUROC = 0.5, no better than random). A tiny base-frozen adapter recovers that signal.

Two independent axes (never compared across a row):
① trap_rate: does it fall for tempting trap options? (lower = stronger)
② adapter gain Δ: how much a lightweight adapter catches errors the model itself misses. (higher = more adapter value)

What's open:
📊 300+100 trap problems (each with a hidden trap + TICOS type)
🏆 24-model leaderboard
🧩 11 per-model adapters: adapters, NOT fine-tunes (base stays frozen; the adapter just reads the hidden state → P(wrong))
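To make "base stays frozen; the adapter just reads the hidden state → P(wrong)" concrete, here is a minimal sketch of the general idea as a logistic probe. It is an illustration only, not the released adapters: the probe architecture, the toy 2-dimensional "hidden states", and the names `train_probe` and `p_wrong` are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_wrong(w, b, h):
    """Probe output: estimated probability that the answer behind hidden state h is wrong."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def train_probe(hidden_states, was_wrong, lr=0.5, epochs=500):
    """Fit a logistic probe by batch gradient descent.

    Only the probe parameters (w, b) are updated. The hidden states are
    treated as fixed inputs, which is what "base stays frozen" means here.
    """
    dim = len(hidden_states[0])
    n = len(hidden_states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(hidden_states, was_wrong):
            err = p_wrong(w, b, h) - y  # gradient of log-loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: 2-d stand-ins for hidden states, label 1 = answer was wrong.
toy_hidden = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.1], [-0.8, 0.3]]
toy_wrong = [1, 1, 0, 0]
w, b = train_probe(toy_hidden, toy_wrong)
print(p_wrong(w, b, [1.0, 0.0]) > p_wrong(w, b, [-1.0, 0.0]))
```

In a real setup the inputs would be hidden states extracted from the frozen model at answer time, and the probe would be evaluated on held-out problems rather than on its training set.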

Submit any HF model → auto-scored daily at 09:00 KST and added to the board.

🏆 Leaderboard → ginigen-ai/Metacognition-Leaderboard-Space

📊 Benchmark → ginigen-ai/Metacognition-Bench

🧩 Adapters → FINAL-Bench/metacognition-adapters-6a42c032e6beb803dd032961

📊 Article → https://huggingface.co/blog/ginigen-ai/metacognition

Benchmark by ginigen-ai · Adapters by FINAL-Bench (Darwin/Chimera platform + AETHER metacognition tech).

dipankarsarkar

about 23 hours ago

Accuracy is the wrong headline here, and you named it. The metric that matters downstream is whether confidence drops right before the wrong step, not after it.

In an agent loop that gap is the whole game. A model that knows it is unsure stops and re-plans. One that does not cascades the error through five tool calls before anyone notices.

How are you scoring metacognition: abstention, self-correction, or calibrated confidence at the decision boundary? Those three reward very different models.

ginigen-ai

about 8 hours ago

Exactly — and to answer directly: we score calibrated confidence at the decision boundary, not abstention or post-hoc self-correction. The adapter reads the hidden state at the moment the answer is produced and emits P(wrong); we report the AUROC of that signal vs. actual correctness. So it's precisely the "confidence drops right before the wrong step" signal — measured predictively at generation time, not after the fact.
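For readers who want the metric spelled out, here is a minimal sketch of AUROC for a P(wrong) signal against actual correctness, using the pairwise (Mann-Whitney) definition. The function name `auroc`, the toy numbers, and the reading of adapter Δ as a simple AUROC difference are assumptions for illustration, not the benchmark's scoring code.

```python
def auroc(p_wrong_scores, was_wrong):
    """Probability that a randomly chosen wrong answer gets a higher
    P(wrong) than a randomly chosen correct one (ties count as 0.5)."""
    pos = [s for s, y in zip(p_wrong_scores, was_wrong) if y == 1]
    neg = [s for s, y in zip(p_wrong_scores, was_wrong) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one wrong and one correct answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: label 1 = the free-form answer was wrong.
was_wrong = [1, 0, 1, 0]
self_conf = [0.5, 0.5, 0.5, 0.5]     # uninformative self-report
adapter_out = [0.9, 0.2, 0.7, 0.4]   # made-up adapter P(wrong)

self_auc = auroc(self_conf, was_wrong)       # 0.5: every pair is a tie
adapter_auc = auroc(adapter_out, was_wrong)  # 1.0: every wrong answer outranks every correct one
# Assumption: Δ read as the AUROC difference; the leaderboard may define it differently.
delta = adapter_auc - self_auc
print(self_auc, adapter_auc, delta)
```

An AUROC of 0.5 means the signal ranks wrong and correct answers no better than chance, which is the situation described for self-confidence on free-form answers.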

Two axes we keep separate on purpose: trap_rate (single-step discrimination — does it resist the tempting distractor) and self-confidence AUROC / adapter Δ (can the internal state flag its own error). The JGOS-31B result is the whole point — near-perfect trap discrimination (0.005) yet AUROC ≈ 0.5 on free-form: it doesn't know when it's wrong, and a base-frozen adapter recovers a usable signal.

Where you're right and we don't score it yet: agent-loop self-correction — the "stop and re-plan vs. cascade through five tool calls" behavior. Our signal is single-step at the boundary; extending it to a multi-step abstention/re-plan axis is the natural next benchmark, and it's the version that actually bites in production. Would genuinely value your input on operationalizing that.

NovusEdge

about 20 hours ago

We're building something complementary to this at https://engrammic.ai

You down to have a chat sometime about metacognition and externalized epistemics? 😄

ginigen-ai

about 8 hours ago

Appreciate it — sounds genuinely complementary. Ours is internal metacognition (the model's own hidden state flagging P(wrong)); externalized epistemics is the external side — and an internal "I might be wrong here" signal is exactly what should trigger an external epistemic lookup. Happy to chat. Easiest is to reach us through the ginigen-ai HF org and we'll set something up. 😄

dipankarsarkar

38 minutes ago

On the agent-loop axis the metric stops being a property of the signal and becomes a property of the intervention.

At the boundary you score AUROC of P(wrong). In a loop that is necessary but not sufficient. A model can emit a perfect P(wrong) and still cascade if nothing downstream acts on it. So I would score the flag by what it changes, not by how cleanly it fires.

Concretely: same task, flag-gated re-plan on versus off. Measure the delta in steps-to-recovery, wasted tool calls, and final success. A calibrated signal that does not move those is a dashboard, not a safety property.

The trap is counterfactual isolation. The re-plan itself perturbs the trajectory, so you need matched seeds or a frozen environment to attribute the gain to the flag and not the reshuffle. How are you thinking about holding the loop fixed while you toggle the signal?
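One way to picture the matched-seed comparison is the toy sketch below. Everything in it is hypothetical: the environment, the 0.8 error cutoff, the perfectly informative flag, and the names `make_env_noise`, `run_episode`, and `paired_deltas`. The point is only the design: exogenous randomness is sampled once per seed and replayed in both arms, so a re-plan cannot reshuffle what the environment does next. Steps-to-recovery is not modeled here.

```python
import random
from statistics import mean

def make_env_noise(seed, n_steps):
    """Frozen environment: per-step difficulty is fixed by the seed up front."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_steps)]

def run_episode(noise, gate_on, threshold=0.8):
    """Toy agent loop. A step goes wrong when difficulty > 0.8 and the error
    then cascades: every later tool call is wasted. With the gate on, a step
    whose flag exceeds the threshold is re-planned instead of executed."""
    wasted, replans, failed = 0, 0, False
    for difficulty in noise:
        if failed:
            wasted += 1          # call made after the trajectory is already broken
            continue
        flag = difficulty        # assumption: the flag is perfectly informative
        if gate_on and flag > threshold:
            replans += 1         # re-planned step avoids the error in this toy
            continue
        if difficulty > 0.8:
            failed = True
            wasted += 1
    return {"success": not failed, "wasted_calls": wasted, "replans": replans}

def paired_deltas(seeds, n_steps=5):
    """Gate-on minus gate-off, per seed, on identical environment noise."""
    out = []
    for seed in seeds:
        noise = make_env_noise(seed, n_steps)
        off = run_episode(noise, gate_on=False)
        on = run_episode(noise, gate_on=True)
        out.append({
            "success": int(on["success"]) - int(off["success"]),
            "wasted_calls": on["wasted_calls"] - off["wasted_calls"],
        })
    return out

deltas = paired_deltas(range(50))
print(mean(d["success"] for d in deltas), mean(d["wasted_calls"] for d in deltas))
```

Because both arms replay the same `noise` list, any per-seed difference comes from the gate rather than from a different draw of the environment. A real agent loop would also need to pin sampling temperature and tool responses to get the same property.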
