ginipick

AI & ML interests

None yet

Recent Activity

updated a Space about 10 hours ago

ginigen/Korean-Hallucination-Leaderboard

published a Space about 10 hours ago

ginigen/Korean-Hallucination-Leaderboard

updated a dataset about 10 hours ago

ginigen/Korean-Hallucination-Bench


updated a Space about 10 hours ago

Korean Hallucination Leaderboard

🔥

View Korean Hallucination Resistance Leaderboard

published a Space about 10 hours ago

Korean Hallucination Leaderboard

🔥

View Korean Hallucination Resistance Leaderboard

updated a dataset about 10 hours ago

ginigen/Korean-Hallucination-Bench

Viewer • Updated about 10 hours ago • 10.2k • 2

published a dataset about 10 hours ago

ginigen/Korean-Hallucination-Bench

Viewer • Updated about 10 hours ago • 10.2k • 2

updated a Space about 13 hours ago

GiniGen AI

⚡

Korea's integrated AI platform for diagnosis, simulation, and generation

liked a model 3 days ago

FINAL-Bench/Darwin-398B-JGOS

Text Generation • 403B • Updated about 14 hours ago • 150 • 26

upvoted an article 3 days ago

Article

FINAL-Bench Quantum: An Open, Neutral Benchmark for Quantum-Computing Methods

FINAL-Bench • 3 days ago • 17

reacted to SeaWolf-AI's post with 🔥 3 days ago

Post

6705

🚀 Introducing FINAL-Bench Quantum — an open, neutral benchmark that finally puts quantum-computing methods on one fair yardstick.

Quantum results are notoriously hard to compare. The same "logical error rate" or "query fidelity" means very different things depending on the code, noise model, hardware, and shot count. FINAL-Bench Quantum fixes that: five events judged under identical, published protocols, where every number is labeled as either measured here or quoted from a source.

Five events: ① QEC Decoder ② Optimization (Max-Cut) ③ VQE ④ QRAM ⑤ Quantum Simulation

The rules are simple and strict:
✅ Track A (measured here, with 95% confidence intervals) is kept separate from Track B (quoted from papers, not directly comparable).
🔬 Simulation and real hardware are clearly distinguished, and no quantum-advantage claims are made.
🌍 Methods from Google, IBM, NVIDIA, USTC, Riverlane and more sit side by side, with origin flags and author credits.
📤 Anyone can submit their own method via the Submit tab for review and listing.
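
To make the Track A rule concrete, here is a minimal sketch of how a measured error rate could be reported with a 95% confidence interval. It uses the Wilson score interval, which is an assumption: the post does not say which interval method the leaderboard uses. The function name and the counts (37 failures in 10,000 shots) are illustrative only, not leaderboard data.

```python
import math


def wilson_interval(failures: int, shots: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a 95% confidence level.
    """
    if shots <= 0:
        raise ValueError("shots must be positive")
    p_hat = failures / shots
    denom = 1 + z * z / shots
    centre = (p_hat + z * z / (2 * shots)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / shots + z * z / (4 * shots * shots)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Illustrative counts only (not taken from the leaderboard).
low, high = wilson_interval(failures=37, shots=10_000)
print(f"logical error rate: {37 / 10_000:.4f}  95% CI: [{low:.4f}, {high:.4f}]")
```

The interval narrows as the shot count grows, which is one reason the same nominal error rate can mean different things at different shot counts.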

Already on the board: real IBM Heron r2 measurements (repetition-code distance boundary, 29–175× error reduction from d3 to d5), a real-chip QRAM query fidelity of 0.92, and H₂ VQE at chemical accuracy — always labeled honestly as simulation vs hardware.

A leaderboard is only useful if you can trust it, so neutrality is the whole point: strong competitors stay in even when they beat the host, sources are quoted faithfully, and a simulation is never rounded up into a hardware claim.

Leaderboard: FINAL-Bench/quantum-bench-leaderboard
Article: https://huggingface.co/blog/FINAL-Bench/quantum-leaderboard

#quantum #QEC #QuantumComputing #benchmark

2 replies

liked a Space 3 days ago

FINAL-Bench Quantum Leaderboard

⚛

Neutral quantum-method benchmark — QEC decoders & more

liked a model 4 days ago

FINAL-Bench/Darwin-28B-Coder-GGUF

Text Generation • 27B • Updated 1 day ago • 19.7k • 21

liked a model 10 days ago

JGOS-Model/JGOS-31B-Citizen

Image-Text-to-Text • 31B • Updated 9 days ago • 171 • 20

reacted to SeaWolf-AI's post with 🧠 20 days ago

Post

4245

Darwin-60B-DUO: Two SOTAs, One Endpoint — 88.38% on GPQA Diamond 🚀

We're excited to release Darwin-60B-DUO, the Darwin family's first DUO model. Take two domain-verified specialists, hide them behind a single OpenAI-compatible endpoint, and let a router decide which one (or both) answers. You see one model, one API — but get the best of both.

The number that matters: on the full 198-question GPQA Diamond, Darwin-60B-DUO hits 88.38%. The constituents alone land at 69.70% (Darwin-28B-REASON) and 77.27% (AWAXIS-Think-31B); a naive cascade only reaches 83.84%. The DUO clears them all. Two small specialists, intelligently routed, beat one big generalist on cost and quality. Both are independently verified — Darwin-28B-REASON is #3 on the HF GPQA Diamond leaderboard, AWAXIS-Think-31B is #1 on Korea's national K-AI Leaderboard (MSIT).
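
As a quick sanity check on these figures, the sketch below converts each reported percentage into the number of correct answers it implies on a 198-question set. The counts are inferred from the percentages, not quoted from the post, and `implied_correct` is a hypothetical helper name.

```python
TOTAL = 198  # size of the full GPQA Diamond set, as stated in the post

# Percentages as reported in the post.
reported = {
    "Darwin-60B-DUO": 88.38,
    "Darwin-28B-REASON": 69.70,
    "AWAXIS-Think-31B": 77.27,
    "naive cascade": 83.84,
}


def implied_correct(pct: float, total: int = TOTAL) -> int:
    """Nearest whole number of correct answers consistent with a percentage."""
    return round(pct / 100 * total)


for name, pct in reported.items():
    n = implied_correct(pct)
    print(f"{name}: about {n}/{TOTAL} = {100 * n / TOTAL:.2f}%")
```

Under this reading, the DUO answers roughly 22 more questions correctly than its stronger constituent and roughly 9 more than the naive cascade.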

The brain is a Hybrid-A router that picks one of five strategies on the fly. Korean → AWAXIS, English/STEM → Darwin (single-backend, ~70% of traffic at 1× cost). When a Korean answer needs rigorous English reasoning, split_refine fires: Darwin drafts, AWAXIS polishes. MCQ/short-answer prompts run on both backends with self-consistency + cross-verify. Net effective cost: only ~1.3× a single 30B model.
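
The routing rules above can be sketched roughly as follows. This is not the project's router.py: only `split_refine` is a name from the post, while `pick_strategy`, the other strategy labels, the Hangul-based language check, and the `is_mcq` / `needs_rigor` flags are hypothetical, and the sketch covers four strategies rather than the five the post mentions. The cost helper assumes every non-single-backend request costs exactly 2×, which the post does not state.

```python
import re

# Precomposed Hangul syllables; a crude stand-in for real language detection.
HANGUL = re.compile(r"[\uac00-\ud7a3]")


def pick_strategy(prompt: str, is_mcq: bool = False, needs_rigor: bool = False) -> str:
    """Return a strategy label for one request (labels are illustrative)."""
    if is_mcq:
        return "both_cross_verify"  # hypothetical label: run both backends, then cross-check
    korean = bool(HANGUL.search(prompt))
    if korean and needs_rigor:
        return "split_refine"       # named in the post: Darwin drafts, AWAXIS polishes
    return "awaxis_only" if korean else "darwin_only"


def expected_cost(single_share: float = 0.70, single_cost: float = 1.0,
                  dual_cost: float = 2.0) -> float:
    """Average cost per request, relative to one single-backend call."""
    return single_share * single_cost + (1 - single_share) * dual_cost


print(pick_strategy("Explain the second law of thermodynamics."))
print(pick_strategy("양자 얽힘을 설명해 주세요.", needs_rigor=True))
print(f"expected cost: {expected_cost():.2f}x")
```

With 70% of traffic on one backend and the rest on two, the average works out to about 1.3×, which matches the figure quoted in the post.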

The part the community will care about: the gateway is model-agnostic and Apache-2.0. Point it at any two OpenAI-compatible backends and you've got a DUO in minutes — teach router.py when to use which, and parallel calls, response merging, and routing transparency via _duo_route are handled for you. Fork it and tell us what you built.

Painless deploy: docker compose up for both vLLM backends + gateway; FP8 ~30GB colocates on a single B200/H100. One git clone (~120GB). Text-only for now, streaming in v1.1.
Two SOTAs, one endpoint. Come build your own on the Community tab.

👇
🔗 FINAL-Bench/Darwin-60B-DUO

liked a model 20 days ago

FINAL-Bench/Darwin-60B-DUO

Text Generation • Updated 14 days ago • 523 • 32

liked a Space about 1 month ago

Darwin 9B NEG

🧬

Darwin-9B-NEG reasoning model — API-served chat demo

upvoted a paper about 1 month ago

Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning

Paper • 2605.14386 • Published May 14 • 62

updated a Space about 1 month ago

AI

📉

Explore daily AI news, models, and spaces with quick summaries

updated a Space 2 months ago

AI News Daily

🌐

Automatically collects AI trend articles from around the world (up to 50)

published a Space 2 months ago

AI News Daily

🌐

Automatically collects AI trend articles from around the world (up to 50)

reacted to SeaWolf-AI's post with 👀 2 months ago

Post

3056

Why This Matters — David Defeats Goliath

MODEL: FINAL-Bench/Darwin-4B-David
SPACE: FINAL-Bench/Darwin-4B-david

We're releasing Darwin-4B-David, the first second-generation model in the Darwin Opus family. By evolving an already-evolved model, it achieves 85.0% on GPQA Diamond with just 4.5B parameters, surpassing both its original ancestor (58.6%) and gemma-4-31B (84.3%).

Second-Generation Evolution
Most merges start from a base model and produce a single offspring. Darwin-4B-David breaks this pattern. The Father (Darwin-4B-Opus) was already evolved from gemma-4-E4B-it with Claude Opus reasoning distillation — a Gen-1 model. The Mother (DavidAU's DECKARD-Expresso-Universe) brings Unsloth deep tuning across 5 in-house datasets with thinking mode by default. Crossbreeding these two produced the first Gen-2 Darwin model.

Darwin V6's Model MRI scanned both parents across all 42 layers, assigning independent optimal ratios per layer. The Mother's creativity and Korean language hotspot (Layer 22-25, weight 0.95) was maximally absorbed, while the Father's reasoning core (Layer 30-40, weight 0.48) was preserved. This is "Merge = Evolve" applied recursively — evolution of evolution.
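
To illustrate what "independent ratios per layer" can mean, here is a toy sketch that blends two parents with a different ratio per layer. It uses plain linear interpolation, not the MRI-guided DARE-TIES procedure the post describes, and it reads both quoted weights as the Mother's share of that layer, which the post does not state explicitly. The 0-based layer indexing, the 0.5 default for other layers, and the two-element toy weights are also assumptions.

```python
NUM_LAYERS = 42  # layer count quoted in the post


def mother_share(layer: int, default: float = 0.5) -> float:
    """Mother's blend ratio for a layer (assumed reading of the quoted weights)."""
    if 22 <= layer <= 25:
        return 0.95  # quoted for the creativity / Korean language hotspot
    if 30 <= layer <= 40:
        return 0.48  # quoted for the reasoning core
    return default   # assumption: the post gives no ratio for other layers


def merge_layer(father: list[float], mother: list[float], share: float) -> list[float]:
    """Element-wise linear interpolation between two parents' weights."""
    return [(1 - share) * f + share * m for f, m in zip(father, mother)]


# Toy parents: every father weight is 1.0 and every mother weight is 0.0,
# so each merged value equals the father's share of that layer.
father = {layer: [1.0, 1.0] for layer in range(NUM_LAYERS)}
mother = {layer: [0.0, 0.0] for layer in range(NUM_LAYERS)}
child = {
    layer: merge_layer(father[layer], mother[layer], mother_share(layer))
    for layer in range(NUM_LAYERS)
}

for layer in (0, 23, 35):
    print(layer, [round(v, 2) for v in child[layer]])
```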

Benchmarks
Darwin-4B-David scores 85.0% on GPQA Diamond (+26.4 percentage points over the original's 58.6%), evaluated generatively with maj@8 (8 generations per question, majority vote), Epoch AI prompt format, thinking mode enabled, 50 sampled questions. On ARC-Challenge (25-shot, loglikelihood), Darwin-4B-David and the original both score 64.93%, which is expected because loglikelihood scoring doesn't capture thinking-mode reasoning differences.
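
For readers unfamiliar with maj@8, the sketch below shows the general idea: sample several answers per question, keep the most common one, and score that against the gold answer. The function names and the toy data are illustrative, and the tie handling (the answer seen first wins) is an assumption rather than the evaluation's documented behavior.

```python
from collections import Counter


def majority_answer(samples: list[str]) -> str:
    """Most frequent answer among the samples; on a tie, the one seen first."""
    return Counter(samples).most_common(1)[0][0]


def maj_at_k_accuracy(generations: list[list[str]], gold: list[str]) -> float:
    """Fraction of questions whose majority answer matches the gold answer."""
    correct = sum(majority_answer(s) == g for s, g in zip(generations, gold))
    return correct / len(gold)


# Toy data: three questions, eight sampled answers each (not real model output).
generations = [
    ["A"] * 5 + ["B"] * 3,
    ["C", "C", "D", "D", "D", "D", "A", "B"],
    ["B"] * 8,
]
gold = ["A", "C", "B"]

print(f"maj@8 accuracy: {maj_at_k_accuracy(generations, gold):.2%}")
```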

Why This Matters
gemma-4-31B (30.7B) scores 84.3%. Darwin-4B-David surpasses it at 1/7th the size — no training, no RL, just 45 minutes of MRI-guided DARE-TIES on one H100. The name "David" honors Mother creator DavidAU and evokes David vs. Goliath.

reacted to SeaWolf-AI's post with 🔥 3 months ago

Post

4696

🌍 World Model Bench — does your world model actually think?

FID measures realism. FVD measures smoothness. But neither tells you whether the model understood the scene.

We just released WM Bench — the first benchmark for cognitive intelligence in world models. The core question: when a beast charges from 3 meters away, does the model know to sprint — not walk? Does it respond differently to a human vs an animal? Does it remember the left corridor was blocked two steps ago?

Those are cognitive questions. No existing benchmark asks them. So we built one.

3 Pillars · 10 Categories · 100 Scenarios · 1,000-point scale

- 👁 P1 Perception (25%) — Can it read the scene?
- 🧠 P2 Cognition (45%) — Does it predict threats, escalate emotions, utilize memory?
- 🔥 P3 Embodiment (30%) — Does the body respond with the right motion?
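
One way to read the pillar weights on a 1,000-point scale is as a simple weighted sum, sketched below. The linear aggregation, the function names, and the per-pillar fractions in the example are assumptions for illustration; the post does not spell out the scoring formula.

```python
SCALE = 1000  # total points, as stated in the post

# Pillar weights as stated in the post.
PILLAR_WEIGHTS = {"perception": 0.25, "cognition": 0.45, "embodiment": 0.30}


def pillar_caps() -> dict[str, int]:
    """Maximum points available per pillar under a linear weighting."""
    return {name: round(weight * SCALE) for name, weight in PILLAR_WEIGHTS.items()}


def total_score(fractions: dict[str, float]) -> int:
    """Weighted total, where each fraction is the share of a pillar earned (0 to 1)."""
    return round(SCALE * sum(PILLAR_WEIGHTS[p] * fractions[p] for p in PILLAR_WEIGHTS))


print(pillar_caps())

# Made-up per-pillar results, purely to show the arithmetic.
example = {"perception": 0.80, "cognition": 0.70, "embodiment": 0.75}
print(total_score(example))
```

Under this reading, Cognition alone accounts for up to 450 of the 1,000 points, so a model that looks realistic but reasons poorly about the scene cannot score well.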

All evaluation is via simple JSON I/O — no 3D engine, no special hardware. Any model with an API can participate.

We also built PROMETHEUS as a live reference implementation: it runs in your browser on a T4, no install needed. It combines FloodDiffusion motion generation with an LLM cognitive brain (Perceive → Predict → Decide → Act). It scored 726/1000 (Grade B) on Track C and is the only directly verified model so far. Submissions from other teams are very welcome.

---

🗂 Dataset → FINAL-Bench/World-Model
🌍 Demo → FINAL-Bench/World-Model
🏆 Leaderboard → FINAL-Bench/worldmodel-bench
📝 Article → https://huggingface.co/blog/FINAL-Bench/world-model

Part of the FINAL Bench Family — alongside FINAL Bench (Feb 2026). Feedback on rubrics and missing models always welcome!
