---
title: Mafia
emoji: 🕵️
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 6.18.0
python_version: 3.12.12
app_file: app.py
pinned: false
tags:
  - track:wood
  - sponsor:openai
  - sponsor:modal
  - achievement:welltuned
  - achievement:fieldnotes
---

# Mafia: On the design of social deduction reasoning agents and AI-native games

![Mafia table transition](./blog_assets/mafia_table_transition.gif)

**Play the Space:** [build-small-hackathon/mafia](https://huggingface.co/spaces/build-small-hackathon/mafia)

**Demo video:** [YouTube](https://youtu.be/aAsZYAHKZ9Q)

**Launch post:** [X / Twitter](https://x.com/alfxad/status/2066665429507137681?s=20)

**Model:** [mafia-gemma-4-12B-it](https://huggingface.co/build-small-hackathon/mafia-gemma-4-12B-it)

**GGUF:** [mafia-gemma-4-12B-it-gguf](https://huggingface.co/build-small-hackathon/mafia-gemma-4-12B-it-gguf)

**Dataset:** [mafia-dataset](https://huggingface.co/datasets/build-small-hackathon/mafia-dataset)

**Game repository:** [GitHub](https://github.com/Alfaxad/mafia)

Mafia is a seven-player social deduction game: two Mafia try to survive long enough to control the vote, while a Detective, Doctor, and three Villagers try to identify and eliminate them. The loop is simple. Night actions happen in private. Dawn reports the result. Day discussion produces accusations and defenses. The table votes. Repeat until all Mafia are gone or Mafia reach parity.

That simplicity makes Mafia a sharp testbed for AI agents. Good play requires hidden-state reasoning, memory, deception, timing, public argument, private objectives, and legal action control. A model that can write a plausible paragraph can still lose badly if it claims an impossible role, votes against its own evidence, leaks private knowledge, or talks at the wrong time.

This project builds an AI-native Mafia game in both senses of the term. AI helped build the game through code, design, and asset generation.
AI is also the game itself: six live agents play against one human, moderated by a non-player Time-to-Talk agent, with model-backed reasoning driving the table.

## The research context

Social deduction is useful because the game forces language and action to meet. A player has to speak, vote, accuse, defend, hide, reveal, and update beliefs under pressure.

Time-to-Talk studies asynchronous group communication in Mafia and shows that when an agent speaks matters, not only what it says. The key idea we used is a non-player moderator with a scheduler and generator: the scheduler decides whether to wait or give the floor, and the generator produces a role-hidden cue for one speaker.

![Time-to-Talk agent logic](./blog_assets/time_to_talk_agent_logic.png)

*Figure 1. Time-to-Talk's agent logic: a scheduler prompt decides whether to send or wait, and a generator prompt writes the message once the floor opens. We use the same scheduler-plus-generator pattern for our non-player moderator.*

Mini-Mafia isolates smaller role-play skills such as detection, disclosure, and deception. One of its strongest lessons is uncomfortable but useful: larger raw models don't automatically win. A smaller model with the right protocol can outperform a stronger model in a poorly matched setup.

![Mini-Mafia scores](./blog_assets/mini_mafia_scores.png)

*Figure 2. Mini-Mafia compares social deduction skills under controlled role settings. We used this work to separate model capacity from game protocol and action formatting.*

Bayesian Social Deduction and GRAIL make hidden-role tracking explicit through factor graphs, role counts, and belief propagation. That matters in Mafia because many losses come from impossible claims or votes that contradict public role evidence.

![GRAIL architecture overview](./blog_assets/grail_overview.png)

*Figure 3. GRAIL tracks hidden roles with a factor graph and uses inferred beliefs to guide action selection and message generation.
HOLY GRAIL keeps this role-count constraint layer, then adds social/deception ledgers and role-specific policies.*

WOLF and Wolf-Enhance focus on Werewolf-style deception, suspicion, and social ledgers. ReVAC contributes a compact review loop: objective, evidence, risk, alternatives, and final action. Those ideas became the basis for our agent architecture.

![WOLF deception framework](./blog_assets/wolf_diagram.png)

*Figure 4. WOLF-style work treats social deduction as memory, statements, hidden roles, and win-condition pressure. We use that framing for suspicion, deception, and vote-pressure ledgers.*

![Wolf-Enhance framework](./blog_assets/wolf_enhance_framework.png)

*Figure 5. Wolf-Enhance separates listener, thinker, and presenter behavior. This influenced the split between private review, action scoring, and public message generation in HOLY GRAIL.*

## Mafia Gemma

Mafia Gemma is a fine-tuned `gemma-4-12B-it` variant trained for Mafia play. The goal wasn't to make a general chat model. The goal was narrower: legal game actions, role-conditioned choices, claim tracking, deception-aware discussion, night actions, votes, and private/public information discipline.

The training corpus is [mafia-dataset](https://huggingface.co/datasets/build-small-hackathon/mafia-dataset). It unifies Mini-Mafia, LLMafia, Bayesian/GRAIL Avalon traces, werewolf-derived data converted into Mafia-compatible event logs, and our own seven-player harness games. Every row is converted into a canonical schema with role, public transcript, private view, legal actions, selected action, quality labels, and leakage checks.

The resulting model ships in two forms: BF16 for ZeroGPU/Modal-style full precision inference, and Q8 GGUF for cheaper llama.cpp-style deployment. In the benchmark below, we report the family as one model: `mafia-gemma-4-12B-it`.
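To make the canonical schema from this section more concrete, here is a minimal sketch of what one row and its checks could look like. The `CanonicalRow` class, its field names, the action strings, and the `validate_row` helper are all hypothetical: they mirror the schema items listed above but are not the dataset's actual column names or validation code. The leakage check is a deliberately naive substring test, assumed only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CanonicalRow:
    """Hypothetical shape of one training row; field names are assumptions."""
    role: str                      # e.g. "detective", "mafia"
    public_transcript: list[str]   # messages every player can see
    private_view: list[str]        # facts only this player knows
    legal_actions: list[str]       # actions the engine would accept
    selected_action: str           # the action used as the training target
    quality_labels: dict[str, float] = field(default_factory=dict)


def validate_row(row: CanonicalRow) -> list[str]:
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    # Legality: the target action must be one the engine offered.
    if row.selected_action not in row.legal_actions:
        problems.append("selected_action is not in legal_actions")
    # Naive leakage check: a private fact must not appear verbatim in public text.
    public_text = " ".join(row.public_transcript).lower()
    for fact in row.private_view:
        if fact.lower() in public_text:
            problems.append(f"private fact leaked: {fact!r}")
    return problems


# Illustrative rows (invented for this sketch, not taken from the dataset).
clean = CanonicalRow(
    role="detective",
    public_transcript=["P3: I think P5 is dodging questions."],
    private_view=["P5 is mafia"],
    legal_actions=["vote:P2", "vote:P5", "abstain"],
    selected_action="vote:P5",
)
leaky = CanonicalRow(
    role="detective",
    public_transcript=["P3: my check says P5 is Mafia."],
    private_view=["P5 is mafia"],
    legal_actions=["vote:P2", "vote:P5"],
    selected_action="kill:P5",
)

print(validate_row(clean))
print(validate_row(leaky))
```

The first row passes both checks, while the second is flagged twice: once for an action the engine never offered and once for a private fact that appears in the public transcript. A real leakage check would likely need role-aware rules, since a Detective may choose to reveal a result on purpose.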
## HOLY GRAIL

The player agent architecture is **HOLY GRAIL**: **Hierarchical Objective-guided Ledgered Yield-aware Graph Reasoning Agent Informed through Language**.

The name is literal. The agent starts from its role objective, updates structured ledgers, applies graph-style role constraints, considers evidence and risk, and only then emits a legal action or short public message. Mafia, Detective, Doctor, and Villager use the same architecture, but the role policy changes after assignment.

![HOLY GRAIL primary architecture diagram](./blog_assets/holy_grail_primary_architecture.png)

*Figure 6. HOLY GRAIL combines current observations, structured memory, ReVAC-style review, GRAIL-style role constraints, WOLF-style social/deception signals, role-adaptive policies, legal action scoring, and message guardrails.*

The most important implementation decision was to make target actions architectural, not purely model-generated. Votes, Mafia kills, Detective checks, and Doctor saves are scored by the architecture and emitted as legal JSON. The model still writes language, but the controller prevents many failure modes: impossible claims, weak herd votes, leaking private information, and ignoring public Detective evidence.

## The moderator is an agent too

The moderator is not a player. It assigns roles, gates phases, controls floor timing, asks for actions, and keeps hidden information private. We use a Time-to-Talk scheduler-plus-generator pattern: the scheduler decides whether to wait or send, and the generator picks a living player and a neutral cue. In the live game, the moderator also accepts human messages and schedules them into the public discussion.

![Time-to-Talk moderator pipeline](./blog_assets/moderator_ttt_pipeline.png)

*Figure 7. The non-player moderator keeps the game moving without becoming an eighth player.*

## What the benchmarks say

We ran full seven-player games with classic roles and win conditions.
The same non-player moderator was fixed to base Gemma 4 12B BF16. Player architectures and models changed across experiments.

### Mafia Gemma against frontier models

This table combines BF16 and Q8 results under the single model family `mafia-gemma-4-12B-it`. All rows use HOLY GRAIL.

| Setting | Wins | Games | Win rate |
| --- | ---: | ---: | ---: |
| Local side as Mafia | 4 | 10 | 40% |
| Local side as Town | 7 | 10 | 70% |
| Local side overall | 11 | 20 | 55% |
| All player slots, team win rate | 48 | 78 | 61.5% |

The local model was strongest as Town. It beat frontier-controlled Mafia in 7 of 10 pairwise games, including wins against GPT-5-mini, Claude Opus 4.8, Claude Sonnet 4.6, and Gemini depending on precision/runtime. Mafia-side play was harder: it won 4 of 10, with the Q8 runtime surprisingly strong in some Mafia-side rows.

### Mixed-table model summary

| Model | Slots | Team win rate | Alive final | Messages | Votes | Claims | False claims |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| mafia-gemma-4-12B-it | 78 | 0.615 | 0.538 | 1.333 | 1.936 | 0.282 | 0.013 |
| GPT-5 medium | 18 | 0.611 | 0.722 | 1.167 | 1.778 | 0.222 | 0.056 |
| GPT-5-mini | 18 | 0.444 | 0.389 | 1.444 | 1.944 | 0.333 | 0.000 |
| Claude Opus 4.8 | 18 | 0.611 | 0.444 | 1.500 | 2.167 | 0.389 | 0.000 |
| Claude Sonnet 4.6 | 18 | 0.111 | 0.333 | 1.278 | 1.667 | 0.222 | 0.000 |
| Gemini 2.5 Pro OSV | 18 | 0.889 | 0.556 | 1.500 | 2.222 | 0.167 | 0.000 |

These numbers are not a universal leaderboard. They are full-game samples under one protocol and one game harness. Still, the result is the point: architecture can let a 12B local model compete in a table with frontier models, especially when the task rewards disciplined public evidence and legal action control.

### Architecture comparison

Same-model games compare HOLY GRAIL v4 against ReVAC and GRAIL. Each cell uses one model for all seven players, same seeds, same moderator, same rules.

| Model | Architecture | Games | Good wins | Good win rate | Good vote accuracy | Avg LLM calls |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| Mafia Gemma BF16 | HOLY GRAIL v4 | 3 | 3 | 1.000 | 0.678 | 38.0 |
| Mafia Gemma BF16 | ReVAC | 3 | 0 | 0.000 | 0.079 | 63.7 |
| Mafia Gemma BF16 | GRAIL | 3 | 0 | 0.000 | 0.116 | 39.7 |
| GPT-5-mini | HOLY GRAIL v4 | 3 | 2 | 0.667 | 0.557 | 35.7 |
| GPT-5-mini | ReVAC | 3 | 1 | 0.333 | 0.283 | 72.0 |
| GPT-5-mini | GRAIL | 3 | 0 | 0.000 | 0.485 | 47.7 |
| Claude Opus 4.8 | HOLY GRAIL v4 | 3 | 2 | 0.667 | 0.611 | 29.3 |
| Claude Opus 4.8 | ReVAC | 3 | 2 | 0.667 | 0.663 | 71.3 |
| Claude Opus 4.8 | GRAIL | 3 | 1 | 0.333 | 0.455 | 41.7 |

HOLY GRAIL made the largest difference for Mafia Gemma and GPT-5-mini. Claude Opus was the harder case: ReVAC matched its win rate and had slightly higher good-side vote accuracy, but HOLY GRAIL used far fewer calls and still beat GRAIL.

## How the game was built

The live Space uses a Gradio Server backend and a custom frontend adapted from an OpenGame-generated prototype. OpenGame gave us a fast path to cinematic state transitions, room setup, role reveal, table layout, voting, and endgame presentation. We then replaced the prototype backend with the Mafia engine, ZeroGPU model calls, private/public views, event logs, the Time-to-Talk moderator, and HOLY GRAIL agents.

![OpenGame generated demo still](./blog_assets/opengame_demo_still.png)

*Figure 8. A still from the OpenGame-generated demo process. OpenGame helped bootstrap the visual direction; the final Space wires the generated interface to our own engine, agents, and ZeroGPU backend.*

The game architecture follows one rule: the backend owns truth, the frontend owns presentation. The browser never receives hidden roles except the human player's own role and legal private results, such as a Detective investigation.

![Hugging Face Space runtime architecture](./blog_assets/hf_space_runtime_architecture.png)

*Figure 9.
The Space keeps hidden game state server-side, projects only legal views to the browser, and calls ZeroGPU for player and moderator reasoning.*

During research and evaluation we used Modal for training and controlled full-game runs. The production Space uses ZeroGPU, but the same split remains: game loop, moderator, agents, model backend, logs, and metrics.

![Modal fine-tuning pipeline](./blog_assets/modal_finetuning_pipeline.png)

*Figure 10. Fine-tuning pipeline for Mafia Gemma: canonical SFT data, Modal training, cached volumes, W&B tracking, evaluation, merged BF16 export, and Q8 GGUF export.*

![Modal inference runtime](./blog_assets/modal_inference_runtime.png)

*Figure 11. Modal inference runtime used during evaluation: BF16 Mafia Gemma, base Gemma moderator, Q8 GGUF fallback, provider routing, secrets, caches, and response metrics.*

![Full-game evaluation pipeline](./blog_assets/full_game_evaluation_pipeline.png)

*Figure 12. Full-game evaluation pipeline: scenarios and model providers feed the Mafia harness, Time-to-Talk moderator, agent architectures, legal validators, game statistics, raw games, summary CSVs, and markdown reports.*

## What this means for AI-native games

Most AI game demos put a model near the game. This project puts the model inside the game loop. The player experience depends on agents remembering claims, choosing targets, responding to pressure, hiding or revealing information, and following a moderator's timing.

That changes the design problem. You don't only tune prompts. You design protocols, state views, action validators, memory, latency, visual feedback, and failure recovery. A social deduction game feels alive only when the AI is accountable to the rules and expressive at the table.

The most useful lesson is that model capability and architecture are coupled.
Raw model strength helps, but the best Mafia agent needs structure: role objectives, private/public boundaries, social ledgers, belief constraints, timing control, and legal actions. Without those, even strong models drift.

## References

- Time-to-Talk: [project](https://niveck.github.io/Time-to-Talk/) and [LLMafia dataset](https://huggingface.co/datasets/niveck/LLMafia).
- Mini-Mafia: [paper](https://arxiv.org/pdf/2509.23023) and [repository](https://anonymous.4open.science/r/llm-mafia-game-5914/README.md).
- WOLF: [paper](https://arxiv.org/pdf/2512.09187) and [repository](https://github.com/MrinalA2009/WOLF-Werewolf-based-Observations-for-LLM-Deception-and-Falsehoods).
- Bayesian Social Deduction / GRAIL: [paper](https://arxiv.org/pdf/2506.17788), [dataset](https://huggingface.co/datasets/shahabrahimirad/bayesian-social-deduction), [repository](https://github.com/shahabrrad/Bayesian-Avalon), and [project page](https://camp-lab-purdue.github.io/bayesian-social-deduction/).
- Wolf-Enhance: [paper](https://arxiv.org/pdf/2402.02330) and [repository](https://github.com/boluoweifenda/werewolf).
- ReVAC: [paper](https://arxiv.org/pdf/2604.19523) and [repository](https://github.com/mihiraryaa/mindgames_NeurIPS2025).
- OpenGame: [repository](https://github.com/leigest519/OpenGame).

## Self-review

- Contribution: The post states the game, model, dataset, architecture, and Space deployment clearly.
- Writing clarity: Each section has one job, and tables carry the numerical claims.
- Experimental strength: Claims are tied to the two benchmark reports used during development.
- Evaluation completeness: The post notes that the numbers are full-game samples, not universal rankings.
- Method design soundness: The architecture claims are grounded in the controller layers and action validation design.