Read The Room
Play AI-guided social scenario role‑plays
This post is about what it took to make that work.
I started by trying to one-shot the problem - feeding the description above, plus some additional technical requirements, into Claude Code/Codex. After some back-and-forth prompting it worked - you could talk to the characters and they would react. But it was not fun. It was a generic AI experience - the social situations felt uninspired and the conversation read like AI slop. The characters all pulled in the same direction, so there was no conflict. My choices as a player didn't seem to matter much.
After looking at the traces, I realized the issue - the game felt generic because every character was generic: each one did too much, and there was no clear separation of concerns. For example, a character would say a line and then also decide how that line affected the others. It doesn't work like that in the real world - you say something dumb, and the others decide for themselves what they think of it.
So I spent most of my time fixing that. My solution was to separate concerns - deciding who does what and enforcing it through DSPy signatures. Every model call is a typed contract that can only produce its own narrow output.
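The idea of a narrow, typed contract per call can be sketched without any model at all. The version below is a library-free stand-in using plain dataclasses, not the game's actual DSPy code; the class names, field names, and the `parse_contract` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass(frozen=True)
class SpokenLine:
    """What a speaking character may return: a private thought and one line."""
    reasoning: str
    line: str


@dataclass(frozen=True)
class DispositionUpdate:
    """What a listening character may return: its own new stance, nothing else."""
    stance: str
    delta: str


@dataclass(frozen=True)
class FinaleVerdict:
    """What the finale call may return: whether the scene is over, and how."""
    is_over: bool
    outcome: str


def parse_contract(contract, raw: dict):
    """Build `contract` from a raw model output, rejecting anything off-contract."""
    expected = get_type_hints(contract)  # field name -> declared type
    if set(raw) != set(expected):
        raise ValueError(
            f"{contract.__name__}: expected {sorted(expected)}, got {sorted(raw)}"
        )
    for name, typ in expected.items():
        if not isinstance(raw[name], typ):
            raise TypeError(f"{contract.__name__}.{name} must be {typ.__name__}")
    return contract(**raw)
```

A speaker that tries to also return a `stance` for someone else fails validation instead of quietly leaking into another role's job. In DSPy itself, each contract would be a `dspy.Signature` subclass whose fields are declared with `dspy.InputField()` and `dspy.OutputField()`.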
Early on, the finale and the situation driver were a single call - but that either didn't create enough friction or made the game go on forever.
Each turn, only a few of the characters speak - there is no need to overwhelm the player with text. I know it's a text-based game, but no one wants to be deluged with info. The question is which characters should speak. I tried an explicit character-selection LLM call, but its choices felt too obvious - so I made the selection semi-random as a way to introduce more friction.
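One way a semi-random pick could look is sketched below. The rule used here (anyone the player addressed directly always speaks, the remaining slots are filled at random) is an assumption for illustration, as are the function and parameter names; the post does not describe the actual rule.

```python
import random


def pick_speakers(cast, addressed=(), max_speakers=2, rng=None):
    """Pick who speaks this turn: addressed characters first, then random fill."""
    rng = rng or random.Random()
    # Assumption: characters the player addressed directly are always included.
    forced = [c for c in cast if c in addressed][:max_speakers]
    # Everyone else competes for the remaining slots by pure chance.
    pool = [c for c in cast if c not in forced]
    k = min(max_speakers - len(forced), len(pool))
    return forced + rng.sample(pool, k)
```

Passing a seeded `random.Random` makes a playthrough reproducible, which helps when comparing traces between runs.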
The most interesting part is the disposition matrix: every character holds a private, free-text stance toward the player, toward each other character, and toward themselves. Each entry also carries a free-text delta describing how that opinion changes across turns - and all of this feeds into the scene.
The game got fun roughly in proportion to how little each model call was allowed to do. Asking one call to do too many things at once leads to a generic experience.
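As a data structure, such a matrix could be as simple as a dictionary keyed by (holder, target) pairs. The sketch below is an assumption about the shape, not the game's actual storage; `Stance`, `PLAYER`, and the helper names are hypothetical.

```python
from dataclasses import dataclass

PLAYER = "player"


@dataclass
class Stance:
    text: str = ""   # free-text opinion the holder has of the target
    delta: str = ""  # free-text note on how that opinion just changed


def new_matrix(characters):
    """One entry per (holder, target): the player, every other character, and self."""
    return {(h, t): Stance() for h in characters for t in [PLAYER, *characters]}


def update_stance(matrix, holder, target, text, delta):
    """Overwrite one cell; only the holder's own call should ever do this."""
    if (holder, target) not in matrix:
        raise KeyError(f"unknown pair: {holder} -> {target}")
    matrix[(holder, target)] = Stance(text, delta)


def private_view(matrix, holder):
    """The only slice a character sees: its own row, nobody else's."""
    return {t: s for (h, t), s in matrix.items() if h == holder}
```

For three characters this gives 3 x 4 = 12 cells, and each character's prompt would be built from its own four-cell row only.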
I experimented with a reasoning model, but its hidden thinking was a problem, and its length varies too much between model families. So I turned hidden thinking off and gave every character an explicit reasoning output field instead: one or two sentences of private, in-character thought before the spoken line. That made debugging easier too, as I could see what went into each character's decision and what they were thinking before they said something. They felt like actual characters, written by humans.
One issue that I battled with was pronouns. For immersion, the narration addresses the player as "you". However, the characters got confused when they saw the narration - they thought "you" meant them.
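The split between private thought and spoken line can be sketched as two views of the same turn record: one for the player, one for the debugging trace. The `CharacterTurn` type and both render functions are hypothetical names, not taken from the game.

```python
from dataclasses import dataclass


@dataclass
class CharacterTurn:
    name: str
    reasoning: str  # one or two sentences of private, in-character thought
    line: str       # what the character actually says out loud


def render_for_player(turn: CharacterTurn) -> str:
    """The player only ever sees the spoken line."""
    return f"{turn.name}: {turn.line}"


def render_for_trace(turn: CharacterTurn) -> str:
    """The trace keeps the thought next to the line it produced."""
    return f"[{turn.name} thinks] {turn.reasoning}\n{turn.name}: {turn.line}"
```

Because the thought is an ordinary output field rather than hidden model reasoning, its length is bounded by the prompt and it shows up in every trace.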
Playtesting was the genuinely hard part, harder than any engine code. There's no terminal-bench for "I did something genuinely smart and I believe I should have been rewarded for it".
I built some very basic benchmarks where it is obvious what should happen. For example, you swear at a character who doesn't like it - he thinks less of you. Or you do something that accomplishes the goal - the finale says so. Basically, I covered every really obvious scenario I could think of. These did regress as I was tweaking the scenarios, so they were a very helpful way to catch breakage without playtesting garbage.
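A harness for such obvious cases might look like the sketch below. Everything in it is an assumption: the result keys `opinion_change` and `finale_reached`, the case inputs, and `fake_play_turn`, which stands in for the real engine so the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ObviousCase:
    name: str
    player_input: str
    check: Callable[[dict], bool]  # receives the turn result, returns pass/fail


CASES = [
    ObviousCase("swearing lowers opinion", "<player swears at the character>",
                lambda r: r["opinion_change"] == "worse"),
    ObviousCase("reaching the goal ends the scene", "<player accomplishes the goal>",
                lambda r: r["finale_reached"] is True),
]


def failing_cases(play_turn, cases):
    """Run every case through the engine and return the names that failed."""
    return [c.name for c in cases if not c.check(play_turn(c.player_input))]


def fake_play_turn(player_input: str) -> dict:
    """Hypothetical stand-in for the real engine, keyed on the input text."""
    return {
        "opinion_change": "worse" if "swears" in player_input else "same",
        "finale_reached": "accomplishes" in player_input,
    }
```

With a real model the opinion change is free text, so the check would need either a structured field like the one assumed here or a separate judging step.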
Most of the heavy lifting in the validation was me playtesting. It was actually fun and I enjoyed it! Obviously, it would have been better if I had friends helping me, but building something and making it playtestable at hackathon pace is hard. By the time I had something to share, the deadline was already too close for feedback.
I iterated locally with llama.cpp and Qwen3.6-27B, and used Modal to release on Gemma4-31B, which benchmarks better on creative writing. I had some issues with the cold-start problem, but Modal was kind enough to provide enough credits for this hackathon. I didn't benchmark models for this game specifically - I just didn't have the time, but it's probably worth doing.
The UI is the part I would have liked to spend more time on - if only it hadn't taken me forever due to my lack of UI/UX experience. The two choices I'll defend: no visible friendship meters anywhere, and an end screen that finally opens the disposition matrix - how every mind ended. During the game, you only see dialogue.
There are two built-in scenarios - I found them to work better than others. For example, I experimented with High School Reunion, which kind of worked, and Shark Tank, which didn't - the latter required making up facts and metrics. However, I focused on the engine, so adding scenarios should work for anyone who wants to try - I'm certain that more creative scenarios exist, it's just a matter of playing with it.