mAIndlock
Escape room where every NPC is a mortal mind of tiny LLMs
mAIndlock is an escape room where every NPC is a hierarchy of six small models pretending to be one brain. A room holds several characters; a session can have two dozen of these minds thinking at once. The whole thing runs on a laptop with the Wi-Fi off — no cloud API, no key, no network at all.
This post is the engineering side: the model stack, the runtime, how a "lifespan" is honest accounting rather than a counter I made up, and the unglamorous parts (yes, it's slow on a free CPU, and I'll tell you exactly how slow).
A single line from the player fans out into a cascade:
player line
  │
  ├─▶ amygdala     THREAT 0–10                           ┐
  │     └─(if threatened) rumination ×1–3, burning life  │ each a real
  ├─▶ hippocampus  MEMORY ± LEAN                         │ small-model call,
  ├─▶ striatum     REWARD −5..+5                         │ with its own
  ├─▶ ACC          WORTH yes/no                          ┘ logits
  │
  ├─▶ vmPFC (deterministic)  value −10..+10
  │
  ├─▶ relationship  rapport += f(value, tone, substance)
  │
  └─▶ dlPFC (Nemotron)  speaks the reply in character
Four sensing calls on MiniCPM 1B (OpenBMB), one voice call on Nemotron 3 Nano 4B (NVIDIA), and a deterministic integrator that costs zero tokens. A calm turn is ~5 model calls; a hostile one is 8+, because the amygdala ruminates under threat. The call count is the point — more on that below.
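
To make the call arithmetic concrete, here is a minimal sketch of the cascade as a call counter. It is not the project's code: `run_turn`, `TurnTrace`, `threat_threshold`, the prompt strings, and the rule that maps a threat score to 1–3 rumination loops are all assumptions for illustration, and the two models are stand-in callables.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a real small-model call: prompt in, text out.
ModelCall = Callable[[str], str]


@dataclass
class TurnTrace:
    calls: int = 0  # number of model calls made during this turn


def run_turn(line: str, sense: ModelCall, voice: ModelCall,
             threat_threshold: int = 6) -> TurnTrace:
    trace = TurnTrace()

    def call(model: ModelCall, prompt: str) -> str:
        trace.calls += 1
        return model(prompt)

    threat = int(call(sense, f"THREAT 0-10: {line}"))
    if threat >= threat_threshold:
        # Assumed mapping: higher threat means more rumination loops, capped at 3.
        loops = min(3, 1 + (threat - threat_threshold) // 2)
        for _ in range(loops):
            call(sense, f"RUMINATE: {line}")
    call(sense, f"MEMORY: {line}")
    call(sense, f"REWARD -5..+5: {line}")
    call(sense, f"WORTH yes/no: {line}")
    # The vmPFC integration would sit here; it is arithmetic, so no call is counted.
    call(voice, f"REPLY: {line}")
    return trace
```

With a stub that always rates threat low, this counts 5 calls; with a stub that rates threat at 10, it counts 8, which matches the calm and hostile figures above under these assumptions.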
| Job | Model | Why this one |
|---|---|---|
| Sensing regions (threat, memory, habit, cost) | MiniCPM 1B (OpenBMB) | small enough to run four per turn; sharp enough to rate one dimension |
| Executive voice (dlPFC) | Nemotron 3 Nano 4B (NVIDIA) | hybrid Mamba-Transformer built for agentic reasoning on a tight token budget |
| Spoken voice — story lines & demo | VoxCPM2 (OpenBMB) | offline TTS so the key-handover lines and trailer actually speak |
| Department fine-tune | MiniCPM-V 4.6 + LoRA | distilled to make a 1B region separate sincere from cruel instead of flatlining |
| Integration (vmPFC) | deterministic value network | reliability over a model that would drift |
Total weights are ≤ 5.3B. Nemotron 3 Nano earns the "prefrontal cortex" seat specifically because it's tuned to think cheaply — which matters when thinking is the resource a character can run out of.
On the laptop the regions run on Ollama; in the Space they run on llama.cpp directly, serving GGUF weights pulled from the Hub at startup (the four sensing regions on MiniCPM, the voice on Nemotron, both quantized). Flip on airplane mode and every mind keeps thinking — there's no remote call to lose.
Two deployment details that cost me real time and might save you some:
Don't compile llama-cpp-python in the Space build. It's a 30-minute compile and a build timeout waiting to happen. I pull the prebuilt llama-server binary from the ggml-org releases instead (a ~5 MB .tar.gz) and cache it.
The custom front is served at / with a Gradio block mounted at /about via mount_gradio_app. Going through sdk: gradio doesn't work here, because the SDK wrapper grabs port 7860 itself and collides with uvicorn ("address already in use"). That mounted-Gradio custom front is what earns the Off-Brand lane.
Every mind starts with 1000 thinking tokens of life. This is not a decorative counter: it's read straight from the runtime. Each region's generated tokens come from eval_count (Ollama) or usage.completion_tokens (llama.cpp), and life burned = the sensing cascade's generated tokens. When a character is afraid, the amygdala loops, those extra calls generate real tokens, and the life total drops for nothing, because rumination doesn't move the decision.
One deliberate asymmetry: the dlPFC's spoken tokens are shown but not charged. The mouth is not the mind. What costs a life is thinking, not talking.
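
A minimal sketch of that accounting, assuming each model response has already been parsed into a dict. The `eval_count` and `usage.completion_tokens` fields are the ones named above; `generated_tokens`, `life_after_turn`, and `START_LIFE` are hypothetical names, not the project's.

```python
START_LIFE = 1000  # thinking tokens every mind starts with


def generated_tokens(response: dict) -> int:
    """Read the generated-token count from either runtime's response shape."""
    if "eval_count" in response:  # Ollama
        return int(response["eval_count"])
    usage = response.get("usage") or {}  # llama.cpp server, OpenAI-style body
    return int(usage.get("completion_tokens", 0))


def life_after_turn(life: int, sensing: list[dict], voice: dict) -> tuple[int, int]:
    """Charge only the sensing cascade; report the voice tokens without charging them."""
    burned = sum(generated_tokens(r) for r in sensing)
    spoken = generated_tokens(voice)  # shown in the UI, never subtracted
    return max(0, life - burned), spoken
```

The asymmetry lives in one line: `spoken` is returned for display but never enters the subtraction.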
At each quarter of life lost, the engine burns one biography fragment permanently (_burn_memories)
— the hippocampus loses access to it, and the Forgotten panel shows what's gone. The core
identity holds until death; the periphery erodes first. At zero, the mind is dark for the rest of
the run. There's no reload.
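
The erosion rule can be sketched as follows. This is an illustration, not the project's `_burn_memories`: the `Mind` class, its fields, and the convention that the list is ordered core first and periphery last are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Mind:
    max_life: int = 1000
    life: int = 1000
    # Assumed ordering: memories[0] is the core identity, the tail is the periphery.
    memories: list[str] = field(default_factory=list)
    forgotten: list[str] = field(default_factory=list)
    quarters_burned: int = 0

    def spend(self, tokens: int) -> None:
        """Charge thinking tokens, then burn one fragment per quarter newly lost."""
        self.life = max(0, self.life - tokens)
        quarters_lost = (self.max_life - self.life) * 4 // self.max_life
        while self.quarters_burned < quarters_lost:
            self.quarters_burned += 1
            if len(self.memories) > 1:  # the core fragment is never burned
                self.forgotten.append(self.memories.pop())

    @property
    def dead(self) -> bool:
        return self.life == 0
```

Tracking `quarters_burned` separately from `life` means a single expensive turn that crosses two quarter marks burns two fragments, and nothing is ever restored.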
Because the regions are local, I read each one's conviction — 1 − normalized token entropy
over its top-k alternatives (src/mindlock/backend.py::_conviction) — and surface it in the skull
panel. It's a small thing that's quietly impossible on a hosted chat endpoint: the API gives you
text, not the distribution behind it. Running the model yourself is the whole reason the skull can
show you how sure the amygdala was, not just what it said.
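
For a single generated token, the formula can be sketched like this. It assumes the runtime hands back the log-probabilities of the top-k candidates and renormalizes over just those k; how the real `_conviction` aggregates across several tokens is not stated here, so this sketch stops at one token.

```python
import math


def conviction(top_logprobs: list[float]) -> float:
    """1 minus the normalized entropy of the renormalized top-k distribution."""
    k = len(top_logprobs)
    if k < 2:
        return 1.0  # a single candidate carries no uncertainty to measure
    # Softmax over the top-k log-probs, shifted by the max for numerical stability.
    peak = max(top_logprobs)
    weights = [math.exp(lp - peak) for lp in top_logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Dividing by log(k) maps entropy to [0, 1]; clamp away float round-off.
    return min(1.0, max(0.0, 1.0 - entropy / math.log(k)))
```

A uniform spread over the candidates gives a conviction of 0, and a distribution with nearly all its mass on one candidate gives a value close to 1.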
Death, the key, rapport, and the burn are deterministic game state. The models generate — they never get to fabricate the outcome. A 4B voice can decorate the story's spine but cannot rewrite who lives or whether the key turns. That boundary is what makes the stakes real instead of improv, and it's why the vmPFC is arithmetic: the displayed value and the actual outcome are the same object.
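
One way to picture that boundary is a pair of pure functions that the voice model's text never reaches. Everything specific here is invented for illustration: the `Signals` fields' encodings, the weights inside `integrate`, and the `key_turns` threshold are assumptions, not the game's actual arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    threat: int       # 0..10, from the amygdala
    memory_lean: int  # -1, 0, or +1, from the hippocampus (assumed encoding)
    reward: int       # -5..+5, from the striatum
    worth: bool       # yes/no, from the ACC


def clamp(x: int, lo: int, hi: int) -> int:
    return max(lo, min(hi, x))


def integrate(s: Signals) -> int:
    """Deterministic value in -10..+10; the weights are hypothetical."""
    raw = s.reward + 2 * s.memory_lean - s.threat // 2 + (2 if s.worth else -2)
    return clamp(raw, -10, 10)


def key_turns(rapport: int, alive: bool, threshold: int = 20) -> bool:
    """Outcome gate: depends only on game state, never on generated text."""
    return alive and rapport >= threshold
```

Because `integrate` is plain arithmetic, the number shown in the UI and the number that drives the outcome come from the same call, which is the property the paragraph above describes.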
On the Space's free CPU (2 vCPU), a warm turn is ~35 seconds and a cold start is ~70. That is the real cost of running seven small-model calls per turn on a shared CPU box, and I'm not going to pretend otherwise. Two things make it livable:
The published deliberation traces (the Sharing-is-Caring dataset) are these same cascades recorded to JSONL — one row per turn, every region's signal, token count, and conviction — so you can read a mind think without running anything at all.
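
A rough sketch of what writing and reading such a trace could look like. The field names (`turn`, `regions`, `signal`, `tokens`, `conviction`) and the sample values are made up for illustration; the real schema is the one documented with the dataset.

```python
import io
import json


def append_turn(fh, turn: int, regions: dict) -> None:
    """Write one turn as a single JSON line."""
    fh.write(json.dumps({"turn": turn, "regions": regions}) + "\n")


def read_turns(fh) -> list[dict]:
    """Parse a JSONL stream back into one dict per turn."""
    return [json.loads(line) for line in fh if line.strip()]


def tokens_spent(row: dict) -> int:
    """Sum the token counts recorded for every region in a row."""
    return sum(r["tokens"] for r in row["regions"].values())


# In-memory round trip with made-up numbers; a real trace would be a file on disk.
buf = io.StringIO()
append_turn(buf, 1, {
    "amygdala": {"signal": 7, "tokens": 14, "conviction": 0.82},
    "striatum": {"signal": -2, "tokens": 9, "conviction": 0.61},
})
buf.seek(0)
rows = read_turns(buf)
```

One row per line keeps the file appendable during a run and readable with nothing but a JSON parser afterwards.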
docs/ARCHITECTURE.md
docs/TRACE.md
Built small for the Hugging Face × Gradio hackathon (Thousand Token Wood track). Minds by OpenBMB MiniCPM and NVIDIA Nemotron, on llama.cpp, fully offline. Original soundtrack by DjinAscet.