Twenty-four minds on one laptop, in airplane mode: the engineering behind mAIndlock

Community Article Published June 13, 2026


https://youtu.be/zEwnVR1kTZU

mAIndlock is an escape room where every NPC is a hierarchy of six small models pretending to be one brain. A room holds several characters; a session can have two dozen of these minds thinking at once. The whole thing runs on a laptop with the Wi-Fi off — no cloud API, no key, no network at all.

This post is the engineering side: the model stack, the runtime, how a "lifespan" is honest accounting rather than a counter I made up, and the unglamorous parts (yes, it's slow on a free CPU, and I'll tell you exactly how slow).

One NPC turn is seven model calls

A single line from the player fans out into a cascade:

player line
   │
   ├─▶ amygdala      THREAT 0–10        ┐
   │     └─(if threatened) rumination ×1–3, burning life     │ each a real
   ├─▶ hippocampus   MEMORY ± LEAN      │ small-model call,
   ├─▶ striatum      REWARD −5..+5      │ with its own
   ├─▶ ACC           WORTH yes/no       ┘ logits
   │
   ├─▶ vmPFC (deterministic)   value −10..+10
   │
   ├─▶ relationship  rapport += f(value, tone, substance)
   │
   └─▶ dlPFC (Nemotron)  speaks the reply in character


Four sensing calls on MiniCPM 1B (OpenBMB), one voice call on Nemotron 3 Nano 4B (NVIDIA), and a deterministic integrator that costs zero tokens. A calm turn is ~5 model calls (four sensing, one voice); a hostile one is 8+, because the amygdala ruminates under threat and every rumination pass is another call. The call count is the point; more on that below.
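To make the call arithmetic concrete, here is a minimal sketch of one turn with stubbed regions. It is not the game's code: `run_turn`, the threat threshold, the rumination count, and the integrator weights are all assumptions for illustration. Only the order of the regions and the "rumination burns calls without changing the decision" rule come from the cascade above.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins: a sensing region maps the player line to an integer
# signal, the voice maps it to a reply. Real regions would call the local runtime.
Region = Callable[[str], int]
Voice = Callable[[str], str]


@dataclass
class TurnResult:
    value: int        # integrated value, clamped to -10..+10
    model_calls: int  # number of model calls made this turn
    reply: str


def clamp(x: int, lo: int, hi: int) -> int:
    return max(lo, min(hi, x))


def run_turn(line: str, amygdala: Region, hippocampus: Region,
             striatum: Region, acc: Region, voice: Voice,
             threat_threshold: int = 6, rumination_loops: int = 2) -> TurnResult:
    calls = 0
    threat = amygdala(line)
    calls += 1
    if threat >= threat_threshold:          # assumed trigger for rumination
        for _ in range(rumination_loops):
            amygdala(line)                  # result ignored: it costs a call, nothing more
            calls += 1
    memory = hippocampus(line)
    calls += 1
    reward = striatum(line)
    calls += 1
    worth = acc(line)
    calls += 1
    # Deterministic integration: plain arithmetic, zero tokens. Weights are made up.
    value = clamp(memory + reward + (2 if worth else -2) - threat // 2, -10, 10)
    reply = voice(line)
    calls += 1
    return TurnResult(value, calls, reply)


calm = run_turn("hello", lambda s: 1, lambda s: 1, lambda s: 2,
                lambda s: 1, lambda s: "Hm.")
hostile = run_turn("open up", lambda s: 9, lambda s: 1, lambda s: 2,
                   lambda s: 1, lambda s: "No.", rumination_loops=3)
print(calm.model_calls, hostile.model_calls)
```

With these stubs the calm turn makes five calls and the hostile turn with three rumination passes makes eight.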

Five model families, each doing one job

| Job | Model | Why this one |
|---|---|---|
| Sensing regions (threat, memory, habit, cost) | MiniCPM 1B (OpenBMB) | small enough to run four per turn; sharp enough to rate one dimension |
| Executive voice (dlPFC) | Nemotron 3 Nano 4B (NVIDIA) | hybrid Mamba-Transformer built for agentic reasoning on a tight token budget |
| Spoken voice (story lines & demo) | VoxCPM2 (OpenBMB) | offline TTS so the key-handover lines and trailer actually speak |
| Department fine-tune | MiniCPM-V 4.6 + LoRA | distilled to make a 1B region separate sincere from cruel instead of flatlining |
| Integration (vmPFC) | deterministic value network | reliability over a model that would drift |

Total weights are ≤ 5.3B. Nemotron 3 Nano earns the "prefrontal cortex" seat specifically because it's tuned to think cheaply — which matters when thinking is the resource a character can run out of.

The runtime: llama.cpp, and a Space that isn't a Gradio app

On the laptop the regions run on Ollama; in the Space they run on llama.cpp directly, serving GGUF weights pulled from the Hub at startup (the four sensing regions on MiniCPM, the voice on Nemotron, both quantized). Flip on airplane mode and every mind keeps thinking — there's no remote call to lose.

Two deployment details that cost me real time and might save you some:

  • Don't compile llama-cpp-python in the Space build. It's a 30-minute compile and a build timeout waiting to happen. I pull the prebuilt llama-server binary from the ggml-org releases instead — a ~5 MB .tar.gz — and cache it.
  • The Gradio-inside rule, the hard way. The hackathon wants Gradio in the app, but I wanted a custom canvas game, not a stock Gradio UI. The fix is the Docker SDK with the game served from FastAPI on / and a Gradio block mounted at /about via mount_gradio_app. Going through sdk: gradio doesn't work here — the SDK wrapper grabs port 7860 itself and collides with uvicorn ("address already in use"). That mounted-Gradio custom front is what earns the Off-Brand lane.
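The mounted-Gradio layout from the second bullet can be sketched roughly as follows. This is a sketch under assumptions, not the Space's actual entrypoint: `build_app`, the placeholder HTML, and the placeholder Markdown are hypothetical. The `/` and `/about` paths, port 7860, and `mount_gradio_app` come from the text above.

```python
ABOUT_PATH = "/about"  # where the Gradio block is mounted (from the post)
PORT = 7860            # the port named in the post


def build_app():
    """Return a FastAPI app with the game on / and a Gradio block on /about."""
    # Imported lazily so this module can be imported without the web stack.
    import gradio as gr
    from fastapi import FastAPI
    from fastapi.responses import HTMLResponse

    app = FastAPI()

    @app.get("/", response_class=HTMLResponse)
    def game() -> str:
        # Placeholder page; a real game would serve its canvas front end here.
        return "<canvas id='game'></canvas>"

    with gr.Blocks() as about:
        gr.Markdown("About this Space (placeholder text).")

    # mount_gradio_app attaches the Blocks as a sub-application and returns the app.
    return gr.mount_gradio_app(app, about, path=ABOUT_PATH)
```

The container's entrypoint would then start the single server itself, for example with `uvicorn.run(build_app(), host="0.0.0.0", port=PORT)`, so only one process ever binds the port.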

A lifespan that's real accounting

Every mind starts with 1000 thinking tokens of life. This is not a decorative counter — it's read straight from the runtime. Each region's generated tokens come from eval_count (Ollama) / usage.completion_tokens (llama.cpp), and life burned = the sensing cascade's generated tokens. When a character is afraid, the amygdala loops, those extra calls generate real tokens, and the life total drops — for nothing, because rumination doesn't move the decision.

One deliberate asymmetry: the dlPFC's spoken tokens are shown but not charged. The mouth is not the mind. What costs a life is thinking, not talking.
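The accounting rule is small enough to sketch. The helper names below (`generated_tokens`, `life_after_turn`) are hypothetical and the response dictionaries are trimmed to the one field that matters; the field names `eval_count` and `usage.completion_tokens` and the 1000-token starting life are the ones named above.

```python
from typing import Any, Mapping, Sequence, Tuple

START_LIFE = 1000  # thinking tokens each mind starts with (from the post)


def generated_tokens(response: Mapping[str, Any]) -> int:
    """Generated-token count from an Ollama-style or OpenAI-style response body."""
    if "eval_count" in response:  # Ollama
        return int(response["eval_count"])
    # llama.cpp server, OpenAI-compatible shape
    return int(response.get("usage", {}).get("completion_tokens", 0))


def life_after_turn(life: int,
                    sensing: Sequence[Mapping[str, Any]],
                    voice: Mapping[str, Any]) -> Tuple[int, int]:
    """Charge only the sensing cascade; return (remaining life, spoken tokens)."""
    burned = sum(generated_tokens(r) for r in sensing)
    spoken = generated_tokens(voice)  # displayed, never subtracted
    return max(0, life - burned), spoken


remaining, spoken = life_after_turn(
    START_LIFE,
    sensing=[{"eval_count": 12}, {"usage": {"completion_tokens": 8}}],
    voice={"eval_count": 40},
)
print(remaining, spoken)
```

Here the two sensing responses cost 20 tokens of life while the 40 spoken tokens are reported but leave the total untouched.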

At each quarter of life lost, the engine burns one biography fragment permanently (_burn_memories) — the hippocampus loses access to it, and the Forgotten panel shows what's gone. The core identity holds until death; the periphery erodes first. At zero, the mind is dark for the rest of the run. There's no reload.
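One way to express the quarter rule is as a pure function of remaining life, so a fragment burned once stays burned. This is a guess at the shape, not the engine's `_burn_memories`: the function names, the "core first, periphery last" list ordering, and the sample biography are assumptions.

```python
from typing import List


def quarters_lost(life: int, start: int = 1000) -> int:
    """How many full quarters of the starting life are gone (0..4)."""
    life = max(0, min(life, start))
    return (start - life) * 4 // start


def burn_fragments(fragments: List[str], forgotten: List[str],
                   life: int, start: int = 1000) -> None:
    """Move one fragment to `forgotten` per quarter lost, periphery first.

    Assumes `fragments` is ordered core first, periphery last, so popping
    from the end erodes the edges of the biography before the core.
    """
    while len(forgotten) < quarters_lost(life, start) and fragments:
        forgotten.append(fragments.pop())


# Made-up biography, purely illustrative.
biography = ["name", "trade", "hometown", "favourite song"]
gone: List[str] = []
burn_fragments(biography, gone, life=740)  # first quarter lost
burn_fragments(biography, gone, life=740)  # same life again: nothing new burns
burn_fragments(biography, gone, life=400)  # second quarter lost
print(biography, gone)
```

Because the burn count is derived from life rather than incremented per turn, calling it twice at the same life total cannot burn twice.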

Conviction you can only get locally

Because the regions are local, I read each one's conviction, defined as 1 − normalized token entropy over its top-k alternatives (src/mindlock/backend.py::_conviction), and surface it in the skull panel. It's a small thing that's quietly impossible on a hosted chat endpoint: the API gives you text, not the distribution behind it. Running the model yourself is the whole reason the skull can show you how sure the amygdala was, not just what it said.
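As a rough illustration of that measure, the sketch below computes 1 − normalized entropy for one token position from its top-k log-probabilities. It is not the contents of `_conviction`: renormalizing over the top-k, dividing by log k, and returning 1.0 when there are fewer than two alternatives are assumptions about how such a score is commonly built.

```python
import math
from typing import Sequence


def conviction(top_logprobs: Sequence[float]) -> float:
    """1 - normalized Shannon entropy over the renormalized top-k alternatives."""
    k = len(top_logprobs)
    if k < 2:
        return 1.0  # fewer than two alternatives: no measurable uncertainty
    m = max(top_logprobs)
    weights = [math.exp(lp - m) for lp in top_logprobs]  # shift for numerical stability
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(k)  # log(k) is the maximum possible entropy


unsure = conviction([math.log(0.25)] * 4)     # four equally likely alternatives
certain = conviction([0.0, -50.0, -50.0, -50.0])  # one alternative dominates
print(round(unsure, 3), round(certain, 3))
```

A uniform top-k scores near 0 and a single dominant token scores near 1, which is the range a panel can show as "how sure".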

Code rules, the model dreams

Death, the key, rapport, and the burn are deterministic game state. The models generate — they never get to fabricate the outcome. A 4B voice can decorate the story's spine but cannot rewrite who lives or whether the key turns. That boundary is what makes the stakes real instead of improv, and it's why the vmPFC is arithmetic: the displayed value and the actual outcome are the same object.

The honest part: latency

On the Space's free CPU (2 vCPU), a warm turn is ~35 seconds and a cold start is ~70. That is the real cost of running seven small-model calls per turn on a shared CPU box, and I'm not going to pretend otherwise. Two things make it livable:

  • Instant demo. Menu → 👁 Watch a mind replays a real recorded session with zero model calls, so the six-region cascade, the token burn, and a death are visible in ten seconds without waiting on hardware.
  • On a laptop with Ollama and any GPU, turns are conversational. The slowness is the free-tier CPU, not the architecture.

The published deliberation traces (the Sharing-is-Caring dataset) are these same cascades recorded to JSONL — one row per turn, every region's signal, token count, and conviction — so you can read a mind think without running anything at all.
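Reading such a trace needs nothing beyond the standard library. The sketch below assumes a row layout, since the dataset's real schema is not given here: the keys `turn`, `regions`, `signal`, `tokens`, and `conviction` are hypothetical, and the two sample rows are made up for illustration rather than taken from the dataset.

```python
import json
from typing import Dict, Iterable, Iterator


def read_turns(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def tokens_per_turn(turn: Dict) -> int:
    """Sum the generated tokens over every region recorded for this turn."""
    return sum(region.get("tokens", 0) for region in turn.get("regions", {}).values())


# Made-up rows in the assumed layout; a real file would be opened and iterated instead.
SAMPLE = [
    '{"turn": 1, "regions": {"amygdala": {"signal": 2, "tokens": 9, "conviction": 0.81},'
    ' "striatum": {"signal": 1, "tokens": 7, "conviction": 0.64}}}',
    "",
    '{"turn": 2, "regions": {"amygdala": {"signal": 8, "tokens": 31, "conviction": 0.92}}}',
]

turns = list(read_turns(SAMPLE))
for turn in turns:
    print(turn["turn"], tokens_per_turn(turn))
```

Passing an open file handle to `read_turns` works the same way, because a file iterates line by line.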

Links

Built small for the Hugging Face × Gradio hackathon — Thousand Token Wood track. Minds by OpenBMB MiniCPM and NVIDIA Nemotron, on llama.cpp, fully offline. Original soundtrack by DjinAscet.

