Multi-Agent Land
Launch an offline Gradio demo with automatic environment setup
This is where the field notes turn technical. Parts 1 and 2 made the pitch — a forest theater where tiny specialist models put on a show. This part opens the trapdoor and shows you the machinery under the stage. The promise it has to keep is the one from Part 1: a collaborative world-growth game, a convergent whodunit, and a twenty-questions duel are not three programs. They are the same four abstractions wearing different configs.
Here are the four: the ledger, the conductor, the agent, and the projection.
Everything else is configuration. And the shape of how they fit together is the whole design — one write side, one scheduler, thin emitters, and many pure reads:
The arrows only ever point one way into the ledger: agents append, everything else reads. There is no arrow from an agent to another agent. Hold that picture; the rest of this part is the four boxes, one at a time.
The ledger is the spine. Agents never call each other. They append events and subscribe to the kinds they care about. No direct coupling, no shared mutable state, no race over who writes what.
[run.started ] conductor {"seed": "A village of stage props wakes up…", "scenario": "thousand-token-wood"}
[world.observed] seedkeeper {"text": "A mossy ticket booth opens in a tree root."}
[agent.spoke ] pocket-actor {"text": "I am collecting echoes to knit a ladder to the moon."}
[judge.verdict] critic {"text": "Keep it — specific and playable."}
[user.injected] visitor {"text": "A lantern starts whispering recipes."}
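The rows above can be modelled with very little code. What follows is a minimal sketch, not the project's actual ledger: the `Event` fields, the `Ledger` class, and its `append` and `events` methods are all assumed names chosen for illustration. It shows only the two behaviours the text relies on: rows are added at the end, and nothing already written can be changed.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class Event:
    """One immutable ledger row (field names are illustrative)."""
    kind: str
    actor: str
    payload: Mapping[str, Any]


class Ledger:
    """Append-only log: rows can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, kind: str, actor: str, payload: dict[str, Any]) -> Event:
        # Copy the payload into a read-only mapping so that later mutation of
        # the caller's dict cannot rewrite history.
        event = Event(kind, actor, MappingProxyType(dict(payload)))
        self._events.append(event)
        return event

    def events(self) -> tuple[Event, ...]:
        # Hand out a tuple snapshot; readers never touch the internal list.
        return tuple(self._events)


ledger = Ledger()
ledger.append("run.started", "conductor", {"scenario": "thousand-token-wood"})
ledger.append("world.observed", "seedkeeper",
              {"text": "A mossy ticket booth opens in a tree root."})

for e in ledger.events():
    print(f"[{e.kind:<14}] {e.actor:<12} {dict(e.payload)}")
```

Because readers only ever receive a tuple of frozen rows, any number of views can be built from the same log without coordinating with each other.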
Every row is immutable. The stage you see, each agent's memory, the stats panel, the scrub-anywhere replay, the exported trace — all of them are projections derived from this one log. Three properties fall out of that, for free:
If the ledger is the stage, the conductor is the stage manager. It drives the loop, decides which agents act this turn, and refuses to let the show run away with your budget.
It schedules on two tracks, and most scenarios use both:
Subscriptions: an agent whose subscribes_to lists the kind of a newly appended event is queued to run before the next tick. A visitor drops a lantern → the Echo and the Seedkeeper react immediately. A clue is found → the Hypothesis-Former wakes up. This is event-driven reaction.
Ticks: an agent with schedule.tick_every: 3 fires on a fixed cadence regardless of what anyone said. This is how a judge synthesises every few turns, or a narrator keeps the world drifting even when the table goes quiet.
The two tracks run in a fixed order every step. Reactive agents drain first, so an agent that should answer a disturbance always speaks before the scheduled rhythm resumes.
The queueing rule is one method, and it carries the single most important guardrail in the whole scheduler — never queue an agent for its own event:
def _notify_subscribers(self, event: Event) -> None:
    for agent in self.scenario.agents:
        if event.actor == agent.name:  # never react to yourself
            continue
        if event.kind in agent.manifest.subscribes_to:
            self._trigger_queue.append((agent, event))
Both tracks run under the governor — the runtime safety valve. Many tiny models posting to a shared board is exactly the topology that produces a surprise bill, so the governor caps calls and spend on five axes, checked before every scheduled agent:
from dataclasses import dataclass

@dataclass
class Governor:
    max_turns: int = 100                    # the show ends after N turns
    max_calls_per_turn: int = 8             # no single turn fires more than N model calls
    max_total_calls: int = 500              # whole-run call cap
    max_total_tokens: int | None = None     # optional token ceiling
    hourly_budget_usd: float | None = None  # optional spend ceiling
When a bound trips it raises a named BudgetExceeded(reason=...) that the conductor turns
into a graceful run.finished rather than a hung process. The governor gets its own
treatment; here it's enough to see it sits on
the control path, not beside it — max_calls_per_turn is also what makes the subscription
model safe, because it bounds the fan-out a single cascade can produce within one turn.
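The check-then-spend pattern described above might look roughly like this. Everything beyond the field names and `BudgetExceeded(reason=...)` is an assumption made for the sketch: the counter fields, the `start_turn` and `charge_call` methods, and the choice to end the run on any tripped bound. The token and hourly-spend axes are left out to keep it short.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


@dataclass
class Governor:
    max_turns: int = 100
    max_calls_per_turn: int = 8
    max_total_calls: int = 500
    calls_this_turn: int = 0      # assumed bookkeeping, not from the source
    total_calls: int = 0

    def start_turn(self, turn: int) -> None:
        if turn > self.max_turns:
            raise BudgetExceeded(reason="max_turns")
        self.calls_this_turn = 0

    def charge_call(self) -> None:
        # Check before spending, so a tripped bound never costs a model call.
        if self.calls_this_turn >= self.max_calls_per_turn:
            raise BudgetExceeded(reason="max_calls_per_turn")
        if self.total_calls >= self.max_total_calls:
            raise BudgetExceeded(reason="max_total_calls")
        self.calls_this_turn += 1
        self.total_calls += 1


def run(governor: Governor, agents_per_turn: int) -> str:
    """Loop until a bound trips, then report it like a graceful run.finished."""
    turn = 0
    try:
        while True:
            turn += 1
            governor.start_turn(turn)
            for _ in range(agents_per_turn):
                governor.charge_call()   # a real conductor would call a model here
    except BudgetExceeded as exc:
        return f"run.finished: {exc.reason}"


first = run(Governor(max_turns=3), agents_per_turn=2)
second = run(Governor(max_calls_per_turn=2), agents_per_turn=5)
print(first)    # run.finished: max_turns
print(second)   # run.finished: max_calls_per_turn
```

The point of the sketch is the placement: the governor is consulted on the path to every call, so a runaway cascade hits a named limit instead of hanging the process.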
What broke, and what it taught us. An early agent that both subscribed to and emitted agent.spoke re-triggered itself on its own event: a one-agent feedback loop that burned the per-turn call cap before the judge ever fired. The fix is the event.actor == agent.name line above: never queue an agent for its own event. Self-cascade is the first bug a subscription model invites; name it and guard it early.
An agent is deliberately thin. The base interface is a single method — read the context the engine assembled, emit one typed event:
from abc import ABC, abstractmethod

class Agent(ABC):
    name: str

    @abstractmethod
    def act(
        self,
        run_id: str,
        turn: int,
        projection: StageProjection,       # the current stage, folded from the log
        recent_events: tuple[Event, ...],  # the window it's allowed to see
    ) -> Event:                            # exactly one event back
        ...
It owns two things: a persona string and the single typed event it emits this turn.
It does not own its prompt layout, its memory, or any knowledge of the other agents. The
concrete ManifestAgent is driven entirely by a YAML manifest — persona, what it subscribes
to, what it may emit, its schedule, and crucially, a logical model profile rather than a
concrete model:
name: clue-gatherer
role: worker
persona: >
  You are a careful Clue Gatherer. Extract exactly one new, concrete clue
  from the current scene.
subscribes_to: []
may_emit: [agent.spoke]
schedule:
  tick_every: 1
model_profile: fast   # tiny ≤4B · fast ≤7B · balanced ≤13B · strong ≤32B
memory:
  window: 8
That model_profile is the swap point. Because an agent declares only a profile, the
ModelRouter can place a different small model behind each one — a ≤4B worker next to a
stronger judge — without the agent knowing or caring:
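As a rough illustration of that swap point, a router can be as small as a lookup from profile name to model id. This is a sketch, not the project's `ModelRouter`: the `routes` field, the `resolve` method, the `example/...` model ids, and the pairing of `mystery-judge` with the `strong` profile are all placeholders invented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRouter:
    """Maps a logical profile to a concrete model id (ids are placeholders)."""
    routes: dict[str, str]

    def resolve(self, profile: str) -> str:
        try:
            return self.routes[profile]
        except KeyError:
            raise ValueError(f"no model configured for profile {profile!r}") from None


router = ModelRouter({
    "tiny": "example/tiny-4b",
    "fast": "example/fast-7b",
    "strong": "example/strong-32b",
})

# Agents name only a profile; the cast below never mentions a model.
cast = {"clue-gatherer": "fast", "mystery-judge": "strong"}
for agent, profile in cast.items():
    print(agent, "->", router.resolve(profile))

# Swapping the model behind "fast" touches the router, not the agents.
swapped = ModelRouter({**router.routes, "fast": "example/other-7b"})
print("clue-gatherer ->", swapped.resolve(cast["clue-gatherer"]))
```

Keeping the mapping outside the manifest is what lets one cast run against different models without editing any agent definition.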
The "agents never call each other" rule isn't decoration; it's the originality hook. The cast of a four-player bluff game are four agents that have never exchanged a line. They speak to the ledger; the ledger speaks to the world. The multi-agent drama is an emergent property of typed events and pure projections — no agent framework, no message bus, no shared memory store.
What broke, and what it taught us. The first time we ran live, the shared blackboard wasn't actually shared. Agents saw the world text and their own past lines, but not what their castmates had just said — so a small model with nothing new to react to looped on the same line, every turn, only one voice ever speaking. The deterministic offline stub hid it completely, because its responses don't depend on context shape. The fix lives in the context builder; the lesson lives here: decoupling agents is the goal, but decoupled is not the same as deaf. They must still hear each other through the ledger.
A projection is a pure function from the event list to some view. The whole stage is one small reducer — fold each event into a mutable snapshot:
def rebuild_stage(events: tuple[Event, ...], run_id: str | None = None) -> StageProjection:
    projection = StageProjection()
    if run_id is not None:
        events = tuple(e for e in events if e.run_id == run_id)
    for event in events:
        projection.apply(event)  # world.observed → current_scene; agent.spoke → notes; …
    return projection
The stats panel folds calls and tokens. Each agent's memory is a filtered fold over the
events it's allowed to see. The Fishbowl's scrub-anywhere replay is the same fold over a
prefix of the log — rebuild_stage(events[:k]) — which is why scrubbing back through a
past show costs zero model calls: it's just the projection run over fewer events.
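The prefix trick is easy to try on a toy log. The sketch below uses simplified stand-ins: this `Event` has only `kind` and `text`, and this `StageProjection.apply` handles just the two event kinds mentioned in the reducer's comment. The real classes carry more fields; only the shape of the fold is the point here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    kind: str
    text: str


@dataclass
class StageProjection:
    current_scene: str = ""
    agent_notes: list[str] = field(default_factory=list)

    def apply(self, event: Event) -> None:
        if event.kind == "world.observed":
            self.current_scene = event.text
        elif event.kind == "agent.spoke":
            self.agent_notes.append(event.text)
        # Other kinds are ignored by this toy projection.


def rebuild_stage(events: tuple[Event, ...]) -> StageProjection:
    projection = StageProjection()
    for event in events:
        projection.apply(event)
    return projection


events = (
    Event("world.observed", "A mossy ticket booth opens in a tree root."),
    Event("agent.spoke", "I am collecting echoes to knit a ladder to the moon."),
    Event("world.observed", "A lantern starts whispering recipes."),
)

# Scrub to position k by folding only the first k events.
for k in range(len(events) + 1):
    stage = rebuild_stage(events[:k])
    print(k, repr(stage.current_scene), len(stage.agent_notes))
```

Folding the same prefix twice yields equal projections, which is why scrubbing needs no model call and no cached snapshots.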
This is event sourcing plus CQRS in its plainest form: one write side (the ledger), many read sides (each agent's memory, the stage, the stats, the judge's transcript, the exported trace).
The Show is just one read side: the stage, the cards, the feed, and the meters are all pure projections of the same append-only log.
The observer is the cleanest expression of the rule. It consumes events read-only and
computes a ViewDiff — the delta to render — and it never appends:
def consume(self, event: Event) -> ViewDiff:
    prev_scene = self._view.current_scene
    prev_notes = list(self._view.agent_notes)
    self._view.apply(event)  # advance the read-side snapshot
    diff = ViewDiff(
        scene_changed=self._view.current_scene != prev_scene,
        new_agent_notes=[n for n in self._view.agent_notes if n not in prev_notes],
        # … new_judge_notes, new_user_artifacts
    )
    if diff.has_changes:
        for cb in self._callbacks:  # push to UI / SSE / WebSocket
            cb(diff)
    return diff
Rendering is a camera crew, not an actor. The world runs identically whether or not anyone is watching, you can attach several observers to one ledger at once (a stage view and a feed and a split table), and post-hoc analysis is just another observer fed a saved log. Cognition and presentation never touch.
The proof that the abstraction holds is that wildly different cognitive shapes need no engine changes — only config. Two scenarios, two YAML files, opposite scheduling topologies on the same conductor:
Thousand Token Wood is divergent. The scene gets stranger turn by turn; a seedkeeper narrates, a pocket actor wants impossible things, an echo transforms visitor disturbances, a critic decides what becomes real. Scheduling is loose and round-robin-ish. There is no winner — the ledger is the story.
Mystery Roots is convergent. A mystery is stated, a clue-gatherer extracts evidence, a
hypothesis-former proposes, a devil's advocate attacks, and a judge rules. Scheduling is a
tight multi-phase cycle that narrows toward an answer. The difference between the two lives
entirely in their config — cast list, schedule, and a competition block:
# mystery-roots.yaml
cast:
  - clue-gatherer
  - hypothesis-former
  - devils-advocate
  - mystery-judge
competition:
  kind: judged

# twenty-sprouts.yaml
cast:
  - sprout-guesser
  - secret-keeper
  - sprout-judge
competition:
  kind: versus
  teams:
    guesser: [sprout-guesser]
    keeper: [secret-keeper]
Same conductor. Same ledger. Same governor. Same context builder. Same memory. Different cast, different schedule, different cognitive shape — and the difference is entirely in those YAML files. The engine is plumbing; the scenario is data.
This isn't an aspiration we're trusting ourselves to honour. tests/test_modularity.py
builds every scenario config and asserts the invariants hold, so the first time someone
accidentally needs an engine edit to ship a new world, a test goes red. Today that's eight
scenarios standing on one engine.
| Layer | Choice | Why |
|---|---|---|
| UI | Gradio (custom-themed Fishbowl) | Fast for prototyping |
| Event schema | Pydantic v2, extra="forbid" | Strict validation; a stray field is a loud error |
| Scheduling | Two-track: subscriptions + ticks | Reaction and heartbeat; reactive drains before the tick batch |
| Models | Small models (≤32B) behind a profile router | One cast can run several sponsor models at once |
| Memory | Ledger view, no separate store | Consistency, crash recovery, and testability for free |
| Rendering | Read-only Observer → ViewDiff | Cognition and presentation never touch; N observers per ledger |
| Orchestration | In-process, synchronous conductor | Right size for a live demo; durable execution is available when a run needs it |
You can check out the code on GitHub, and the live demo at Hugging Face Spaces.