# HearthNet: Hackathon Final Step Plan

*Prepared June 14, 2026 · deadline June 15*

This document is the ground truth for what is fixed, what is still open, and what
to do next, in exact priority order. Every item has a file reference.

---

## Part 1 β€” Bugs Fixed in This Session

All of these were silent failures that made the live demo diverge from the
architecture described in the README.

### FIX-1 · `node.start()` never set `_started = True`

**File:** [hearthnet/node.py](hearthnet/node.py#L628)  
**Symptom:** `node.stop()` was guarded by `if not self._started: return` and therefore
always exited immediately without cancelling background tasks, shutting down the HTTP
server, or stopping mDNS. Double-starting was also possible.  
**Fix:** Added `self._started = True` at the end of `start()`, just before the
"HearthNode ready" log line.

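The flag handling described in FIX-1 can be sketched as follows. This is a minimal illustration, not the project's `HearthNode`: the class name `LifecycleSketch` and the `stop_ran` attribute are hypothetical, and the real startup and teardown work is reduced to comments.

```python
import asyncio


class LifecycleSketch:
    """Hypothetical stand-in for HearthNode; only the flag handling is shown."""

    def __init__(self) -> None:
        self._started = False
        self.stop_ran = False  # illustration only: records that cleanup ran

    async def start(self) -> None:
        if self._started:  # guard against double-start
            return
        # ... open transports, spawn background tasks ...
        self._started = True  # set last, once startup has succeeded

    async def stop(self) -> None:
        if not self._started:  # nothing to tear down
            return
        # ... cancel tasks, shut down HTTP server, stop mDNS ...
        self.stop_ran = True
        self._started = False  # reset so the node can be started again


async def _demo() -> LifecycleSketch:
    node = LifecycleSketch()
    await node.start()
    await node.stop()
    return node


node = asyncio.run(_demo())
```

Setting the flag as the last step of `start()` means a failed startup leaves `_started` false, so `stop()` stays a no-op in that case.
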
### FIX-2 · Silent exception swallowing in `ChatService.send()`

**File:** [hearthnet/services/chat/service.py](hearthnet/services/chat/service.py#L112)  
**Symptom:** When `event_log` path failed (disk full, SQLite lock, etc.) the exception
was swallowed with a bare `except Exception: pass`. Messages appeared sent but were
never persisted. Failure was completely invisible to operators.  
**Fix:** Replaced with `_log.warning(...)` so failures appear in logs while graceful
fallback to in-memory mode is preserved.
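
A minimal sketch of the log-and-fall-back pattern, assuming a simplified service: `ChatSketch`, `FailingEventLog`, and the `append()` method are hypothetical names used for illustration, not the actual `ChatService` API.

```python
import logging

_log = logging.getLogger("hearthnet.chat.sketch")


class FailingEventLog:
    """Hypothetical event log that always fails, to exercise the fallback."""

    def append(self, event: dict) -> None:
        raise OSError("disk full")


class ChatSketch:
    def __init__(self, event_log=None) -> None:
        self._event_log = event_log
        self._memory: list = []  # in-memory fallback store

    def send(self, body: str) -> dict:
        msg = {"body": body}
        if self._event_log is not None:
            try:
                self._event_log.append(msg)
            except Exception as exc:
                # Previously `pass`; the failure is now visible to operators.
                _log.warning("event_log append failed, keeping message in memory: %s", exc)
        self._memory.append(msg)
        return msg


chat = ChatSketch(event_log=FailingEventLog())
sent = chat.send("hello")
```

The caller still gets a message back, but the warning makes the persistence failure traceable in the logs.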

### FIX-3 · `UTC = UTC` dead re-assignments

**Files:**
- [hearthnet/services/chat/service.py](hearthnet/services/chat/service.py#L9)
- [hearthnet/services/marketplace/service.py](hearthnet/services/marketplace/service.py#L9)

**Symptom:** A copy-paste artifact. `UTC` was defined on one line and immediately
re-assigned to itself on the next. Harmless but signals unreviewed code.  
**Fix:** Removed the duplicate assignments and cleaned up import ordering.

### FIX-4 · `RagService` wrote corpora to current working directory

**File:** [hearthnet/services/rag/service.py](hearthnet/services/rag/service.py#L21)  
**Symptom:** `corpora_dir` defaulted to `Path(".")`. On HF Space the cwd is the
(potentially read-only) repo root. Ingest appeared to work but all corpus data was
written to an unreliable location and lost on restart.  
**Fix:** Default changed to `Path.home() / ".hearthnet" / "corpora"`. Always writable
on local machines; overridden explicitly in `app.py` using `HEARTHNET_DATA_DIR`.
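
One way to express the default plus the `HEARTHNET_DATA_DIR` override is sketched below. The helper name `resolve_corpora_dir` and the `corpora` subdirectory under the data directory are assumptions for illustration; the actual wiring in `app.py` may differ.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_CORPORA_DIR = Path.home() / ".hearthnet" / "corpora"


def resolve_corpora_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Hypothetical helper: prefer HEARTHNET_DATA_DIR, else the home default."""
    env = os.environ if env is None else env
    data_dir = env.get("HEARTHNET_DATA_DIR")
    if data_dir:
        # Assumed layout: corpora live in a subdirectory of the data dir.
        return Path(data_dir) / "corpora"
    return DEFAULT_CORPORA_DIR
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.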

### FIX-5 · Seed corpus was never actually ingested

**Files:** [app.py](app.py#L360), [hearthnet/services/rag/service.py](hearthnet/services/rag/service.py#L103)  
**Symptom (two parts):**
1. `_seed_corpus()` sent `{"documents": [...]}` but `handle_ingest()` only read
   `inp.get("text", "")`, so every seed call indexed an empty string. The 10-document
   emergency corpus (water safety, CPR, first aid…) was silently empty.
2. `asyncio.run(_seed_corpus())` failed silently when a loop was already running
   (Gradio may have started one first), suppressed by `contextlib.suppress(Exception)`.

**Fix (part 1):** Added batch-document dispatch to `handle_ingest`: detects
`{"documents": [...]}`, re-dispatches each as a single-document call, returns
`{"batch": [...], "count": N}`.  
**Fix (part 2):** Replaced `asyncio.run()` with a dedicated daemon thread that creates
its own event loop, so it cannot conflict with any running loop. A 60 s timeout keeps it
from blocking Space startup.
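
Both parts of the fix can be sketched together. This is an illustration under stated assumptions: `IngestSketch` and `seed_in_thread` are hypothetical names, indexing is reduced to a list append, the per-document return value is invented for the example, and the sample documents are placeholders.

```python
import asyncio
import threading


class IngestSketch:
    """Hypothetical stand-in for RagService; indexing is a list append."""

    def __init__(self) -> None:
        self.indexed: list = []

    async def handle_ingest(self, inp: dict) -> dict:
        docs = inp.get("documents")
        if isinstance(docs, list):
            # Batch form: re-dispatch each entry as a single-document call.
            results = [await self.handle_ingest(doc) for doc in docs]
            return {"batch": results, "count": len(results)}
        text = inp.get("text", "")
        self.indexed.append(text)
        return {"ok": bool(text)}  # assumed shape of a single-document result


def seed_in_thread(service: IngestSketch, payload: dict, timeout: float = 60.0) -> dict:
    """Run the ingest on a fresh event loop inside a daemon thread."""
    box: dict = {}

    def _worker() -> None:
        loop = asyncio.new_event_loop()  # private loop, independent of any running one
        try:
            box["result"] = loop.run_until_complete(service.handle_ingest(payload))
        finally:
            loop.close()

    thread = threading.Thread(target=_worker, daemon=True, name="seed-corpus")
    thread.start()
    thread.join(timeout)  # bounded wait so startup is never blocked indefinitely
    return box.get("result", {})


svc = IngestSketch()
out = seed_in_thread(svc, {"documents": [{"text": "doc one"}, {"text": "doc two"}]})
```

Because the worker owns its loop, the call behaves the same whether or not another loop is already running in the main thread.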

### FIX-6 · Sticky session memory leak in Router

**File:** [hearthnet/bus/router.py](hearthnet/bus/router.py#L54)  
**Symptom:** `_sticky: dict[str, CapabilityEntry]` grew without bound. On a long-lived
community node serving thousands of sessions, this is a real memory leak.  
**Fix:** Added `_MAX_STICKY_SESSIONS = 10_000` cap. Dict is insertion-ordered;
when the cap is hit, the oldest entries by insertion order are evicted before
adding a new one.
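
The eviction rule can be sketched with a plain dict. `StickyTable` and `pin()` are hypothetical names, entries are plain strings instead of `CapabilityEntry`, and the cap is shrunk to 3 so the behaviour is easy to see.

```python
_MAX_STICKY_SESSIONS = 3  # tiny cap for illustration; the fix uses 10_000


class StickyTable:
    def __init__(self, cap: int = _MAX_STICKY_SESSIONS) -> None:
        self._cap = cap
        self._sticky: dict = {}

    def pin(self, session_id: str, entry: str) -> None:
        if session_id in self._sticky:
            self._sticky[session_id] = entry  # update in place, order unchanged
            return
        while len(self._sticky) >= self._cap:
            oldest = next(iter(self._sticky))  # dicts iterate in insertion order
            del self._sticky[oldest]
        self._sticky[session_id] = entry


table = StickyTable()
for i in range(5):
    table.pin(f"s{i}", f"node-{i}")
```

After five inserts with a cap of three, only the three most recently inserted sessions remain.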

### FIX-7 · `app.py` seed corpus wrote to wrong directory

**File:** [app.py](app.py#L350)  
**Symptom:** `RagService` was created without `corpora_dir`, so it used the cwd
default (fixed in FIX-4). On HF Space this is the repo root, which may not be
writable and is not in `HEARTHNET_DATA_DIR`.  
**Fix:** `app.py` now derives `_corpora_dir` from `HEARTHNET_DATA_DIR` (same pattern
as the event log and blob store) and passes it explicitly to `RagService`.

---

## Part 2 β€” Outstanding Issues (prioritised)

These are open gaps that still make the demo diverge from the architecture.
In order of hackathon impact.

---

### OPEN-1 · Relay hub roster lost on Space restart (HIGH)

**File:** [hearthnet/transport/relay_hub.py](hearthnet/transport/relay_hub.py#L58)  
**Problem:** `RelayHub._members` is an in-memory Python dict. HF Spaces restart their
containers regularly (zero-GPU timeout, quota rotation). Every restart evicts all
peers. A node that joined yesterday silently disappears.  
**Impact:** The entire internet-mesh story breaks after the first Space restart.
Any user who joined via QR invite has to re-join manually.

**Fix approach:**
```python
# relay_hub.py: add SQLite-backed persistence
import json
import sqlite3
from pathlib import Path

class RelayHub:
    def __init__(
        self,
        *,
        db_path: Path | None = None,
        # ... other existing keyword arguments unchanged
    ):
        self._db = sqlite3.connect(str(db_path or ":memory:"), check_same_thread=False)
        self._db.execute("""
            CREATE TABLE IF NOT EXISTS members (
                node_id TEXT PRIMARY KEY,
                data TEXT NOT NULL,   -- JSON _Member fields
                last_seen REAL NOT NULL
            )
        """)
        self._db.commit()
        self._restore_members()  # reload on startup
```
Call `_persist_member()` inside `join()`, and have `_prune_stale()` delete pruned members from SQLite.
Estimated effort: **3 hours**.
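
The persistence helpers named above could look roughly like this. It is a sketch only: `RosterStore` is a hypothetical standalone class, the member payload is a plain dict instead of `_Member`, and the table layout is copied from the fix approach.

```python
import json
import sqlite3


class RosterStore:
    """Hypothetical helper holding the SQLite side of the relay roster."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self._db = sqlite3.connect(db_path, check_same_thread=False)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS members ("
            "node_id TEXT PRIMARY KEY, data TEXT NOT NULL, last_seen REAL NOT NULL)"
        )
        self._db.commit()

    def persist_member(self, node_id: str, data: dict, last_seen: float) -> None:
        # Upsert so a re-join refreshes the row instead of failing on the key.
        self._db.execute(
            "INSERT OR REPLACE INTO members (node_id, data, last_seen) VALUES (?, ?, ?)",
            (node_id, json.dumps(data), last_seen),
        )
        self._db.commit()

    def prune_stale(self, cutoff: float) -> int:
        # Delete rows last seen before the cutoff; returns how many were removed.
        cur = self._db.execute("DELETE FROM members WHERE last_seen < ?", (cutoff,))
        self._db.commit()
        return cur.rowcount

    def restore_members(self) -> dict:
        rows = self._db.execute("SELECT node_id, data FROM members").fetchall()
        return {node_id: json.loads(data) for node_id, data in rows}


store = RosterStore()
store.persist_member("node-a", {"name": "A"}, last_seen=100.0)
store.persist_member("node-b", {"name": "B"}, last_seen=200.0)
removed = store.prune_stale(cutoff=150.0)
restored = store.restore_members()
```

On a Space, the database file would need to live under the persistent data directory for the roster to survive a restart; `:memory:` is used here only to keep the example self-contained.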

---

### OPEN-2 · `node.start()` not called in `app.py`: mDNS/HTTP transport silent (HIGH)

**File:** [app.py](app.py#L395)  
**Problem:** `app.py` manually wires services but never calls `await node.start()`.
This means:
- mDNS and UDP peer discovery never start → nodes can't find each other on LAN
- The FastAPI HTTP transport never starts → remote peers can't call this node's bus
  via port 7080
- The gossip sync loop never starts → event log is local-only

**Why it was deferred:** HF Space runs in a ZeroGPU container without mDNS
capability, so the Space itself benefits less. But local nodes launched via
`python app.py` also miss these features.

**Fix approach:**
The Space should still avoid `node.start()` (no mDNS, public port not exposed).
Local nodes should call `node.start()` and get the full stack.
Solution: gate on whether we're on HF Space:

```python
# in app.py _build_node(), at the end:
if not os.getenv("SPACE_HOST"):
    # Local dev: start full networking stack
    import asyncio, threading
    def _start_node():
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(node.start(port=7080))
        loop.run_forever()
    threading.Thread(target=_start_node, daemon=True, name="hearthnet-node").start()
```

Estimated effort: **2 hours**. Needs testing that the HTTP server doesn't conflict
with Gradio's port.

---

### OPEN-3 · Event log not injected into Marketplace/Chat at runtime (MEDIUM)

**Files:** [hearthnet/node.py](hearthnet/node.py#L344), [app.py](app.py#L391)  
**Problem:** `node.start()` injects `event_log` into `RagService` (line 606).
But `ChatService` and `MarketplaceService` get their `event_log` only if passed at
construction, which `install_services()` doesn't do (no event_log known yet).
On HF Space `app.py` passes it correctly, but local nodes using `install_services()`
get in-memory chat/marketplace.

**Fix:** In `node.install_services()`, store references to the service instances so
`node.start()` can inject event_log into them alongside RagService:

```python
# node.py install_services(): keep references
self._chat_service = ChatService(self.node_id, bus=self.bus)
self._market_service = MarketplaceService()
...

# node.start(): inject after EventLog is open
if self._chat_service is not None:
    self._chat_service._event_log = self._event_log
if self._market_service is not None:
    self._market_service._event_log = self._event_log
```

Estimated effort: **1 hour**.

---

### OPEN-4 · Token `exp` claim not enforced in Router (MEDIUM)

**File:** [hearthnet/bus/router.py](hearthnet/bus/router.py)  
**Problem:** M16 capability tokens have an `exp` field in their JWT-style payload.
`AuthService` validates signatures but the router's `route()` method never checks
expiry. An expired token grants permanent access.

**Fix:** Add expiry check in `CapabilityEntry.is_authorized(token)` or in the bus
`handle_call()` before routing:

```python
# bus/__init__.py handle_call()
if req.token:
    exp = parse_token_exp(req.token)
    if exp and time.time() > exp:
        return {"error": "token_expired", "message": "Capability token has expired"}
```

Estimated effort: **2 hours**.
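
The `parse_token_exp` helper used above is not shown in this plan. A possible shape, assuming the tokens follow a JWT-like `header.payload.signature` layout with a base64url-encoded JSON payload (an assumption; the M16 wire format is not specified here), is sketched below. `is_expired` and `_make_token` are hypothetical helpers, and no signature verification is performed.

```python
import base64
import json
import time
from typing import Optional


def parse_token_exp(token: str) -> Optional[float]:
    """Return the `exp` claim, or None if it is absent or the token is unparseable."""
    try:
        payload_b64 = token.split(".")[1]
        padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
        payload = json.loads(base64.urlsafe_b64decode(padded))
        exp = payload.get("exp")
        return float(exp) if exp is not None else None
    except (IndexError, ValueError, TypeError, AttributeError):
        return None


def is_expired(token: str, now: Optional[float] = None) -> bool:
    exp = parse_token_exp(token)
    now = time.time() if now is None else now
    return exp is not None and now > exp


def _make_token(claims: dict) -> str:
    """Demo-only helper: builds an unsigned token in the assumed layout."""
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
    return f"e30.{body}.sig"


expired = _make_token({"exp": 1_000})
valid = _make_token({"exp": 2_000})
```

Whether a token without an `exp` claim should be accepted or rejected is a policy choice; this sketch treats it as non-expiring, matching the `if exp and ...` check above.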

---

### OPEN-5 · Live mesh topology not auto-refreshing in Mesh tab (LOW)

**File:** [hearthnet/ui/tabs/mesh.py](hearthnet/ui/tabs/mesh.py)  
**Problem:** `WebSocketPubSub` (X06) publishes `peer.discovered` and
`emergency.mode.changed` events. The Mesh tab renders a static SVG that only
updates when the user manually clicks "Refresh". Judges see a static graph even
when peers join live.

**Fix:** Use Gradio's `gr.Timer` (≥4.x) or a polling interval to auto-refresh the SVG
every 5 seconds:

```python
# mesh.py
timer = gr.Timer(value=5)
timer.tick(fn=refresh_topology, outputs=topology_svg)
```

Estimated effort: **30 minutes**.

---

### OPEN-6 · Peer capability matrix missing from Mesh tab (LOW)

**File:** [hearthnet/ui/tabs/mesh.py](hearthnet/ui/tabs/mesh.py)  
**Problem:** The Mesh tab shows peer nodes as SVG circles but gives no indication of
what each peer can do. A judge can't see that Node B has `ocr.extract` but not
`llm.chat`.

**Fix:** Add a `gr.DataFrame` below the topology SVG:

```python
def _capability_matrix(bus) -> list[list]:
    rows = []
    for peer in bus.registry.all_remote():
        rows.append([peer.node_id[:16], peer.descriptor.name, "✓"])
    return rows

cap_df = gr.DataFrame(headers=["Node", "Capability", "Status"])
refresh_btn.click(fn=lambda: _capability_matrix(bus), outputs=cap_df)
```

Estimated effort: **1 hour**.

---

### OPEN-7 · Routing trace is raw text, not visual (LOW)

**File:** [hearthnet/ui/tabs/ask.py](hearthnet/ui/tabs/ask.py)  
**Problem:** The `_routed_via` field is shown as plain text. The README shows a flow
diagram. Judges get `"local-abc123"` instead of `"🏠 Local · score 0.94 · 23 ms"`.

**Fix:** Parse `_routed_via` and render a formatted badge:

```python
def _format_route(routed_via: str, ms: int) -> str:
    if routed_via.startswith("local"):
        return f"🏠 **Local** · {ms} ms"
    return f"🌐 **Remote** `{routed_via[:16]}` · {ms} ms"
```

Estimated effort: **30 minutes**.

---

## Part 3: Prize Actions (deadline June 15)

| # | Action | Effort | Prize target |
|---|--------|--------|--------------|
| P1 | Record 2–4 min demo video (OBS/Loom) | 2 h | All prizes (mandatory) |
| P2 | Post on X @zX14_7 with Space link + video | 15 min | Best Demo badge |
| P3 | Set `NVIDIA_API_KEY` in HF Space secrets | 5 min | Nemotron RTX 5080 |
| P4 | Deploy `app_nemotron.py` as second HF Space | 30 min | NVIDIA + Off Brand |
| P5 | Set `MINICPM_URL` or swap default model to MiniCPM3-4B | 1 h | OpenBMB $2,500 |
| P6 | `modal deploy scripts/modal_deploy.py` + set secret | 1 h | Modal $10k credits |
| P7 | GitHub Codex commits in mirrored repo | 2 h | OpenAI $5,000 |

**P1 demo video script** (exact flow judges want to see):
1. Open HF Space → all 8 tabs visible
2. Ask tab: type "What do I do if water is cut off?" → show RAG answer + routing trace
3. Toggle Agent Mode → ask multi-step question → show Thought/Tool/Observation steps
4. Mesh tab: show live topology SVG (even single node is fine)
5. Chat tab: send a message to self / another node
6. Emergency tab: click "Check Connectivity" → show probe results
7. Settings tab: generate invite QR code
8. 10-second clip of `app_nemotron.py` doing structured extraction

---

## Part 4: Test Additions Needed

| Test | What it covers | File to create |
|------|----------------|----------------|
| `test_node_started_flag` | `node.start()` sets `_started=True`; `node.stop()` resets it and cancels tasks | `tests/test_node_lifecycle.py` |
| `test_rag_documents_batch` | `handle_ingest` with `{"documents": [...]}` indexes all docs | `tests/test_rag_ingest_batch.py` |
| `test_sticky_session_eviction` | Router evicts oldest sessions at `_MAX_STICKY_SESSIONS` cap | `tests/test_bus_router_memory.py` |
| `test_chat_service_log_on_error` | Exception in event_log path is logged, not swallowed | `tests/test_chat_service.py` |
| `test_corpora_dir_default` | `RagService()` uses `~/.hearthnet/corpora`, not `cwd` | `tests/test_rag_service_defaults.py` |
| `test_relay_hub_sqlite` | Relay hub persists member on join; restores on init | `tests/test_relay_persistence.py` |

---

## Part 5: Deployment Checklist (HF Space)

```
[ ] NVIDIA_API_KEY secret set → Nemotron backend auto-activates
[ ] MODAL_ENDPOINT secret set → Modal backend auto-activates
[ ] MINICPM_URL secret set    → MiniCPM backend auto-activates
[ ] HEARTHNET_DATA_DIR set    → persistent data survives Space restarts
    recommended: /data/hearthnet  (HF Spaces /data is persistent)
[ ] Confirm Space runs on ZeroGPU (not CPU-only)
[ ] Demo video URL in README
[ ] Social post URL in README
```

---

## Part 6: Local Node Checklist (after deadline)

```
[ ] pip install hearthnet  → publish to PyPI (pyproject.toml already correct)
[ ] node.start() for local mode (OPEN-2)
[ ] ChatService / MarketplaceService event_log injection (OPEN-3)
[ ] Relay hub SQLite persistence (OPEN-1)
[ ] Token expiry enforcement (OPEN-4)
[ ] Auto-refresh Mesh topology (OPEN-5)
[ ] Capability matrix in Mesh tab (OPEN-6)
[ ] Routing trace badge in Ask tab (OPEN-7)
[ ] E2E encryption on by default for chat (M23 wired but inactive)
[ ] Real LoRa hardware integration (M29 stub → serial port)
```

---

## Summary Table

| Item | Status | Impact |
|------|--------|--------|
| FIX-1 `_started` flag | ✅ Done | stop() now works; no double-start |
| FIX-2 chat exception swallowing | ✅ Done | Failures visible in logs |
| FIX-3 UTC=UTC duplicates | ✅ Done | Code quality |
| FIX-4 corpora_dir default | ✅ Done | Corpus writes to correct location |
| FIX-5 seed corpus not ingested | ✅ Done | Emergency knowledge base works |
| FIX-6 sticky session leak | ✅ Done | Long-lived nodes safe |
| FIX-7 app.py corpora_dir | ✅ Done | HF Space corpus in data dir |
| OPEN-1 relay hub persistence | ✅ Done | SQLite roster survives restart |
| OPEN-2 node.start() in app.py | ✅ Done | Local mDNS + HTTP transport active |
| OPEN-3 event_log injection | ✅ Done | Chat/Marketplace persist locally |
| OPEN-4 token expiry | ✅ Done | exp claim checked in handle_call() |
| OPEN-5 auto-refresh topology | ✅ Done | Mesh tab refreshes every 10 s |
| OPEN-6 capability matrix | ✅ Done | Already in get_mesh() JSON output |
| OPEN-7 routing trace badge | ✅ Done | 🏠/🌐 badge replaces raw JSON |
| Doc folder ingestion | ✅ Done | docs/guides/ + assets/initial_docs/ |
| P1 demo video | ⬜ CRITICAL | All prizes blocked without it |
| P2 social post | ⬜ CRITICAL | Best Demo badge |
| P3 NVIDIA_API_KEY | ⬜ HIGH | RTX 5080 prize |