# HearthNet: Hackathon Final Step Plan
*Prepared June 14, 2026 · deadline June 15*
This document is the ground truth for what is fixed, what is still open, and what
to do next, in exact priority order. Every item has a file reference.
---
## Part 1: Bugs Fixed in This Session
All of these were silent failures that made the live demo diverge from the
architecture described in the README.
### FIX-1 · `node.start()` never set `_started = True`
**File:** [hearthnet/node.py](hearthnet/node.py#L628)
**Symptom:** `node.stop()` was guarded by `if not self._started: return` and therefore
always exited immediately without cancelling background tasks, shutting down the HTTP
server, or stopping mDNS. Double-starting was also possible.
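The guard and the missing flag assignment can be sketched in isolation. This is a minimal illustration, not the real `HearthNode`: the class name `LifecycleSketch`, the `_tasks` list, and the placeholder background task are assumptions made for the example.

```python
import asyncio


class LifecycleSketch:
    """Hypothetical stand-in for HearthNode; only the flag handling is shown."""

    def __init__(self) -> None:
        self._started = False
        self._tasks: list = []

    async def start(self) -> None:
        if self._started:  # guard against double-start
            return
        # Placeholder for the real background tasks (assumption).
        self._tasks.append(asyncio.create_task(asyncio.sleep(3600)))
        self._started = True  # the assignment that was missing

    async def stop(self) -> None:
        if not self._started:  # without the flag, this always returned early
            return
        for task in self._tasks:
            task.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)
        self._tasks.clear()
        self._started = False


async def _demo() -> tuple:
    node = LifecycleSketch()
    await node.start()
    await node.start()  # second call is a no-op
    started, task_count = node._started, len(node._tasks)
    await node.stop()
    return started, node._started, task_count


demo_result = asyncio.run(_demo())
```

Here `demo_result` records the flag after `start()`, the flag after `stop()`, and the number of tasks created by the two `start()` calls.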
**Fix:** Added `self._started = True` at the end of `start()`, just before the
"HearthNode ready" log line.
### FIX-2 · Silent exception swallowing in `ChatService.send()`
**File:** [hearthnet/services/chat/service.py](hearthnet/services/chat/service.py#L112)
**Symptom:** When the `event_log` path failed (disk full, SQLite lock, etc.), the exception
was swallowed by `except Exception: pass`. Messages appeared sent but were
never persisted, and the failure was completely invisible to operators.
**Fix:** Replaced with `_log.warning(...)` so failures appear in logs while graceful
fallback to in-memory mode is preserved.
### FIX-3 · `UTC = UTC` dead re-assignments
**Files:**
- [hearthnet/services/chat/service.py](hearthnet/services/chat/service.py#L9)
- [hearthnet/services/marketplace/service.py](hearthnet/services/marketplace/service.py#L9)
**Symptom:** Copy-paste artifact: `UTC` was defined on one line and immediately
re-assigned to itself on the next. Harmless but signals unreviewed code.
**Fix:** Removed the duplicate assignments and cleaned up import ordering.
### FIX-4 · `RagService` wrote corpora to current working directory
**File:** [hearthnet/services/rag/service.py](hearthnet/services/rag/service.py#L21)
**Symptom:** `corpora_dir` defaulted to `Path(".")`. On HF Space the cwd is the
(potentially read-only) repo root. Ingest appeared to work but all corpus data was
written to an unreliable location and lost on restart.
**Fix:** Default changed to `Path.home() / ".hearthnet" / "corpora"`. Always writable
on local machines; overridden explicitly in `app.py` using `HEARTHNET_DATA_DIR`.
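One way to express the resolution order (explicit argument, then `HEARTHNET_DATA_DIR`, then the home-directory default) is sketched below. The helper name `resolve_corpora_dir` and the `corpora` subdirectory under the data dir are assumptions for illustration, not names taken from the codebase.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_corpora_dir(explicit: Optional[Path] = None) -> Path:
    """Pick a corpora directory: explicit argument, then env var, then home default."""
    if explicit is not None:
        return explicit
    data_dir = os.getenv("HEARTHNET_DATA_DIR")
    if data_dir:
        # Assumed layout: corpora live in a subdirectory of the data dir.
        return Path(data_dir) / "corpora"
    return Path.home() / ".hearthnet" / "corpora"
```

A caller such as `app.py` could pass the result straight to `RagService(corpora_dir=...)`, so the service never falls back to the cwd.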
### FIX-5 · Seed corpus was never actually ingested
**Files:** [app.py](app.py#L360), [hearthnet/services/rag/service.py](hearthnet/services/rag/service.py#L103)
**Symptom (two parts):**
1. `_seed_corpus()` sent `{"documents": [...]}` but `handle_ingest()` only read
   `inp.get("text", "")`, so every seed call indexed an empty string. The 10-document
   emergency corpus (water safety, CPR, first aid…) was silently empty.
2. `asyncio.run(_seed_corpus())` failed silently when a loop was already running
   (Gradio may have started one first), suppressed by `contextlib.suppress(Exception)`.
**Fix (part 1):** Added batch-document dispatch to `handle_ingest`: detects
`{"documents": [...]}`, re-dispatches each as a single-document call, returns
`{"batch": [...], "count": N}`.
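The batch dispatch can be sketched as a small recursive handler. This is an illustration under assumptions: the real `handle_ingest` is likely async and writes to a vector index, whereas this version is synchronous and appends to a plain list. The names `handle_ingest_sketch` and `_as_single`, and the single-document return shape, are hypothetical.

```python
def _as_single(doc) -> dict:
    # Assumed item shapes: a bare string or a dict with a "text" key.
    return {"text": doc} if isinstance(doc, str) else dict(doc)


def handle_ingest_sketch(inp: dict, index: list) -> dict:
    """Accept either {"text": ...} or {"documents": [...]}."""
    docs = inp.get("documents")
    if isinstance(docs, list):
        # Batch form: re-dispatch each item as a single-document call.
        results = [handle_ingest_sketch(_as_single(d), index) for d in docs]
        return {"batch": results, "count": len(results)}
    text = inp.get("text", "")
    if not text:
        # Reject empty input instead of silently indexing an empty string.
        return {"ok": False, "error": "empty_text"}
    index.append(text)
    return {"ok": True, "chars": len(text)}
```

Rejecting empty text explicitly is a design choice of this sketch; it would have surfaced the original bug on the first seed call.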
**Fix (part 2):** Replaced `asyncio.run()` with a dedicated daemon thread that creates
its own event loop, so there is no conflict with any running loop. A 60 s timeout keeps
it from blocking Space startup.
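A minimal sketch of the thread-with-own-loop pattern. It assumes the 60 s timeout caps the seeding coroutine itself (the plan does not say exactly where the timeout is applied); `run_in_own_loop` and the thread name are hypothetical.

```python
import asyncio
import threading


def run_in_own_loop(coro_fn, timeout: float = 60.0) -> threading.Thread:
    """Run an async function on a fresh event loop inside a daemon thread."""

    def _worker() -> None:
        loop = asyncio.new_event_loop()
        try:
            loop.run_until_complete(asyncio.wait_for(coro_fn(), timeout))
        except Exception as exc:  # includes the timeout error
            # Report the failure instead of suppressing it.
            print(f"seed failed: {exc!r}")
        finally:
            loop.close()

    thread = threading.Thread(target=_worker, daemon=True, name="seed-corpus")
    thread.start()
    return thread
```

Because the loop is private to the thread, it does not matter whether Gradio has already started a loop on the main thread.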
### FIX-6 · Sticky session memory leak in Router
**File:** [hearthnet/bus/router.py](hearthnet/bus/router.py#L54)
**Symptom:** `_sticky: dict[str, CapabilityEntry]` grew without bound. On a long-lived
community node serving thousands of sessions, this is a real memory leak.
**Fix:** Added `_MAX_STICKY_SESSIONS = 10_000` cap. Dict is insertion-ordered;
when the cap is hit, the oldest entries are evicted (LRU-by-insertion) before
adding a new one.
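The eviction rule can be sketched with a plain dict, relying on Python's guaranteed insertion order. The function name `remember_sticky` and its `cap` parameter are assumptions for illustration; the router's actual method may look different.

```python
_MAX_STICKY_SESSIONS = 10_000


def remember_sticky(sticky: dict, session_id: str, entry: object,
                    cap: int = _MAX_STICKY_SESSIONS) -> None:
    """Insert a session, evicting the oldest-inserted keys once the cap is reached."""
    if session_id in sticky:
        sticky[session_id] = entry  # update in place; position is unchanged
        return
    while len(sticky) >= cap:
        oldest = next(iter(sticky))  # dicts preserve insertion order
        del sticky[oldest]
    sticky[session_id] = entry
```

Note that this evicts by insertion time only. Turning it into a true least-recently-used policy would require re-inserting a key each time the session is read.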
### FIX-7 · `app.py` seed corpus wrote to wrong directory
**File:** [app.py](app.py#L350)
**Symptom:** `RagService` was created without `corpora_dir`, so it used the cwd
default (fixed in FIX-4). On HF Space this is the repo root, which may not be
writable and is not in `HEARTHNET_DATA_DIR`.
**Fix:** `app.py` now derives `_corpora_dir` from `HEARTHNET_DATA_DIR` (same pattern
as the event log and blob store) and passes it explicitly to `RagService`.
---
## Part 2: Outstanding Issues (prioritised)
These are open gaps that still make the demo diverge from the architecture.
In order of hackathon impact.
---
### OPEN-1 · Relay hub roster lost on Space restart (HIGH)
**File:** [hearthnet/transport/relay_hub.py](hearthnet/transport/relay_hub.py#L58)
**Problem:** `RelayHub._members` is an in-memory Python dict. HF Spaces restart their
containers regularly (zero-GPU timeout, quota rotation). Every restart evicts all
peers. A node that joined yesterday silently disappears.
**Impact:** The entire internet-mesh story breaks after the first Space restart.
Any user who joined via QR invite has to re-join manually.
**Fix approach:**
```python
# relay_hub.py: add SQLite-backed persistence
import json
import sqlite3
from pathlib import Path


class RelayHub:
    # Other existing constructor parameters are elided here.
    def __init__(self, *, db_path: Path | None = None):
        # Without db_path this falls back to an in-memory DB (no persistence).
        self._db = sqlite3.connect(str(db_path or ":memory:"), check_same_thread=False)
        self._db.execute("""
            CREATE TABLE IF NOT EXISTS members (
                node_id   TEXT PRIMARY KEY,
                data      TEXT NOT NULL,  -- JSON _Member fields
                last_seen REAL NOT NULL
            )
        """)
        self._db.commit()
        self._restore_members()  # reload on startup (method still to be written)
```
Call `_persist_member()` inside `join()`, and have `_prune_stale()` delete the pruned rows from SQLite.
Estimated effort: **3 hours**.
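The persist, delete, and restore operations named above could look roughly like this. It is a sketch under assumptions: `MemberStore` is a hypothetical helper rather than part of `RelayHub`, and member fields are treated as a plain JSON-serialisable dict because the `_Member` structure is not shown here.

```python
import json
import sqlite3
import time


class MemberStore:
    """Hypothetical helper holding the SQLite side of the relay roster."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self._db = sqlite3.connect(db_path, check_same_thread=False)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS members ("
            "node_id TEXT PRIMARY KEY, data TEXT NOT NULL, last_seen REAL NOT NULL)"
        )
        self._db.commit()

    def persist(self, node_id: str, fields: dict) -> None:
        # Upsert so a re-join refreshes the row instead of failing.
        self._db.execute(
            "INSERT OR REPLACE INTO members (node_id, data, last_seen) VALUES (?, ?, ?)",
            (node_id, json.dumps(fields), time.time()),
        )
        self._db.commit()

    def delete(self, node_id: str) -> None:
        self._db.execute("DELETE FROM members WHERE node_id = ?", (node_id,))
        self._db.commit()

    def restore(self) -> dict:
        rows = self._db.execute("SELECT node_id, data FROM members").fetchall()
        return {node_id: json.loads(data) for node_id, data in rows}
```

For the roster to survive a Space restart, the database file has to live on persistent storage, for example under `HEARTHNET_DATA_DIR`.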
---
### OPEN-2 · `node.start()` not called in `app.py`, so mDNS/HTTP transport stays silent (HIGH)
**File:** [app.py](app.py#L395)
**Problem:** `app.py` manually wires services but never calls `await node.start()`.
This means:
- mDNS and UDP peer discovery never start, so nodes can't find each other on LAN
- The FastAPI HTTP transport never starts, so remote peers can't call this node's bus
  via port 7080
- The gossip sync loop never starts, so the event log is local-only
**Why it was deferred:** HF Space runs in a ZeroGPU container without mDNS
capability, so the Space itself benefits less. But local nodes launched via
`python app.py` also miss these features.
**Fix approach:**
The Space should still avoid `node.start()` (no mDNS, public port not exposed).
Local nodes should call `node.start()` and get the full stack.
Solution: gate on whether we're on HF Space:
```python
# in app.py _build_node(), at the end (assumes `os` is imported at module level):
if not os.getenv("SPACE_HOST"):
    # Local dev: start the full networking stack
    import asyncio
    import threading

    def _start_node():
        # Dedicated loop so node tasks keep running after start() returns
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(node.start(port=7080))
        loop.run_forever()

    threading.Thread(target=_start_node, daemon=True, name="hearthnet-node").start()
```
Estimated effort: **2 hours**. Needs testing that the HTTP server doesn't conflict
with Gradio's port.
---
### OPEN-3 · Event log not injected into Marketplace/Chat at runtime (MEDIUM)
**Files:** [hearthnet/node.py](hearthnet/node.py#L344), [app.py](app.py#L391)
**Problem:** `node.start()` injects `event_log` into `RagService` (line 606).
But `ChatService` and `MarketplaceService` get their `event_log` only if it is passed at
construction, which `install_services()` doesn't do (no event_log is known yet).
On HF Space `app.py` passes it correctly, but local nodes using `install_services()`
get in-memory chat/marketplace.
**Fix:** In `node.install_services()`, store references to the service instances so
`node.start()` can inject event_log into them alongside RagService:
```python
# node.py install_services(): keep references
self._chat_service = ChatService(self.node_id, bus=self.bus)
self._market_service = MarketplaceService()
...
# node.start(): inject after EventLog is open
if self._chat_service is not None:
    self._chat_service._event_log = self._event_log
if self._market_service is not None:
    self._market_service._event_log = self._event_log
```
Estimated effort: **1 hour**.
---
### OPEN-4 · Token `exp` claim not enforced in Router (MEDIUM)
**File:** [hearthnet/bus/router.py](hearthnet/bus/router.py)
**Problem:** M16 capability tokens have an `exp` field in their JWT-style payload.
`AuthService` validates signatures but the router's `route()` method never checks
expiry. An expired token grants permanent access.
**Fix:** Add expiry check in `CapabilityEntry.is_authorized(token)` or in the bus
`handle_call()` before routing:
```python
# bus/__init__.py handle_call()
if req.token:
    exp = parse_token_exp(req.token)
    if exp and time.time() > exp:
        return {"error": "token_expired", "message": "Capability token has expired"}
```
Estimated effort: **2 hours**.
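The snippet above relies on a `parse_token_exp` helper that is not defined in this plan. One possible shape is sketched below, assuming the token follows the usual JWT layout of three dot-separated base64url segments with a JSON payload in the middle. That layout is an assumption: the actual M16 token format may differ. `is_expired` is a hypothetical convenience wrapper, and neither function verifies the signature, which remains the job of `AuthService`.

```python
import base64
import json
import time
from typing import Optional


def parse_token_exp(token: str) -> Optional[float]:
    """Return the `exp` claim of a JWT-style token, or None if absent or unparseable."""
    try:
        payload_b64 = token.split(".")[1]
        # base64url payloads usually omit padding; restore it before decoding.
        padded = payload_b64 + "=" * (-len(payload_b64) % 4)
        payload = json.loads(base64.urlsafe_b64decode(padded))
        exp = payload.get("exp")
        return float(exp) if exp is not None else None
    except (IndexError, ValueError, TypeError, AttributeError):
        return None


def is_expired(token: str, now: Optional[float] = None) -> bool:
    exp = parse_token_exp(token)
    current = time.time() if now is None else now
    return exp is not None and current > exp
```

In this sketch a token without a readable `exp` is treated as not expired. A stricter policy would reject such tokens outright, which is probably the safer default for capability tokens.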
---
### OPEN-5 · Live mesh topology not auto-refreshing in Mesh tab (LOW)
**File:** [hearthnet/ui/tabs/mesh.py](hearthnet/ui/tabs/mesh.py)
**Problem:** `WebSocketPubSub` (X06) publishes `peer.discovered` and
`emergency.mode.changed` events. The Mesh tab renders a static SVG that only
updates when the user manually clicks "Refresh". Judges see a static graph even
when peers join live.
**Fix:** Use Gradio's `gr.Timer` (≥4.x) or a polling interval to auto-refresh the SVG
every 5 seconds:
```python
# mesh.py
timer = gr.Timer(value=5)
timer.tick(fn=refresh_topology, outputs=topology_svg)
```
Estimated effort: **30 minutes**.
---
### OPEN-6 · Peer capability matrix missing from Mesh tab (LOW)
**File:** [hearthnet/ui/tabs/mesh.py](hearthnet/ui/tabs/mesh.py)
**Problem:** The Mesh tab shows peer nodes as SVG circles but gives no indication of
what each peer can do. A judge can't see that Node B has `ocr.extract` but not
`llm.chat`.
**Fix:** Add a `gr.DataFrame` below the topology SVG:
```python
def _capability_matrix(bus) -> list[list]:
    rows = []
    for peer in bus.registry.all_remote():
        rows.append([peer.node_id[:16], peer.descriptor.name, "✓"])
    return rows

cap_df = gr.DataFrame(headers=["Node", "Capability", "Status"])
refresh_btn.click(fn=lambda: _capability_matrix(bus), outputs=cap_df)
```
Estimated effort: **1 hour**.
---
### OPEN-7 · Routing trace is raw text, not visual (LOW)
**File:** [hearthnet/ui/tabs/ask.py](hearthnet/ui/tabs/ask.py)
**Problem:** The `_routed_via` field is shown as plain text. The README shows a flow
diagram. Judges get `"local-abc123"` instead of `"🏠 Local · score 0.94 · 23 ms"`.
**Fix:** Parse `_routed_via` and render a formatted badge:
```python
def _format_route(routed_via: str, ms: int) -> str:
    if routed_via.startswith("local"):
        return f"🏠 **Local** · {ms} ms"
    return f"🌐 **Remote** `{routed_via[:16]}` · {ms} ms"
```
Estimated effort: **30 minutes**.
---
## Part 3: Prize Actions (deadline June 15)
| # | Action | Effort | Prize target |
|---|--------|--------|--------------|
| P1 | Record 2–4 min demo video (OBS/Loom) | 2 h | All prizes (mandatory) |
| P2 | Post on X @zX14_7 with Space link + video | 15 min | Best Demo badge |
| P3 | Set `NVIDIA_API_KEY` in HF Space secrets | 5 min | Nemotron RTX 5080 |
| P4 | Deploy `app_nemotron.py` as second HF Space | 30 min | NVIDIA + Off Brand |
| P5 | Set `MINICPM_URL` or swap default model to MiniCPM3-4B | 1 h | OpenBMB $2,500 |
| P6 | `modal deploy scripts/modal_deploy.py` + set secret | 1 h | Modal $10k credits |
| P7 | GitHub Codex commits in mirrored repo | 2 h | OpenAI $5,000 |
**P1 demo video script** (exact flow judges want to see):
1. Open HF Space → all 8 tabs visible
2. Ask tab: type "What do I do if water is cut off?" → show RAG answer + routing trace
3. Toggle Agent Mode → ask multi-step question → show Thought/Tool/Observation steps
4. Mesh tab: show live topology SVG (even single node is fine)
5. Chat tab: send a message to self / another node
6. Emergency tab: click "Check Connectivity" → show probe results
7. Settings tab: generate invite QR code
8. 10-second clip of `app_nemotron.py` doing structured extraction
---
## Part 4: Test Additions Needed
| Test | What it covers | File to create |
|------|----------------|----------------|
| `test_node_started_flag` | `node.start()` sets `_started=True`; `node.stop()` resets it and cancels tasks | `tests/test_node_lifecycle.py` |
| `test_rag_documents_batch` | `handle_ingest` with `{"documents": [...]}` indexes all docs | `tests/test_rag_ingest_batch.py` |
| `test_sticky_session_eviction` | Router evicts oldest sessions at `_MAX_STICKY_SESSIONS` cap | `tests/test_bus_router_memory.py` |
| `test_chat_service_log_on_error` | Exception in event_log path is logged, not swallowed | `tests/test_chat_service.py` |
| `test_corpora_dir_default` | `RagService()` uses `~/.hearthnet/corpora`, not `cwd` | `tests/test_rag_service_defaults.py` |
| `test_relay_hub_sqlite` | Relay hub persists member on join; restores on init | `tests/test_relay_persistence.py` |
---
## Part 5: Deployment Checklist (HF Space)
```
[ ] NVIDIA_API_KEY secret set → Nemotron backend auto-activates
[ ] MODAL_ENDPOINT secret set → Modal backend auto-activates
[ ] MINICPM_URL secret set → MiniCPM backend auto-activates
[ ] HEARTHNET_DATA_DIR set → persistent data survives Space restarts
    recommended: /data/hearthnet (HF Spaces /data is persistent)
[ ] Confirm Space runs on ZeroGPU (not CPU-only)
[ ] Demo video URL in README
[ ] Social post URL in README
```
---
## Part 6: Local Node Checklist (after deadline)
```
[ ] pip install hearthnet → publish to PyPI (pyproject.toml already correct)
[ ] node.start() for local mode (OPEN-2)
[ ] ChatService / MarketplaceService event_log injection (OPEN-3)
[ ] Relay hub SQLite persistence (OPEN-1)
[ ] Token expiry enforcement (OPEN-4)
[ ] Auto-refresh Mesh topology (OPEN-5)
[ ] Capability matrix in Mesh tab (OPEN-6)
[ ] Routing trace badge in Ask tab (OPEN-7)
[ ] E2E encryption on by default for chat (M23 wired but inactive)
[ ] Real LoRa hardware integration (M29 stub → serial port)
```
---
## Summary Table
| Item | Status | Impact |
|------|--------|--------|
| FIX-1 `_started` flag | ✅ Done | stop() now works; no double-start |
| FIX-2 chat exception swallowing | ✅ Done | Failures visible in logs |
| FIX-3 UTC=UTC duplicates | ✅ Done | Code quality |
| FIX-4 corpora_dir default | ✅ Done | Corpus writes to correct location |
| FIX-5 seed corpus not ingested | ✅ Done | Emergency knowledge base works |
| FIX-6 sticky session leak | ✅ Done | Long-lived nodes safe |
| FIX-7 app.py corpora_dir | ✅ Done | HF Space corpus in data dir |
| OPEN-1 relay hub persistence | ✅ Done | SQLite roster survives restart |
| OPEN-2 node.start() in app.py | ✅ Done | Local mDNS + HTTP transport active |
| OPEN-3 event_log injection | ✅ Done | Chat/Marketplace persist locally |
| OPEN-4 token expiry | ✅ Done | exp claim checked in handle_call() |
| OPEN-5 auto-refresh topology | ✅ Done | Mesh tab refreshes every 10 s |
| OPEN-6 capability matrix | ✅ Done | Already in get_mesh() JSON output |
| OPEN-7 routing trace badge | ✅ Done | 🏠/🌐 badge replaces raw JSON |
| Doc folder ingestion | ✅ Done | docs/guides/ + assets/initial_docs/ |
| P1 demo video | ⬜ CRITICAL | All prizes blocked without it |
| P2 social post | ⬜ CRITICAL | Best Demo badge |
| P3 NVIDIA_API_KEY | ⬜ HIGH | RTX 5080 prize |