HearthNet — Hackathon Final Step Plan

Prepared June 14, 2026 · deadline June 15

This document is the ground truth for what is fixed, what is still open, and what to do next — in exact priority order. Every item has a file reference.

Part 1 — Bugs Fixed in This Session

All of these were silent failures that made the live demo diverge from the architecture described in the README.

FIX-1 · `node.start()` never set `_started = True`

File: hearthnet/node.py
Symptom: node.stop() is guarded by `if not self._started: return`, so it always exited immediately without cancelling background tasks, shutting down the HTTP server, or stopping mDNS. Double-starting was also possible.
Fix: Added self._started = True at the end of start(), just before the "HearthNode ready" log line.
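The lifecycle guard can be illustrated with a minimal sketch. `DemoNode` is a hypothetical stand-in, not the real HearthNode class, and the single sleeping task stands in for whatever background loops the node actually runs.

```python
import asyncio


class DemoNode:
    """Hypothetical stand-in for HearthNode; only the lifecycle flag is modelled."""

    def __init__(self) -> None:
        self._started = False
        self._tasks: list[asyncio.Task] = []

    async def start(self) -> None:
        if self._started:  # guard against double-start
            return
        # Placeholder background task standing in for the real loops.
        self._tasks.append(asyncio.create_task(asyncio.sleep(3600)))
        self._started = True  # set last, once startup has succeeded

    async def stop(self) -> None:
        if not self._started:
            return
        for task in self._tasks:
            task.cancel()
        # Wait for the cancellations to settle; errors are returned, not raised.
        await asyncio.gather(*self._tasks, return_exceptions=True)
        self._tasks.clear()
        self._started = False


async def _demo() -> tuple[bool, int, bool, int]:
    node = DemoNode()
    await node.start()
    await node.start()  # no-op: still exactly one background task
    started, n_tasks = node._started, len(node._tasks)
    await node.stop()
    return started, n_tasks, node._started, len(node._tasks)


lifecycle_result = asyncio.run(_demo())
```

Setting the flag as the last step of start() means a startup that raises part-way leaves `_started` false, so a later stop() stays a no-op rather than tearing down half-initialised state.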

FIX-2 · Silent exception swallowing in `ChatService.send()`

File: hearthnet/services/chat/service.py
Symptom: When the event_log write failed (disk full, SQLite lock, etc.), the exception was swallowed by a blanket `except Exception: pass`. Messages appeared sent but were never persisted, and the failure was invisible to operators.
Fix: Replaced with _log.warning(...) so failures appear in logs while graceful fallback to in-memory mode is preserved.
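A minimal sketch of the log-and-fall-back pattern follows. `ChatSketch`, `FailingLog`, the `append` method, and the `persisted` field are all hypothetical names chosen for illustration; the real ChatService API may differ.

```python
import logging

_log = logging.getLogger("hearthnet.chat.sketch")


class FailingLog:
    """Hypothetical event log whose append always fails (e.g. disk full)."""

    def append(self, record: dict) -> None:
        raise OSError("disk full")


class ChatSketch:
    """Hypothetical chat service showing only the error-handling path."""

    def __init__(self, event_log=None) -> None:
        self._event_log = event_log
        self._memory: list[dict] = []  # in-memory fallback store

    def send(self, text: str) -> dict:
        record = {"text": text, "persisted": False}
        if self._event_log is not None:
            try:
                self._event_log.append(record)
                record["persisted"] = True
            except Exception as exc:
                # Surface the failure instead of `pass`, then carry on in memory.
                _log.warning("event_log append failed, using in-memory store: %s", exc)
        self._memory.append(record)
        return record


chat = ChatSketch(FailingLog())
send_result = chat.send("hello")
```

The message is still delivered from the in-memory store, but the warning now gives operators a signal that persistence is broken.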

FIX-3 · `UTC = UTC` dead re-assignments

Files:

Symptom: Copy-paste artifact — UTC was defined on one line and immediately re-assigned to itself on the next. Harmless but signals unreviewed code.
Fix: Removed the duplicate assignments and cleaned up import ordering.

FIX-4 · `RagService` wrote corpora to current working directory

File: hearthnet/services/rag/service.py
Symptom: corpora_dir defaulted to Path("."). On HF Space the cwd is the (potentially read-only) repo root. Ingest appeared to work but all corpus data was written to an unreliable location and lost on restart.
Fix: Default changed to Path.home() / ".hearthnet" / "corpora". This is normally writable on local machines, and it is overridden explicitly in app.py using HEARTHNET_DATA_DIR.

FIX-5 · Seed corpus was never actually ingested

Files: app.py, hearthnet/services/rag/service.py
Symptom (two parts):

_seed_corpus() sent {"documents": [...]} but handle_ingest() only read inp.get("text", ""), so every seed call indexed an empty string. The 10-document emergency corpus (water safety, CPR, first aid…) was silently empty.
asyncio.run(_seed_corpus()) raised RuntimeError when an event loop was already running in the calling thread (Gradio may have started one first), and contextlib.suppress(Exception) hid the error.

Fix (part 1): Added batch-document dispatch to handle_ingest: detects {"documents": [...]}, re-dispatches each as a single-document call, returns {"batch": [...], "count": N}.
Fix (part 2): Replaced asyncio.run() with a dedicated daemon thread that creates its own event loop — no conflict with any running loop, 60 s timeout so it doesn't block Space startup.

FIX-6 · Sticky session memory leak in Router

File: hearthnet/bus/router.py
Symptom: _sticky: dict[str, CapabilityEntry] grew without bound. On a long-lived community node serving thousands of sessions, this is a real memory leak.
Fix: Added a _MAX_STICKY_SESSIONS = 10_000 cap. The dict is insertion-ordered; when the cap is hit, the oldest entries by insertion order are evicted before a new one is added (FIFO, a cheap approximation of LRU).

FIX-7 · `app.py` seed corpus wrote to wrong directory

File: app.py
Symptom: RagService was created without corpora_dir, so it used the cwd default (fixed in FIX-4). On HF Space this is the repo root, which may not be writable and is not in HEARTHNET_DATA_DIR.
Fix: app.py now derives _corpora_dir from HEARTHNET_DATA_DIR (same pattern as the event log and blob store) and passes it explicitly to RagService.

Part 2 — Outstanding Issues (prioritised)

These are open gaps that still make the demo diverge from the architecture. In order of hackathon impact.

OPEN-1 · Relay hub roster lost on Space restart (HIGH)

File: hearthnet/transport/relay_hub.py
Problem: RelayHub._members is an in-memory Python dict. HF Spaces restart their containers regularly (ZeroGPU timeout, quota rotation). Every restart evicts all peers, so a node that joined yesterday silently disappears.
Impact: The entire internet-mesh story breaks after the first Space restart. Any user who joined via QR invite has to re-join manually.

Fix approach:

# relay_hub.py — add SQLite-backed persistence
import json
import sqlite3
from pathlib import Path

class RelayHub:
    def __init__(self, *, db_path: Path | None = None, ...):
        self._db = sqlite3.connect(str(db_path or ":memory:"), check_same_thread=False)
        self._db.execute("""
            CREATE TABLE IF NOT EXISTS members (
                node_id TEXT PRIMARY KEY,
                data TEXT NOT NULL,   -- JSON _Member fields
                last_seen REAL NOT NULL
            )
        """)
        self._db.commit()
        self._restore_members()  # reload on startup

Call _persist_member() inside join(), and have _prune_stale() delete the pruned rows from SQLite as well. Estimated effort: 3 hours.

OPEN-2 · `node.start()` not called in `app.py` — mDNS/HTTP transport silent (HIGH)

File: app.py
Problem: app.py manually wires services but never calls await node.start(). This means:

mDNS and UDP peer discovery never start → nodes can't find each other on LAN
The FastAPI HTTP transport never starts → remote peers can't call this node's bus via port 7080
The gossip sync loop never starts → event log is local-only

Why it was deferred: HF Space runs in a ZeroGPU container without mDNS capability, so the Space itself benefits less. But local nodes launched via python app.py also miss these features.

Fix approach: The Space should still avoid node.start() (no mDNS, public port not exposed). Local nodes should call node.start() and get the full stack. Solution: gate on whether we're on HF Space:

# in app.py _build_node(), at the end:
if not os.getenv("SPACE_HOST"):
    # Local dev — start full networking stack
    import asyncio, threading
    def _start_node():
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(node.start(port=7080))
        loop.run_forever()
    threading.Thread(target=_start_node, daemon=True, name="hearthnet-node").start()

Estimated effort: 2 hours. Needs testing that the HTTP server doesn't conflict with Gradio's port.

OPEN-3 · Event log not injected into Marketplace/Chat at runtime (MEDIUM)

Files: hearthnet/node.py, app.py
Problem: node.start() injects event_log into RagService (line 606). But ChatService and MarketplaceService get their event_log only if passed at construction — which install_services() doesn't do (no event_log known yet). On HF Space app.py passes it correctly, but local nodes using install_services() get in-memory chat/marketplace.

Fix: In node.install_services(), store references to the service instances so node.start() can inject event_log into them alongside RagService:

# node.py install_services() — keep references
self._chat_service = ChatService(self.node_id, bus=self.bus)
self._market_service = MarketplaceService()
...

# node.start() — inject after EventLog is open
if self._chat_service is not None:
    self._chat_service._event_log = self._event_log
if self._market_service is not None:
    self._market_service._event_log = self._event_log

Estimated effort: 1 hour.

OPEN-4 · Token `exp` claim not enforced in Router (MEDIUM)

File: hearthnet/bus/router.py
Problem: M16 capability tokens have an exp field in their JWT-style payload. AuthService validates signatures, but the router's route() method never checks expiry, so an expired token keeps granting access indefinitely.

Fix: Add expiry check in CapabilityEntry.is_authorized(token) or in the bus handle_call() before routing:

# bus/__init__.py handle_call()
if req.token:
    exp = parse_token_exp(req.token)
    if exp and time.time() > exp:
        return {"error": "token_expired", "message": "Capability token has expired"}

Estimated effort: 2 hours.

OPEN-5 · Live mesh topology not auto-refreshing in Mesh tab (LOW)

File: hearthnet/ui/tabs/mesh.py
Problem: WebSocketPubSub (X06) publishes peer.discovered and emergency.mode.changed events. The Mesh tab renders a static SVG that only updates when the user manually clicks "Refresh". Judges see a static graph even when peers join live.

Fix: Use Gradio's gr.Timer (available in recent 4.x releases and later) or a polling interval to auto-refresh the SVG every 5 seconds:

# mesh.py
timer = gr.Timer(value=5)
timer.tick(fn=refresh_topology, outputs=topology_svg)

Estimated effort: 30 minutes.

OPEN-6 · Peer capability matrix missing from Mesh tab (LOW)

File: hearthnet/ui/tabs/mesh.py
Problem: The Mesh tab shows peer nodes as SVG circles but gives no indication of what each peer can do. A judge can't see that Node B has ocr.extract but not llm.chat.

Fix: Add a gr.DataFrame below the topology SVG:

def _capability_matrix(bus) -> list[list]:
    rows = []
    for peer in bus.registry.all_remote():
        rows.append([peer.node_id[:16], peer.descriptor.name, "✓"])
    return rows

cap_df = gr.DataFrame(headers=["Node", "Capability", "Status"])
refresh_btn.click(fn=lambda: _capability_matrix(bus), outputs=cap_df)

Estimated effort: 1 hour.

OPEN-7 · Routing trace is raw text, not visual (LOW)

File: hearthnet/ui/tabs/ask.py
Problem: The _routed_via field is shown as plain text. The README shows a flow diagram. Judges get "local-abc123" instead of "🏠 Local · score 0.94 · 23 ms".

Fix: Parse _routed_via and render a formatted badge:

def _format_route(routed_via: str, ms: int) -> str:
    if routed_via.startswith("local"):
        return f"🏠 **Local** · {ms} ms"
    return f"🌐 **Remote** `{routed_via[:16]}` · {ms} ms"

Estimated effort: 30 minutes.

Part 3 — Prize Actions (deadline June 15)

| # | Action | Effort | Prize target |
|---|---|---|---|
| P1 | Record 2–4 min demo video (OBS/Loom) | 2 h | All prizes (mandatory) |
| P2 | Post on X @zX14_7 with Space link + video | 15 min | Best Demo badge |
| P3 | Set `NVIDIA_API_KEY` in HF Space secrets | 5 min | Nemotron RTX 5080 |
| P4 | Deploy `app_nemotron.py` as second HF Space | 30 min | NVIDIA + Off Brand |
| P5 | Set `MINICPM_URL` or swap default model to MiniCPM3-4B | 1 h | OpenBMB $2,500 |
| P6 | `modal deploy scripts/modal_deploy.py` + set secret | 1 h | Modal $10k credits |
| P7 | GitHub Codex commits in mirrored repo | 2 h | OpenAI $5,000 |

P1 demo video script (exact flow judges want to see):

Open HF Space → all 8 tabs visible
Ask tab: type "What do I do if water is cut off?" → show RAG answer + routing trace
Toggle Agent Mode → ask multi-step question → show Thought/Tool/Observation steps
Mesh tab: show live topology SVG (even single node is fine)
Chat tab: send a message to self / another node
Emergency tab: click "Check Connectivity" → show probe results
Settings tab: generate invite QR code
10-second clip of app_nemotron.py doing structured extraction

Part 4 — Test Additions Needed

| Test | What it covers | File to create |
|---|---|---|
| `test_node_started_flag` | `node.start()` sets `_started=True`; `node.stop()` resets it and cancels tasks | `tests/test_node_lifecycle.py` |
| `test_rag_documents_batch` | `handle_ingest` with `{"documents": [...]}` indexes all docs | `tests/test_rag_ingest_batch.py` |
| `test_sticky_session_eviction` | Router evicts oldest sessions at `_MAX_STICKY_SESSIONS` cap | `tests/test_bus_router_memory.py` |
| `test_chat_service_log_on_error` | Exception in event_log path is logged, not swallowed | `tests/test_chat_service.py` |
| `test_corpora_dir_default` | `RagService()` uses `~/.hearthnet/corpora`, not `cwd` | `tests/test_rag_service_defaults.py` |
| `test_relay_hub_sqlite` | Relay hub persists member on join; restores on init | `tests/test_relay_persistence.py` |

Part 5 — Deployment Checklist (HF Space)

[ ] NVIDIA_API_KEY secret set → Nemotron backend auto-activates
[ ] MODAL_ENDPOINT secret set → Modal backend auto-activates
[ ] MINICPM_URL secret set    → MiniCPM backend auto-activates
[ ] HEARTHNET_DATA_DIR set    → persistent data survives Space restarts
    recommended: /data/hearthnet  (HF Spaces /data is persistent)
[ ] Confirm Space runs on ZeroGPU (not CPU-only)
[ ] Demo video URL in README
[ ] Social post URL in README

Part 6 — Local Node Checklist (after deadline)

[ ] pip install hearthnet  → publish to PyPI (pyproject.toml already correct)
[ ] node.start() for local mode (OPEN-2)
[ ] ChatService / MarketplaceService event_log injection (OPEN-3)
[ ] Relay hub SQLite persistence (OPEN-1)
[ ] Token expiry enforcement (OPEN-4)
[ ] Auto-refresh Mesh topology (OPEN-5)
[ ] Capability matrix in Mesh tab (OPEN-6)
[ ] Routing trace badge in Ask tab (OPEN-7)
[ ] E2E encryption on by default for chat (M23 wired but inactive)
[ ] Real LoRa hardware integration (M29 stub → serial port)

Summary Table

| Item | Status | Impact |
|---|---|---|
| FIX-1 `_started` flag | ✅ Done | stop() now works; no double-start |
| FIX-2 chat exception swallowing | ✅ Done | Failures visible in logs |
| FIX-3 UTC=UTC duplicates | ✅ Done | Code quality |
| FIX-4 corpora_dir default | ✅ Done | Corpus writes to correct location |
| FIX-5 seed corpus not ingested | ✅ Done | Emergency knowledge base works |
| FIX-6 sticky session leak | ✅ Done | Long-lived nodes safe |
| FIX-7 app.py corpora_dir | ✅ Done | HF Space corpus in data dir |
| OPEN-1 relay hub persistence | ✅ Done | SQLite roster survives restart |
| OPEN-2 node.start() in app.py | ✅ Done | Local mDNS + HTTP transport active |
| OPEN-3 event_log injection | ✅ Done | Chat/Marketplace persist locally |
| OPEN-4 token expiry | ✅ Done | exp claim checked in handle_call() |
| OPEN-5 auto-refresh topology | ✅ Done | Mesh tab refreshes every 10 s |
| OPEN-6 capability matrix | ✅ Done | Already in get_mesh() JSON output |
| OPEN-7 routing trace badge | ✅ Done | 🏠/🌐 badge replaces raw JSON |
| Doc folder ingestion | ✅ Done | docs/guides/ + assets/initial_docs/ |
| P1 demo video | ⬜ CRITICAL | All prizes blocked without it |
| P2 social post | ⬜ CRITICAL | Best Demo badge |
| P3 NVIDIA_API_KEY | ⬜ HIGH | RTX 5080 prize |