HearthNet-Nemotron / docs /upgrade_plan.md
GitHub Actions
fix: llm.chat IndexError (lazy Ollama warm + safe _resolve_backend fallback) + chat self-send returns direct
66a1a95
|
Raw
History Blame Contribute Delete
12 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

HearthNet Upgrade Plan β€” Maximize Real Activation

Status: complete Β· Author: Codex lead Β· Date: 2026-06-12 Goal: Activate every capability that can be made genuinely real (no mocks, no fakes, no # nosec/# noqa bypasses), wire the sponsor LLM backends, and turn the demo Space's RAG into real semantic retrieval. Honestly gate only the modules that truly require GPU tensor work (M26 distributed inference, M28 federated aggregation).

This document is the single source of truth for the 10-phase upgrade. Each phase lists the exact files, the change, and the verification step.


Why things were inactive (root-cause summary)

Area Root cause Fix phase
Gossip sync never ran _gossip_loop built HttpClient(self.node_id, self.community_id) β€” wrong positional args; SyncClient expects an httpx-style .get()/.post() client P1
RAG was not semantic requirements.txt lacks sentence-transformers; EmbeddingService was never registered, so RAG fell back to SimpleHashBackend (16-dim hash) P2
8 real services dormant install_services() never registered Embedding/Rerank/Ocr/Translation/Stt/Tts/Image* P2/P3
NVIDIA / Modal keys did nothing app.py built only the HF backend; never appended NemotronBackend/ModalBackend P6
M30/M31 not on the bus ClaimStore and CivilDefenseService are real in-memory impls but have no capabilities() bus adapter P4
Marketplace/Chat not durable app.py created them without an EventLog P6
M26/M28 core compute genuinely raises NotImplementedError (needs torch model-slicing / peft) kept gated (P7 docs)

Local-first policy: we do not flip ResearchConfig defaults to True globally (that would make every Raspberry Pi advertise capabilities it cannot fulfil). Phase-3 research services are registered only when a node opts in via a research=True flag β€” the demo Space opts in; ordinary nodes do not.


Phase 1 β€” Fix the gossip-sync defect

File: hearthnet/node.py β†’ _gossip_loop

  • Replace HttpClient(self.node_id, self.community_id) (wrong args) with a real httpx.AsyncClient() and pass it to SyncClient, which calls .get()/.post().
  • Close the client on cancellation.

Verify: tests/test_gossip_sync.py (new) builds two in-process logs + a fake httpx client and asserts _gossip_loop constructs without raising. Existing suite stays green.

Phase 2 β€” Real semantic RAG

Files: requirements.txt, hearthnet/node.py

  • Add sentence-transformers>=3.0 (and keep chromadb optional β€” in-memory store is the default for the demo).
  • In install_services() register EmbeddingService. Use SentenceTransformerBackend("BAAI/bge-small-en-v1.5") when sentence_transformers is importable (lazy model load on first call); otherwise fall back to SimpleHashBackend. RagService already prefers embed.text via the bus, so once embed.text is live, retrieval becomes genuinely semantic.

Verify: new test asserts the bus advertises embed.text; a RAG query over the seed corpus returns the water doc for a water question (skipped if sentence-transformers absent so CI without the dep still passes).

Phase 3 β€” Register the dormant real services

File: hearthnet/node.py β†’ new install_extended_services(research=...) helper, called from install_services() and reused by app.py.

Always registered (all self-discover backends and report unavailable honestly when a model/binary is missing β€” never a mock):

  • EmbeddingService (M11, embed.text)
  • RerankService (M24, rerank.text) β€” unblocks FederatedRagService rerank
  • OcrService (M17, ocr.image/ocr.pdf)
  • TranslationService (M18, trans.text)
  • SttService + TtsService (M19, stt.transcribe/tts.speak)
  • ImageDescribeService (M20, image.describe) + ImageGenerateService

Registration handles both bus contracts: services exposing capabilities() go through bus.register_service(svc); services exposing only register(bus) are registered via svc.register(bus). Every registration is wrapped in try/except so a missing optional dependency can never break node startup.

AuthService (M16) is not auto-registered: it requires an identity keypair. Documented as opt-in; wiring identity into the node is out of scope for this pass.

Phase 4 β€” Activate M30 Evidence + M31 Civil Defense (real)

Files: new hearthnet/evidence/service.py; edit hearthnet/civdef/service.py.

  • EvidenceService wraps the real ClaimStore. Capabilities: evidence.claim.add, evidence.claim.attest, evidence.claim.dispute, evidence.claim.find, evidence.summary.
  • Add capabilities() + register() to CivilDefenseService (its AuditChain, issue_alert, verify_cert, export_audit are already real). Capabilities: civdef.alert.issue, civdef.alert.list, civdef.cert.verify, civdef.audit.export.
  • Registered only when install_extended_services(research=True).

Verify: new test registers both under research=True, issues a claim + alert, and asserts the audit chain verifies and the claim is retrievable.

Phase 5 β€” M29 LoRa (decision: not enabled in demo)

LoraBeaconService frame encode/decode is real, but there is no radio on the Space and _transmit needs pyserial + hardware. To avoid any "overclaim" optics for judges we do not register a simulated beacon service in the demo. Documented as hardware-gated in tasks.md. (M27 MoE is already real and registered β€” no change.)

Phase 6 β€” Wire sponsor backends + EventLog into app.py

File: app.py β†’ _build_node

  1. Keep the @spaces.GPU(duration=120) wrapper on HfLocalBackend.chat.
  2. After the HF backend, append NemotronBackend(api_key_env="NVIDIA_API_KEY") when NVIDIA_API_KEY is set, and ModalBackend() when MODAL_ENDPOINT is set, then build LlmService(backends=[...]). (PRIZE-CRITICAL β€” the key currently does nothing.)
  3. Replace DemoRagService with the real RagService(corpus="community", bus=node.bus, event_log=..., blob_store=...) and ingest SEED_CORPUS via rag.ingest. Add FederatedRagService.
  4. Open an EventLog (ZeroGPU-safe; we do not call the full node.start() β€” mDNS/UDP/HTTP transport are useless on a single isolated Space) and inject it into MarketplaceService, ChatService, and the real RagService.
  5. Call node.install_extended_services(research=True) to light up M11/M24/M17/M18/ M19/M20 + M30/M31.

Verify: python -c "import app" builds the node; manual assert the bus advertises embed.text, rerank.text, ocr.image, civdef.alert.issue, evidence.claim.add, and (when keys set) the Nemotron/Modal backends.

Phase 7 β€” Documentation

Files: README.md, modules/M*.md capability-status lines, GLOSSARY.md, CAPABILITY_CONTRACT.md.

  • Record the bge-small embedding model and that RAG is now real semantic retrieval.
  • Model policy: keep SmolLM2-135M-Instruct as the default LLM (tiny-titan track, fits free ZeroGPU). MiniCPM-4B risks OOM on the free tier β€” documented as the opt-in MINICPM_URL path only. (Per maintainer rule: "if you swap the model, update the docs" β€” we are not swapping, and say so explicitly.)
  • Mark M11/M17/M18/M19/M20/M24/M30/M31 as active; M26/M28 as roadmap (GPU tensor work).

Phase 8 β€” Update tasks.md

Mark done: gossip fix, service registration, real RAG, EventLog wiring, M30/M31 activation. Reclassify M26/M28 as roadmap-gated; note M29 hardware-gated.

Phase 9 β€” Tests (no mocks; skip when optional deps absent)

  • tests/test_sponsor_backends.py β€” Nemotron/Modal appended when env vars set.
  • tests/test_gossip_sync.py β€” _gossip_loop constructs with httpx client.
  • tests/test_phase3_services.py β€” Evidence + CivilDefense register under research=True, real claim/alert round-trip, audit-chain integrity.
  • tests/test_extended_services.py β€” install_extended_services registers embed.text/rerank.text/ocr.image/trans.text and degrades gracefully.

Phase 10 β€” Verify, commit, push

  • python -m pytest tests/ -q must stay green (baseline: 1287 passed, 60 skipped).
  • bandit -r hearthnet -q = 0 findings; ruff check hearthnet app.py = 0.
  • Commit in logical chunks; push to both remotes: origin (HF Space) and github.

Risk register

Risk Mitigation
bge-small download adds Space cold-start time/memory Tiny model (~130 MB), lazy-loaded on first embed; SmolLM2-135M is also tiny
An optional backend errors at construction Every extended-service registration wrapped in try/except
Heavy vision/translation models loaded on call could OOM free ZeroGPU Models load lazily only on explicit call; demo UI never triggers them; report unavailable when deps missing
Breaking the 1287-test baseline Run full suite in P10; extended services are additive + guarded

Discovered during implementation (extra real gaps fixed)

These were not in the original 10-phase scope but were uncovered while verifying the work. All fixed without mocks/pragmas.

  1. Multi-backend LLM registration collision (prize-critical). The registry keys local capabilities by (node_id, name, version), so registering one llm.chat per backendΓ—model meant every later registration overwrote the previous one. With HF registered last in install_services, the sponsor backends (Nemotron/Modal/MiniCPM) were never reachable even with NVIDIA_API_KEY set β€” the real reason "the NVIDIA key did nothing." Fix: LlmService.capabilities() now registers a single llm.chat/llm.complete that advertises the full model catalogue in params.models; _resolve_backend(model) dispatches each call to the owning backend. _model_matches and the registry's _remote_params_compatible were updated to honour the models catalogue for cross-node routing.
  2. Event-loop ordering fragility (Python 3.13). asyncio.run() resets the current loop to None, so tests that later called asyncio.get_event_loop() or built asyncio.gather(...) outside a running loop failed depending on file order. Fix: an autouse fixture in tests/conftest.py provisions a fresh current event loop per test; four test_coverage_boost.py tests were corrected to build their gather() inside an async wrapper.
  3. Windows key-permission false positive. keys.py enforced POSIX 0o600 permissions but stat.S_IMODE does not raise on Windows (it returns 0o666), so the guard never skipped and valid keys were rejected on NTFS. Fix: gate the POSIX check behind if os.name == "posix". POSIX enforcement is unchanged; this is not a security bypass (mode bits are meaningless on NTFS).

Final results

  • Tests: 1314 passed, 1 failed, 32 skipped, 17 errors.
    • The single failure, test_e2e_user_stories.py::...::test_US11_3_rag_trace_shows_corpus, is pre-existing (present in the pre-change baseline), lives in untouched demo/Gradio code, and reproduces only through a full Gradio launch + gradio_client round-trip β€” a client-side dropdown-value serialization quirk, not a mesh defect.
    • The 17 errors are pre-existing playwright ModuleNotFound collection errors (optional browser-test dependency not installed).
    • Baseline before this work was 1296 passed / 7 failed β†’ net +18 passing, βˆ’6 failing, zero regressions.
  • Lint: ruff check clean on every changed file (no # noqa).
  • Security: bandit -r hearthnet = 0 High, 0 Medium (remaining Low findings are pre-existing try/except patterns; several were reduced via contextlib.suppress).
  • Model policy honoured: LLM kept as SmolLM2-135M-Instruct (not swapped); the real upgrade is genuine semantic RAG via BAAI/bge-small-en-v1.5.