# M09: Emergency Mode Detector
**Spec version:** v1.0
**Depends on:** M03 (bus, to deregister internet-dependent capabilities), X04 (config), X03 (observability), `httpx`, `socket`
**Depended on by:** M08 (UI shows banner), M04 (re-registers internet backends on restore), M02 (increases discovery cadence)
---
## 1. Responsibility
Detect whether the node has working internet access. Publish state transitions locally. Cause the bus to deregister/re-register internet-dependent capabilities and let other modules react.
Out of scope:
- VPN / overlay status
- Per-service connectivity checks
- Cellular signal strength
---
## 2. File layout
```
hearthnet/emergency/
├── __init__.py
├── detector.py    # Detector: probe loop, state machine
└── state.py       # EmergencyState dataclass + StateBus
```
---
## 3. Public API
### 3.1 `state.py`
```python
# hearthnet/emergency/state.py
from dataclasses import dataclass
from typing import AsyncIterator, Literal

Mode = Literal["online", "degraded", "offline"]


@dataclass(frozen=True)
class EmergencyState:
    mode: Mode
    since: str  # RFC 3339
    last_probe: str
    probe_results: dict[str, bool]  # target → success


class StateBus:
    """In-process pubsub for state changes. UI and other modules subscribe."""

    def __init__(self): ...
    def current(self) -> EmergencyState: ...
    async def subscribe(self) -> AsyncIterator[EmergencyState]: ...
    def _emit(self, state: EmergencyState) -> None: ...  # internal
```
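One way the `StateBus` stubs could be filled in is sketched below. This is a minimal illustration, not the project's implementation: it assumes one `asyncio.Queue` per subscriber and, unlike the spec's zero-argument `__init__`, takes a hypothetical `initial` state so that `current()` always has something to return.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Literal

Mode = Literal["online", "degraded", "offline"]


@dataclass(frozen=True)
class EmergencyState:
    mode: Mode
    since: str
    last_probe: str
    probe_results: dict[str, bool]


class StateBus:
    """Sketch: in-process pubsub with one queue per subscriber."""

    def __init__(self, initial: EmergencyState) -> None:
        # `initial` is an assumption of this sketch, not part of the spec.
        self._current = initial
        self._queues: set[asyncio.Queue] = set()

    def current(self) -> EmergencyState:
        return self._current

    async def subscribe(self) -> AsyncIterator[EmergencyState]:
        # Async generator: the queue is registered when iteration starts.
        queue: asyncio.Queue = asyncio.Queue()
        self._queues.add(queue)
        try:
            while True:
                yield await queue.get()
        finally:
            # Runs when the generator is closed or cancelled.
            self._queues.discard(queue)

    def _emit(self, state: EmergencyState) -> None:
        # Record the new state, then fan it out to every live subscriber.
        self._current = state
        for queue in self._queues:
            queue.put_nowait(state)
```

In this sketch a subscriber only sees states emitted after it starts iterating, so a consumer that needs the present state should read `current()` first and then iterate `subscribe()`.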
### 3.2 `detector.py`
```python
# hearthnet/emergency/detector.py
from hearthnet.emergency.state import StateBus

# EmergencyConfig comes from X04 (config); CapabilityBus comes from M03 (bus).


class Detector:
    def __init__(
        self,
        config: EmergencyConfig,
        bus: CapabilityBus,
        state_bus: StateBus,
    ):
        ...

    async def run(self) -> None:
        """Main loop. Cancel-safe.

        Probe cadence:
        - online   → every EMERGENCY_PROBE_INTERVAL_ONLINE (10s)
        - degraded → every EMERGENCY_PROBE_INTERVAL_OFFLINE (2s)
        - offline  → every EMERGENCY_PROBE_INTERVAL_OFFLINE (2s)

        Each tick:
        1. probe all targets concurrently with 2s timeout
        2. compute new mode
        3. apply debounce (EMERGENCY_TRANSITION_DEBOUNCE_SECONDS, anti-flap)
        4. if mode changed:
           - state_bus._emit(new_state)
           - if entered offline: bus deregisters internet-dependent capabilities
           - if entered online: bus re-registers them
           - emit log + metric
        """

    async def shutdown(self) -> None: ...

    # --- probe primitives ---
    async def _probe_dns(self, host: str) -> bool: ...
    async def _probe_http(self, url: str) -> bool: ...
```
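Step 1 of each tick ("probe all targets concurrently with 2s timeout") could look roughly like the sketch below. The helper name `probe_all`, the `Probe` alias, and passing probes in as a mapping are assumptions made for illustration; the real `Detector` would dispatch to `_probe_dns` and `_probe_http`.

```python
import asyncio
from typing import Awaitable, Callable

# Assumed value of EMERGENCY_PROBE_TIMEOUT_SECONDS, taken from the docstring.
PROBE_TIMEOUT_SECONDS = 2.0

# Hypothetical alias: a probe takes a target and resolves to success/failure.
Probe = Callable[[str], Awaitable[bool]]


async def probe_all(
    probes: dict[str, Probe],
    timeout: float = PROBE_TIMEOUT_SECONDS,
) -> dict[str, bool]:
    """Run every probe concurrently; a timeout or error counts as failure."""

    async def guarded(target: str, probe: Probe) -> bool:
        try:
            return await asyncio.wait_for(probe(target), timeout)
        except Exception:
            # Timeouts and network errors are the normal "probe failed" signal.
            # CancelledError is a BaseException, so cancellation still propagates.
            return False

    targets = list(probes)
    results = await asyncio.gather(*(guarded(t, probes[t]) for t in targets))
    return dict(zip(targets, results))
```

The returned mapping has the same shape as `EmergencyState.probe_results`, so it can be stored on the state object directly.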
---
## 4. State machine
```
┌────────┐  any probe fails   ┌──────────┐
│ ONLINE ├───────────────────►│ DEGRADED │
└───┬────┘                    └─────┬────┘
    ▲                               │ ≥2 probes fail for 30s
    │ all probes pass for 10s       ▼
    │                         ┌──────────┐
    └─────────────────────────┤ OFFLINE  │
                              └──────────┘
```
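The timed edges in the diagram ("for 30s", "for 10s") mean a condition has to hold continuously before the transition fires. A minimal sketch of that idea follows; the class name `SustainedCondition` is hypothetical, and timestamps are passed in explicitly so the logic can be tested without a real clock.

```python
from typing import Optional


class SustainedCondition:
    """Reports True only once a condition has held for `hold_seconds` without a break."""

    def __init__(self, hold_seconds: float) -> None:
        self._hold = hold_seconds
        self._since: Optional[float] = None  # when the condition last became true

    def update(self, condition: bool, now: float) -> bool:
        if not condition:
            # Any interruption resets the timer.
            self._since = None
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self._hold


# Usage: one tracker per timed edge of the state machine.
go_offline = SustainedCondition(hold_seconds=30.0)  # ≥2 probes failing
go_online = SustainedCondition(hold_seconds=10.0)   # all probes passing
```

In a running detector `now` would typically come from `time.monotonic()`, which is not affected by wall-clock adjustments.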
Anti-flap: if more than 3 transitions occur within 60 seconds, the detector stays in the more pessimistic state (degraded or offline) until the window passes.
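The anti-flap rule can be sketched as a sliding window over recent transition timestamps. The names `FlapGuard` and `SEVERITY` are hypothetical, and the sketch reads "more than 3 transitions" as "the window already holds more than 3 accepted transitions"; moves toward a more pessimistic mode are always allowed.

```python
from collections import deque

# Higher number means more pessimistic.
SEVERITY = {"online": 0, "degraded": 1, "offline": 2}


class FlapGuard:
    """Holds the more pessimistic mode while transitions are too frequent."""

    def __init__(self, max_transitions: int = 3, window_seconds: float = 60.0) -> None:
        self._max = max_transitions
        self._window = window_seconds
        self._stamps: deque = deque()  # timestamps of accepted transitions

    def resolve(self, current: str, proposed: str, now: float) -> str:
        """Return the mode to adopt at time `now`."""
        if proposed == current:
            return current
        # Drop transitions that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        flapping = len(self._stamps) > self._max
        if flapping and SEVERITY[proposed] < SEVERITY[current]:
            return current  # refuse the optimistic move until the window passes
        self._stamps.append(now)
        return proposed
```

Once enough old transitions age out of the 60-second window, the next optimistic proposal is accepted again.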
---
## 5. Behaviour
### 5.1 Probes
Default targets (from `EmergencyConfig.probe_targets`):
- `1.1.1.1` (DNS A query)
- `8.8.8.8` (DNS A query)
- `cloudflare.com` (HTTPS HEAD)
- `quad9.net` (HTTPS HEAD)
Mode rule:
- `online` requires all 4 probes to succeed
- `offline` requires ≥ 2 probes to fail
- anything in between (exactly 1 failure) is `degraded`
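The mode rule above reduces to counting failed probes. A minimal sketch, assuming a non-empty result mapping shaped like `EmergencyState.probe_results`; the function name `compute_mode` is hypothetical, and the result is the raw classification before any debounce or anti-flap logic is applied.

```python
from typing import Literal

Mode = Literal["online", "degraded", "offline"]


def compute_mode(probe_results: dict[str, bool]) -> Mode:
    """Classify one round of probe results (target -> success)."""
    failures = sum(1 for ok in probe_results.values() if not ok)
    if failures == 0:
        return "online"
    if failures >= 2:
        return "offline"
    return "degraded"  # exactly one probe failed


# Example with the default targets: one HTTPS probe failing.
results = {"1.1.1.1": True, "8.8.8.8": True, "cloudflare.com": True, "quad9.net": False}
mode = compute_mode(results)  # "degraded"
```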
### 5.2 Effects on the bus
When entering `offline`:
```python
for entry in bus.registry.all_local():
    if entry.descriptor.params.get("requires_internet"):
        bus.registry.deregister_local(entry.descriptor.name, entry.descriptor.version)
        log.info("offline.deregistered", capability=entry.descriptor.name)
```
When returning to `online`:
```python
for backend in llm_service._backends:
    if backend.requires_internet:
        llm_service._register_backend(backend)  # re-emit descriptors
```
`requires_internet` is a convention: services that wrap remote APIs (`anthropic_api`, `hf_api`) set this flag on their `BackendModel` and inject it into the capability descriptor params at registration time.
### 5.3 Effects on M02 discovery
The detector also calls `peer_registry.set_pruning_aggressive(offline)`:
- Offline: prune stale peers after 30 s instead of 90 s
- Online: standard 90 s
This lets offline mode adapt faster to neighbour churn.
### 5.4 UI surface (M08 consumes)
The state bus drives the amber `INTERNET OFFLINE — LOKAL AKTIV` ("local active") banner. The UI subscribes to it, flips the theme, and visibly switches LLM passthrough to local-only backends.
### 5.5 Clock sanity probe (only when online)
When online for ≥ 30 s, send an extra HEAD request to a single anchor and check the `Date` response header. If the system clock differs from it by more than 60 s, log a warning. We do NOT auto-correct.
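The comparison itself needs only the standard library. The sketch below assumes the `Date` header value has already been fetched; the helper names and `MAX_SKEW_SECONDS` are hypothetical, with the 60 s threshold taken from the paragraph above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MAX_SKEW_SECONDS = 60.0  # threshold from the rule above


def clock_skew_seconds(date_header: str, local_now: datetime) -> float:
    """Positive result: the local clock is ahead of the server."""
    remote = parsedate_to_datetime(date_header)  # parses RFC 5322 / HTTP-date
    return (local_now - remote).total_seconds()


def clock_is_sane(date_header: str, local_now: datetime) -> bool:
    return abs(clock_skew_seconds(date_header, local_now)) <= MAX_SKEW_SECONDS


# Example: the local clock is 90 s ahead of the anchor's Date header.
header = "Tue, 15 Nov 1994 08:12:31 GMT"
local = datetime(1994, 11, 15, 8, 14, 1, tzinfo=timezone.utc)
sane = clock_is_sane(header, local)  # False, so only a warning is logged
```

`local_now` must be timezone-aware (for example `datetime.now(timezone.utc)`); subtracting a naive datetime from the parsed header raises `TypeError`.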
### 5.6 No on-wire pubsub
`emergency.mode.changed` is local only ([CONTRACT §8](../CAPABILITY_CONTRACT.md)). Other nodes do their own detection.
---
## 6. Errors
This module raises nothing externally; all failures are logged. Internal probe failures are the *normal* signal that drives state.
---
## 7. Configuration
From [X04 §3](../cross-cutting/X04-config.md):
```python
config.emergency.probe_targets # list[str]
```
Constants: `EMERGENCY_PROBE_INTERVAL_ONLINE`, `EMERGENCY_PROBE_INTERVAL_OFFLINE`, `EMERGENCY_PROBE_TIMEOUT_SECONDS`, `EMERGENCY_TRANSITION_DEBOUNCE_SECONDS`.
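As an illustration of how the interval constants relate to the mode, here is a small sketch. The numeric values are the ones quoted in the `Detector.run` docstring (10 s online, 2 s otherwise); the helper name `probe_interval` is hypothetical, and the authoritative definitions are expected to live in X04.

```python
# Values as quoted in the Detector.run docstring; defined authoritatively in X04.
EMERGENCY_PROBE_INTERVAL_ONLINE = 10.0   # seconds
EMERGENCY_PROBE_INTERVAL_OFFLINE = 2.0   # seconds, also used while degraded


def probe_interval(mode: str) -> float:
    """Seconds to sleep before the next probe round."""
    if mode == "online":
        return EMERGENCY_PROBE_INTERVAL_ONLINE
    return EMERGENCY_PROBE_INTERVAL_OFFLINE
```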
---
## 8. Tests
### Unit
- `test_state_transitions_with_synthetic_probes`
- `test_anti_flap_holds_pessimistic_state`
- `test_deregister_called_on_offline_entry`
- `test_reregister_called_on_online_entry`
### Integration
- `test_demo_unplug_triggers_banner_within_5s`: simulate a WAN drop with an `iptables` rule and observe the state change
---
## 9. Cross-references
| What | Where |
|------|-------|
| Online/offline pubsub topic (local) | [CONTRACT §8](../CAPABILITY_CONTRACT.md) |
| LLM internet-dependent backends | [M04 §4.3](M04-llm.md) |
| Discovery cadence change | [M02 §4.3](M02-discovery.md) |
| UI banner | [M08 §5.5](M08-ui.md) |
---
## 10. Open questions
1. **Captive portal detection.** Phase 2: probe a known-content URL and compare the body hash. MVP: false positives accepted.
2. **IPv6-only networks.** Current probes are dual-stack via the OS. Should work; not yet tested.
3. **Custom probe scripts.** Phase 2: let users add their own targets.
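For open question 1, the body-hash comparison might be as small as the sketch below. It is purely illustrative: the function name, the choice of SHA-256, and the sample body are assumptions, and fetching the known-content URL is left out.

```python
import hashlib


def looks_like_captive_portal(body: bytes, expected_sha256: str) -> bool:
    """True if the fetched body does not match the known content."""
    return hashlib.sha256(body).hexdigest() != expected_sha256


# Example with a made-up known-content body.
known_body = b"hearthnet-ok\n"
expected = hashlib.sha256(known_body).hexdigest()
intercepted = looks_like_captive_portal(b"<html>Please log in</html>", expected)  # True
```

A captive portal typically answers the request with its own login page, so the probe succeeds at the HTTP level but the hash differs.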