HearthNet-Nemotron / docs /X03-observability.md
Chris4K's picture
prd splitted + contracts
6f9a5fd
|
Raw
History Blame
7.37 kB

X03 β€” Observability

Spec version: v1.0 Depends on: X04 (config), stdlib, prometheus_client (optional) Depended on by: every module that does I/O


1. Responsibility

Provides logging, metrics, tracing, and self-diagnostics. No module imports logging directly; they import get_logger(__name__) from this module.


2. File layout

hearthnet/observability/
β”œβ”€β”€ __init__.py        # exports: get_logger, metrics, trace, doctor
β”œβ”€β”€ logging.py         # structured JSON logging
β”œβ”€β”€ metrics.py         # Prometheus-compatible counters/histograms
β”œβ”€β”€ tracing.py         # per-request trace IDs + ring buffer
└── doctor.py          # self-diagnostics

3. Logging

3.1 Public API

# hearthnet.observability.logging

def configure(config: ObservabilityConfig) -> None:
    """Install handlers, formatters, rotation. Idempotent. Call once at startup."""

def get_logger(name: str) -> Logger:
    """Return a stdlib logger configured to emit JSON lines.
       Convention: name = module's __name__ (e.g. 'hearthnet.bus.router')."""

class JsonFormatter(logging.Formatter):
    """Renders LogRecords as one-line JSON: ts, level, logger, msg, **extras."""

3.2 Conventions

  • Use extra= to attach structured fields: log.info("routed", extra={"capability": "llm.chat", "to": node_id, "ms": 12})
  • Never f"log message {variable}" for production diagnostics; use structured fields instead
  • Log levels:
    • debug β€” internal state, only useful with --debug
    • info β€” meaningful protocol events (manifest received, capability registered, peer joined)
    • warning β€” recoverable problem (signature failed, peer unreachable, quarantine)
    • error β€” unexpected failure (exception caught, service crash)
  • Exceptions: always log.exception("what happened", extra={...}) β€” captures traceback automatically
  • Rate-limit noisy warnings: RateLimitedLogger wrapper, log at most once per second per (logger, message_key)

3.3 On-disk format

{"ts":"2026-05-26T08:14:22.281Z","level":"info","logger":"hearthnet.bus.router","msg":"routed","trace_id":"01HXR...","capability":"llm.chat","to":"7H4G-...","ms":12}

One line per event. Files rotate daily at midnight UTC. Retention: LOG_RETENTION_DAYS = 14 from constants.


4. Metrics

4.1 Public API

# hearthnet.observability.metrics

def configure(config: ObservabilityConfig) -> None:
    """Set up registries, start the metrics endpoint if enabled."""

# Counter / histogram / gauge factory functions:
def counter(name: str, doc: str, labels: list[str] = []) -> Counter
def histogram(name: str, doc: str, labels: list[str] = [], buckets: list[float] | None = None) -> Histogram
def gauge(name: str, doc: str, labels: list[str] = []) -> Gauge

# Convenience for "everything else returns None when metrics disabled":
def disabled() -> bool

4.2 Standard metric set

hearthnet_requests_total{capability, result}                     counter
hearthnet_request_duration_ms{capability, quantile}              histogram
hearthnet_active_streams{capability}                             gauge
hearthnet_nodes_online{community}                                gauge
hearthnet_event_log_size{community}                              gauge
hearthnet_event_log_lamport_head{community}                      gauge
hearthnet_emergency_mode{state}                                  gauge   // 0 or 1
hearthnet_blob_storage_bytes                                     gauge
hearthnet_llm_tokens_generated_total{model, backend}             counter
hearthnet_llm_concurrent{model}                                  gauge
hearthnet_capability_health_success_rate{capability, node}       gauge
hearthnet_rate_limited_total{capability, reason}                 counter
hearthnet_signature_failures_total{reason}                       counter

4.3 Scrape endpoint

GET /metrics on the transport server (port 7080). Plain text, Prometheus format. No auth β€” same trust domain as the rest of the bus.


5. Tracing

5.1 Public API

# hearthnet.observability.tracing

class Trace:
    trace_id:   str
    capability: str
    started_at: float       # monotonic seconds
    spans:      list[Span]

@contextmanager
def span(name: str, **extras) -> Iterator[Span]:
    """Open a sub-span on the current trace. Auto-closes."""

def new_trace(capability: str) -> Trace:
    """Start a new trace (typically at the top of a capability handler)."""

def current_trace() -> Trace | None:
    """Get the trace attached to the current asyncio task."""

def attach(trace: Trace) -> None:
    """Attach a trace to the current task. Used by transport when it receives a request with an X-HearthNet-Request-Id."""

def detach() -> None:
    """End the trace; record to the ring buffer; emit done log."""

def get_recent(n: int = 100) -> list[Trace]:
    """Return last N completed traces from the ring buffer (used by /trace endpoint)."""

5.2 Storage

Ring buffer in memory, TRACE_RING_BUFFER = 10000 from constants. Optionally exported to OpenTelemetry in Phase 2.

5.3 Trace IDs are ULIDs

ULIDs are used because they sort by time and need no separate timestamp field.


6. Doctor

6.1 Public API

# hearthnet.observability.doctor

@dataclass
class CheckResult:
    name:    str
    ok:      bool
    detail:  str
    fix:     str | None

def run_all(config: Config, bus: CapabilityBus) -> list[CheckResult]:
    """Run every check; return list of results."""

def run_one(name: str, config: Config, bus: CapabilityBus) -> CheckResult:
    """Run a single named check."""

# Each check is a registered function:
def register(name: str, check: Callable[[Config, CapabilityBus], CheckResult]) -> None

6.2 Standard checks

Name Verifies
keys_present Device key file exists, has 0600 permissions
keys_loadable Keys parse as Ed25519
community_present Community manifest exists
event_log_writable SQLite open and writable
mdns_socket mDNS socket can bind
udp_multicast UDP discovery socket can bind
transport_port Bus port is free or owned by us
at_least_one_capability Bus has registered β‰₯ 1 capability
disk_space Free space β‰₯ 1 GB
clock_sanity System clock within Β±60s of HTTP-reachable anchor (only when internet up)
llm_backend_reachable At least one LLM backend responds
recent_error_rate Last 100 traces have < 20% error rate

6.3 CLI integration

hearthnet doctor runs run_all, prints a coloured report, exits non-zero on any failure. See M12.


7. Tests

  • test_logger_writes_json_lines β€” assert each line parses as JSON with expected fields
  • test_metrics_endpoint_format β€” Prometheus text format conforms
  • test_trace_context_propagation β€” attach/detach round-trips across asyncio.gather
  • test_doctor_all_pass_on_default_config β€” run_all returns all-OK on fresh init
  • test_doctor_keys_missing β€” failure case for keys_present

8. References