Spaces:
Running on Zero
X03 β Observability
Spec version: v1.0 Depends on: X04 (config), stdlib, prometheus_client (optional) Depended on by: every module that does I/O
1. Responsibility
Provides logging, metrics, tracing, and self-diagnostics. No module imports logging directly; they import get_logger(__name__) from this module.
2. File layout
hearthnet/observability/
βββ __init__.py # exports: get_logger, metrics, trace, doctor
βββ logging.py # structured JSON logging
βββ metrics.py # Prometheus-compatible counters/histograms
βββ tracing.py # per-request trace IDs + ring buffer
βββ doctor.py # self-diagnostics
3. Logging
3.1 Public API
# hearthnet.observability.logging
def configure(config: ObservabilityConfig) -> None:
"""Install handlers, formatters, rotation. Idempotent. Call once at startup."""
def get_logger(name: str) -> Logger:
"""Return a stdlib logger configured to emit JSON lines.
Convention: name = module's __name__ (e.g. 'hearthnet.bus.router')."""
class JsonFormatter(logging.Formatter):
"""Renders LogRecords as one-line JSON: ts, level, logger, msg, **extras."""
3.2 Conventions
- Use
extra=to attach structured fields:log.info("routed", extra={"capability": "llm.chat", "to": node_id, "ms": 12}) - Never
f"log message {variable}"for production diagnostics; use structured fields instead - Log levels:
debugβ internal state, only useful with--debuginfoβ meaningful protocol events (manifest received, capability registered, peer joined)warningβ recoverable problem (signature failed, peer unreachable, quarantine)errorβ unexpected failure (exception caught, service crash)
- Exceptions: always
log.exception("what happened", extra={...})β captures traceback automatically - Rate-limit noisy warnings:
RateLimitedLoggerwrapper, log at most once per second per (logger, message_key)
3.3 On-disk format
{"ts":"2026-05-26T08:14:22.281Z","level":"info","logger":"hearthnet.bus.router","msg":"routed","trace_id":"01HXR...","capability":"llm.chat","to":"7H4G-...","ms":12}
One line per event. Files rotate daily at midnight UTC. Retention: LOG_RETENTION_DAYS = 14 from constants.
4. Metrics
4.1 Public API
# hearthnet.observability.metrics
def configure(config: ObservabilityConfig) -> None:
"""Set up registries, start the metrics endpoint if enabled."""
# Counter / histogram / gauge factory functions:
def counter(name: str, doc: str, labels: list[str] = []) -> Counter
def histogram(name: str, doc: str, labels: list[str] = [], buckets: list[float] | None = None) -> Histogram
def gauge(name: str, doc: str, labels: list[str] = []) -> Gauge
# Convenience for "everything else returns None when metrics disabled":
def disabled() -> bool
4.2 Standard metric set
hearthnet_requests_total{capability, result} counter
hearthnet_request_duration_ms{capability, quantile} histogram
hearthnet_active_streams{capability} gauge
hearthnet_nodes_online{community} gauge
hearthnet_event_log_size{community} gauge
hearthnet_event_log_lamport_head{community} gauge
hearthnet_emergency_mode{state} gauge // 0 or 1
hearthnet_blob_storage_bytes gauge
hearthnet_llm_tokens_generated_total{model, backend} counter
hearthnet_llm_concurrent{model} gauge
hearthnet_capability_health_success_rate{capability, node} gauge
hearthnet_rate_limited_total{capability, reason} counter
hearthnet_signature_failures_total{reason} counter
4.3 Scrape endpoint
GET /metrics on the transport server (port 7080). Plain text, Prometheus format. No auth β same trust domain as the rest of the bus.
5. Tracing
5.1 Public API
# hearthnet.observability.tracing
class Trace:
trace_id: str
capability: str
started_at: float # monotonic seconds
spans: list[Span]
@contextmanager
def span(name: str, **extras) -> Iterator[Span]:
"""Open a sub-span on the current trace. Auto-closes."""
def new_trace(capability: str) -> Trace:
"""Start a new trace (typically at the top of a capability handler)."""
def current_trace() -> Trace | None:
"""Get the trace attached to the current asyncio task."""
def attach(trace: Trace) -> None:
"""Attach a trace to the current task. Used by transport when it receives a request with an X-HearthNet-Request-Id."""
def detach() -> None:
"""End the trace; record to the ring buffer; emit done log."""
def get_recent(n: int = 100) -> list[Trace]:
"""Return last N completed traces from the ring buffer (used by /trace endpoint)."""
5.2 Storage
Ring buffer in memory, TRACE_RING_BUFFER = 10000 from constants. Optionally exported to OpenTelemetry in Phase 2.
5.3 Trace IDs are ULIDs
ULIDs are used because they sort by time and need no separate timestamp field.
6. Doctor
6.1 Public API
# hearthnet.observability.doctor
@dataclass
class CheckResult:
name: str
ok: bool
detail: str
fix: str | None
def run_all(config: Config, bus: CapabilityBus) -> list[CheckResult]:
"""Run every check; return list of results."""
def run_one(name: str, config: Config, bus: CapabilityBus) -> CheckResult:
"""Run a single named check."""
# Each check is a registered function:
def register(name: str, check: Callable[[Config, CapabilityBus], CheckResult]) -> None
6.2 Standard checks
| Name | Verifies |
|---|---|
keys_present |
Device key file exists, has 0600 permissions |
keys_loadable |
Keys parse as Ed25519 |
community_present |
Community manifest exists |
event_log_writable |
SQLite open and writable |
mdns_socket |
mDNS socket can bind |
udp_multicast |
UDP discovery socket can bind |
transport_port |
Bus port is free or owned by us |
at_least_one_capability |
Bus has registered β₯ 1 capability |
disk_space |
Free space β₯ 1 GB |
clock_sanity |
System clock within Β±60s of HTTP-reachable anchor (only when internet up) |
llm_backend_reachable |
At least one LLM backend responds |
recent_error_rate |
Last 100 traces have < 20% error rate |
6.3 CLI integration
hearthnet doctor runs run_all, prints a coloured report, exits non-zero on any failure. See M12.
7. Tests
test_logger_writes_json_linesβ assert each line parses as JSON with expected fieldstest_metrics_endpoint_formatβ Prometheus text format conformstest_trace_context_propagationβattach/detachround-trips acrossasyncio.gathertest_doctor_all_pass_on_default_configβrun_allreturns all-OK on fresh inittest_doctor_keys_missingβ failure case forkeys_present
8. References
- Config: X04 Β§3
- Trace IDs propagate via CONTRACT Β§5.1
X-HearthNet-Request-Id - Bus emits trace events: M03 Β§5.6