Spaces:
Running on Zero
Running on Zero
File size: 7,373 Bytes
6f9a5fd | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 | # X03 β Observability
**Spec version:** v1.0
**Depends on:** X04 (config), stdlib, prometheus_client (optional)
**Depended on by:** every module that does I/O
---
## 1. Responsibility
Provides logging, metrics, tracing, and self-diagnostics. No module imports `logging` directly; they import `get_logger(__name__)` from this module.
---
## 2. File layout
```
hearthnet/observability/
βββ __init__.py # exports: get_logger, metrics, trace, doctor
βββ logging.py # structured JSON logging
βββ metrics.py # Prometheus-compatible counters/histograms
βββ tracing.py # per-request trace IDs + ring buffer
βββ doctor.py # self-diagnostics
```
---
## 3. Logging
### 3.1 Public API
```python
# hearthnet.observability.logging
def configure(config: ObservabilityConfig) -> None:
"""Install handlers, formatters, rotation. Idempotent. Call once at startup."""
def get_logger(name: str) -> Logger:
"""Return a stdlib logger configured to emit JSON lines.
Convention: name = module's __name__ (e.g. 'hearthnet.bus.router')."""
class JsonFormatter(logging.Formatter):
"""Renders LogRecords as one-line JSON: ts, level, logger, msg, **extras."""
```
### 3.2 Conventions
- Use `extra=` to attach structured fields: `log.info("routed", extra={"capability": "llm.chat", "to": node_id, "ms": 12})`
- Never `f"log message {variable}"` for production diagnostics; use structured fields instead
- Log levels:
- `debug` β internal state, only useful with `--debug`
- `info` β meaningful protocol events (manifest received, capability registered, peer joined)
- `warning` β recoverable problem (signature failed, peer unreachable, quarantine)
- `error` β unexpected failure (exception caught, service crash)
- Exceptions: always `log.exception("what happened", extra={...})` β captures traceback automatically
- Rate-limit noisy warnings: `RateLimitedLogger` wrapper, log at most once per second per (logger, message_key)
### 3.3 On-disk format
```json
{"ts":"2026-05-26T08:14:22.281Z","level":"info","logger":"hearthnet.bus.router","msg":"routed","trace_id":"01HXR...","capability":"llm.chat","to":"7H4G-...","ms":12}
```
One line per event. Files rotate daily at midnight UTC. Retention: `LOG_RETENTION_DAYS = 14` from constants.
---
## 4. Metrics
### 4.1 Public API
```python
# hearthnet.observability.metrics
def configure(config: ObservabilityConfig) -> None:
"""Set up registries, start the metrics endpoint if enabled."""
# Counter / histogram / gauge factory functions:
def counter(name: str, doc: str, labels: list[str] = []) -> Counter
def histogram(name: str, doc: str, labels: list[str] = [], buckets: list[float] | None = None) -> Histogram
def gauge(name: str, doc: str, labels: list[str] = []) -> Gauge
# Convenience for "everything else returns None when metrics disabled":
def disabled() -> bool
```
### 4.2 Standard metric set
```
hearthnet_requests_total{capability, result} counter
hearthnet_request_duration_ms{capability, quantile} histogram
hearthnet_active_streams{capability} gauge
hearthnet_nodes_online{community} gauge
hearthnet_event_log_size{community} gauge
hearthnet_event_log_lamport_head{community} gauge
hearthnet_emergency_mode{state} gauge // 0 or 1
hearthnet_blob_storage_bytes gauge
hearthnet_llm_tokens_generated_total{model, backend} counter
hearthnet_llm_concurrent{model} gauge
hearthnet_capability_health_success_rate{capability, node} gauge
hearthnet_rate_limited_total{capability, reason} counter
hearthnet_signature_failures_total{reason} counter
```
### 4.3 Scrape endpoint
`GET /metrics` on the transport server (port 7080). Plain text, Prometheus format. No auth β same trust domain as the rest of the bus.
---
## 5. Tracing
### 5.1 Public API
```python
# hearthnet.observability.tracing
class Trace:
trace_id: str
capability: str
started_at: float # monotonic seconds
spans: list[Span]
@contextmanager
def span(name: str, **extras) -> Iterator[Span]:
"""Open a sub-span on the current trace. Auto-closes."""
def new_trace(capability: str) -> Trace:
"""Start a new trace (typically at the top of a capability handler)."""
def current_trace() -> Trace | None:
"""Get the trace attached to the current asyncio task."""
def attach(trace: Trace) -> None:
"""Attach a trace to the current task. Used by transport when it receives a request with an X-HearthNet-Request-Id."""
def detach() -> None:
"""End the trace; record to the ring buffer; emit done log."""
def get_recent(n: int = 100) -> list[Trace]:
"""Return last N completed traces from the ring buffer (used by /trace endpoint)."""
```
### 5.2 Storage
Ring buffer in memory, `TRACE_RING_BUFFER = 10000` from constants. Optionally exported to OpenTelemetry in Phase 2.
### 5.3 Trace IDs are ULIDs
ULIDs are used because they sort by time and need no separate timestamp field.
---
## 6. Doctor
### 6.1 Public API
```python
# hearthnet.observability.doctor
@dataclass
class CheckResult:
name: str
ok: bool
detail: str
fix: str | None
def run_all(config: Config, bus: CapabilityBus) -> list[CheckResult]:
"""Run every check; return list of results."""
def run_one(name: str, config: Config, bus: CapabilityBus) -> CheckResult:
"""Run a single named check."""
# Each check is a registered function:
def register(name: str, check: Callable[[Config, CapabilityBus], CheckResult]) -> None
```
### 6.2 Standard checks
| Name | Verifies |
|------|----------|
| `keys_present` | Device key file exists, has 0600 permissions |
| `keys_loadable` | Keys parse as Ed25519 |
| `community_present` | Community manifest exists |
| `event_log_writable` | SQLite open and writable |
| `mdns_socket` | mDNS socket can bind |
| `udp_multicast` | UDP discovery socket can bind |
| `transport_port` | Bus port is free or owned by us |
| `at_least_one_capability` | Bus has registered β₯ 1 capability |
| `disk_space` | Free space β₯ 1 GB |
| `clock_sanity` | System clock within Β±60s of HTTP-reachable anchor (only when internet up) |
| `llm_backend_reachable` | At least one LLM backend responds |
| `recent_error_rate` | Last 100 traces have < 20% error rate |
### 6.3 CLI integration
`hearthnet doctor` runs `run_all`, prints a coloured report, exits non-zero on any failure. See [M12](../modules/M12-cli.md).
---
## 7. Tests
- `test_logger_writes_json_lines` β assert each line parses as JSON with expected fields
- `test_metrics_endpoint_format` β Prometheus text format conforms
- `test_trace_context_propagation` β `attach`/`detach` round-trips across `asyncio.gather`
- `test_doctor_all_pass_on_default_config` β `run_all` returns all-OK on fresh init
- `test_doctor_keys_missing` β failure case for `keys_present`
---
## 8. References
- Config: [X04 Β§3](X04-config.md)
- Trace IDs propagate via [CONTRACT Β§5.1](../CAPABILITY_CONTRACT.md) `X-HearthNet-Request-Id`
- Bus emits trace events: [M03 Β§5.6](../modules/M03-bus.md)
|