# X03: Observability

**Spec version:** v1.0
**Depends on:** X04 (config), stdlib, prometheus_client (optional)
**Depended on by:** every module that does I/O

---

## 1. Responsibility

Provides logging, metrics, tracing, and self-diagnostics. No module imports `logging` directly; each imports `get_logger` from this module and calls `get_logger(__name__)`.

---

## 2. File layout

```
hearthnet/observability/
├── __init__.py        # exports: get_logger, metrics, trace, doctor
├── logging.py         # structured JSON logging
├── metrics.py         # Prometheus-compatible counters/histograms
├── tracing.py         # per-request trace IDs + ring buffer
└── doctor.py          # self-diagnostics
```

---

## 3. Logging

### 3.1 Public API

```python
# hearthnet.observability.logging

import logging

def configure(config: ObservabilityConfig) -> None:
    """Install handlers, formatters, rotation. Idempotent. Call once at startup."""

def get_logger(name: str) -> logging.Logger:
    """Return a stdlib logger configured to emit JSON lines.
       Convention: name = module's __name__ (e.g. 'hearthnet.bus.router')."""

class JsonFormatter(logging.Formatter):
    """Renders LogRecords as one-line JSON: ts, level, logger, msg, **extras."""
```

### 3.2 Conventions

- Use `extra=` to attach structured fields: `log.info("routed", extra={"capability": "llm.chat", "to": node_id, "ms": 12})`
- Never interpolate variables into the message (`f"log message {variable}"`) for production diagnostics; use structured fields instead
- Log levels:
  - `debug`: internal state, only useful with `--debug`
  - `info`: meaningful protocol events (manifest received, capability registered, peer joined)
  - `warning`: recoverable problem (signature failed, peer unreachable, quarantine)
  - `error`: unexpected failure (exception caught, service crash)
- Exceptions: always `log.exception("what happened", extra={...})`, which captures the traceback automatically
- Rate-limit noisy warnings with the `RateLimitedLogger` wrapper, which logs at most once per second per `(logger, message_key)` pair
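
The rate-limiting convention can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the constructor arguments, the `warning` signature, and the injectable `clock` are assumptions made so the behaviour is easy to test. One wrapper instance is assumed per logger, so the suppression key reduces to `message_key`.

```python
import logging
import time


class RateLimitedLogger:
    """Hypothetical wrapper: drops repeats of a warning within `interval` seconds."""

    def __init__(self, logger, interval=1.0, clock=time.monotonic):
        self._logger = logger
        self._interval = interval
        self._clock = clock  # injectable so tests do not need to sleep
        self._last = {}      # message_key -> time of the last emitted record

    def warning(self, message_key, msg, **fields):
        """Emit `msg` unless `message_key` was logged less than `interval` ago.

        Returns True when the record was emitted, False when it was suppressed.
        """
        now = self._clock()
        last = self._last.get(message_key)
        if last is not None and now - last < self._interval:
            return False
        self._last[message_key] = now
        # Structured fields travel through `extra=`, as the conventions require.
        self._logger.warning(msg, extra=fields)
        return True
```

A call site might look like `limited.warning("peer_unreachable", "peer unreachable", peer=node_id)`, where `limited` wraps the module's logger.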

### 3.3 On-disk format

```json
{"ts":"2026-05-26T08:14:22.281Z","level":"info","logger":"hearthnet.bus.router","msg":"routed","trace_id":"01HXR...","capability":"llm.chat","to":"7H4G-...","ms":12}
```

One line per event. Files rotate daily at midnight UTC. Retention: `LOG_RETENTION_DAYS = 14` from constants.
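
A formatter producing this line format, plus a handler with daily UTC rotation, could look roughly like the sketch below. It uses only the standard library. The `exc` field name, the `build_file_handler` helper, and mapping `LOG_RETENTION_DAYS` onto `backupCount` are assumptions for illustration.

```python
import json
import logging
from datetime import datetime, timezone
from logging.handlers import TimedRotatingFileHandler

# Attributes every LogRecord carries; anything else arrived through `extra=`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render a LogRecord as one compact JSON line."""

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Copy structured fields that were attached with `extra=`.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        if record.exc_info:
            # Assumed field name for the traceback text.
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str, separators=(",", ":"))


def build_file_handler(path, retention_days=14):
    """Hypothetical helper: rotate at midnight UTC, keep `retention_days` old files."""
    handler = TimedRotatingFileHandler(
        path, when="midnight", utc=True, backupCount=retention_days
    )
    handler.setFormatter(JsonFormatter())
    return handler
```

`json.dumps` escapes newlines inside strings, so multi-line tracebacks still occupy a single output line.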

---

## 4. Metrics

### 4.1 Public API

```python
# hearthnet.observability.metrics

def configure(config: ObservabilityConfig) -> None:
    """Set up registries, start the metrics endpoint if enabled."""

# Counter / histogram / gauge factory functions.
# `labels` defaults to None instead of a shared mutable `[]`; None means "no labels".
def counter(name: str, doc: str, labels: list[str] | None = None) -> Counter: ...
def histogram(name: str, doc: str, labels: list[str] | None = None, buckets: list[float] | None = None) -> Histogram: ...
def gauge(name: str, doc: str, labels: list[str] | None = None) -> Gauge: ...

# Convenience check for the case where metrics are disabled and everything else returns None:
def disabled() -> bool: ...
```

### 4.2 Standard metric set

```
hearthnet_requests_total{capability, result}                     counter
hearthnet_request_duration_ms{capability, quantile}              histogram
hearthnet_active_streams{capability}                             gauge
hearthnet_nodes_online{community}                                gauge
hearthnet_event_log_size{community}                              gauge
hearthnet_event_log_lamport_head{community}                      gauge
hearthnet_emergency_mode{state}                                  gauge   // 0 or 1
hearthnet_blob_storage_bytes                                     gauge
hearthnet_llm_tokens_generated_total{model, backend}             counter
hearthnet_llm_concurrent{model}                                  gauge
hearthnet_capability_health_success_rate{capability, node}       gauge
hearthnet_rate_limited_total{capability, reason}                 counter
hearthnet_signature_failures_total{reason}                       counter
```

### 4.3 Scrape endpoint

`GET /metrics` on the transport server (port 7080). Plain text, Prometheus format. No auth: the endpoint is in the same trust domain as the rest of the bus.

---

## 5. Tracing

### 5.1 Public API

```python
# hearthnet.observability.tracing

from contextlib import contextmanager
from typing import Iterator

class Trace:
    trace_id:   str
    capability: str
    started_at: float       # monotonic seconds
    spans:      list[Span]

@contextmanager
def span(name: str, **extras) -> Iterator[Span]:
    """Open a sub-span on the current trace. Auto-closes."""

def new_trace(capability: str) -> Trace:
    """Start a new trace (typically at the top of a capability handler)."""

def current_trace() -> Trace | None:
    """Get the trace attached to the current asyncio task."""

def attach(trace: Trace) -> None:
    """Attach a trace to the current task. Used by transport when it receives a request with an X-HearthNet-Request-Id."""

def detach() -> None:
    """End the trace; record to the ring buffer; emit done log."""

def get_recent(n: int = 100) -> list[Trace]:
    """Return last N completed traces from the ring buffer (used by /trace endpoint)."""
```
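
One plausible way to attach a trace to the current asyncio task is a `contextvars.ContextVar`, since every task runs in its own copy of the context. The sketch below assumes that mechanism; the `Span` fields are hypothetical, and `detach` and the ring buffer are left out to keep it short.

```python
import asyncio
import contextvars
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:  # hypothetical shape; the spec does not define Span's fields
    name: str
    started_at: float
    ended_at: Optional[float] = None
    extras: dict = field(default_factory=dict)


@dataclass
class Trace:
    trace_id: str
    capability: str
    started_at: float = field(default_factory=time.monotonic)
    spans: list = field(default_factory=list)


# Each asyncio task gets a copy of the context, so traces do not leak between tasks.
_current = contextvars.ContextVar("hearthnet_trace", default=None)


def attach(trace):
    _current.set(trace)


def current_trace():
    return _current.get()


@contextmanager
def span(name, **extras):
    """Open a sub-span on the current trace, if any, and close it on exit."""
    trace = current_trace()
    s = Span(name=name, started_at=time.monotonic(), extras=extras)
    if trace is not None:
        trace.spans.append(s)
    try:
        yield s
    finally:
        s.ended_at = time.monotonic()
```

A handler would typically call `attach(...)` first and then wrap each step in `with span("route", to=node_id): ...`.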

### 5.2 Storage

Completed traces are kept in an in-memory ring buffer sized by `TRACE_RING_BUFFER = 10000` from constants. OpenTelemetry export is an optional Phase 2 feature.

### 5.3 Trace IDs are ULIDs

ULIDs are used because they sort by time and need no separate timestamp field.
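
To make the sorting property concrete, here is a small sketch of ULID generation: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. The function names are hypothetical, and a real deployment might prefer an existing ULID library.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U); it is in ascending ASCII order,
# so comparing encoded strings matches comparing the underlying numbers.
_CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Return a 26-character ULID string."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    value = (timestamp_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(_CROCKFORD[value & 0x1F])  # lowest 5 bits first
        value >>= 5
    return "".join(reversed(chars))


def ulid_timestamp_ms(ulid):
    """Recover the millisecond timestamp from the first 10 characters."""
    ts = 0
    for ch in ulid[:10]:
        ts = (ts << 5) | _CROCKFORD.index(ch)
    return ts
```

IDs created in different milliseconds sort in creation order; within the same millisecond the order depends on the random part.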

---

## 6. Doctor

### 6.1 Public API

```python
# hearthnet.observability.doctor

from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name:    str
    ok:      bool
    detail:  str
    fix:     str | None

def run_all(config: Config, bus: CapabilityBus) -> list[CheckResult]:
    """Run every check; return list of results."""

def run_one(name: str, config: Config, bus: CapabilityBus) -> CheckResult:
    """Run a single named check."""

# Each check is a registered function:
def register(name: str, check: Callable[[Config, CapabilityBus], CheckResult]) -> None:
    """Register `check` under `name` so run_all and run_one can find it."""
```
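
The registry behind this API could be as small as a dict keyed by check name. The sketch below is an assumption-laden illustration: converting an exception raised by a check into a failed `CheckResult` is a design choice made here, not something the spec states, and `config` and `bus` are passed through untyped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str
    fix: Optional[str]


_checks = {}  # name -> check function; dicts keep registration order


def register(name, check):
    _checks[name] = check


def run_one(name, config, bus):
    check = _checks[name]  # an unknown name raises KeyError
    try:
        return check(config, bus)
    except Exception as exc:
        # Assumption: a crashing check is reported as a failure, not propagated.
        return CheckResult(name=name, ok=False, detail=f"check raised: {exc!r}", fix=None)


def run_all(config, bus):
    return [run_one(name, config, bus) for name in _checks]
```

A CLI wrapper could then derive its exit status with `0 if all(r.ok for r in results) else 1`.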

### 6.2 Standard checks

| Name | Verifies |
|------|----------|
| `keys_present` | Device key file exists, has 0600 permissions |
| `keys_loadable` | Keys parse as Ed25519 |
| `community_present` | Community manifest exists |
| `event_log_writable` | SQLite open and writable |
| `mdns_socket` | mDNS socket can bind |
| `udp_multicast` | UDP discovery socket can bind |
| `transport_port` | Bus port is free or owned by us |
| `at_least_one_capability` | Bus has registered ≥ 1 capability |
| `disk_space` | Free space ≥ 1 GB |
| `clock_sanity` | System clock within ±60s of HTTP-reachable anchor (only when internet up) |
| `llm_backend_reachable` | At least one LLM backend responds |
| `recent_error_rate` | Last 100 traces have < 20% error rate |

### 6.3 CLI integration

`hearthnet doctor` runs `run_all`, prints a coloured report, exits non-zero on any failure. See [M12](../modules/M12-cli.md).

---

## 7. Tests

- `test_logger_writes_json_lines`: assert each line parses as JSON with expected fields
- `test_metrics_endpoint_format`: Prometheus text format conforms
- `test_trace_context_propagation`: `attach`/`detach` round-trips across `asyncio.gather`
- `test_doctor_all_pass_on_default_config`: `run_all` returns all-OK on fresh init
- `test_doctor_keys_missing`: failure case for `keys_present`

---

## 8. References

- Config: [X04 §3](X04-config.md)
- Trace IDs propagate via [CONTRACT §5.1](../CAPABILITY_CONTRACT.md) `X-HearthNet-Request-Id`
- Bus emits trace events: [M03 §5.6](../modules/M03-bus.md)