# M04: LLM Service

**Spec version:** v1.0
**Depends on:** M03 (bus), X04 (config), X03 (observability), backend libs (llama-cpp-python, ollama HTTP, httpx for HTTP backends)
**Depended on by:** M05 (RAG uses llm.complete internally), M08 (UI passes user queries through llm.chat)

---

## 1. Responsibility

Provide `llm.chat@1.0` and `llm.complete@1.0`. Wrap multiple inference backends (llama.cpp, Ollama, LM Studio, HF Inference API, Anthropic API, OpenAI-compatible HTTP). Register one capability instance per (backend, model, quant) tuple so the bus can see them as separate routable providers.

---

## 2. File layout

```
hearthnet/services/llm/
├── __init__.py
├── service.py                 # LlmService
├── tokenizers.py              # rough token counting per family
└── backends/
    ├── __init__.py
    ├── base.py                # LlmBackend Protocol
    ├── llama_cpp.py           # llama-cpp-python in-process
    ├── ollama.py              # Ollama HTTP at http://localhost:11434
    ├── lmstudio.py            # LM Studio HTTP (OpenAI-compatible)
    ├── hf_api.py              # HuggingFace Inference API
    └── anthropic_api.py       # Anthropic Messages API
```

---

## 3. Public API

### 3.1 `backends/base.py`

```python
# hearthnet/services/llm/backends/base.py
from dataclasses import dataclass
from typing import AsyncIterator, Protocol

@dataclass(frozen=True)
class Token:
    text:    str
    logprob: float | None
    stop:    bool

@dataclass(frozen=True)
class ChatResult:
    text:       str
    tokens_in:  int
    tokens_out: int
    stop_reason: str    # "end" | "max_tokens" | "stop_sequence" | "cancelled"
    ms:         int

@dataclass(frozen=True)
class BackendModel:
    """One model an LlmBackend can serve."""
    name:           str       # "qwen2.5-7b-instruct"
    quant:          str       # "q4_k_m", "q8_0", "fp16", "api"
    ctx_max:        int       # 8192
    modalities:     list[str] # ["text"] or ["text", "vision"]
    requires_internet: bool   # API backends → True; local → False

class LlmBackend(Protocol):
    """Abstract backend. Implementations cover one provider."""

    name:       str           # "llama_cpp" | "ollama" | ...
    models:     list[BackendModel]

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

    # Declared with plain `def`: implementations are async generators, so calling
    # chat() returns the AsyncIterator directly rather than a coroutine wrapping it.
    def chat(
        self,
        *,
        model: str,
        messages: list[dict],
        max_tokens: int = 1024,
        temperature: float = 0.7,
        top_p: float = 0.95,
        stop: list[str] | None = None,
        seed: int | None = None,
        stream: bool = True,
    ) -> AsyncIterator[Token]:
        """Yields Tokens. The final Token has stop=True."""

    # Same convention as chat(): plain `def` returning an async iterator.
    def complete(
        self,
        *,
        model: str,
        prompt: str,
        max_tokens: int = 256,
        temperature: float = 0.7,
        top_p: float = 0.95,
        stop: list[str] | None = None,
        seed: int | None = None,
        stream: bool = True,
    ) -> AsyncIterator[Token]: ...

    def count_tokens(self, model: str, text: str) -> int:
        """Approximate token count; uses a per-model tokenizer if available."""

    def max_concurrent(self, model: str) -> int:
        """Backend-specific concurrency limit. Used in capability descriptor."""

    def health(self) -> dict: ...
```

### 3.2 Concrete backends

```python
# hearthnet/services/llm/backends/llama_cpp.py
from pathlib import Path

from .base import BackendModel, LlmBackend

class LlamaCppBackend(LlmBackend):
    """In-process llama-cpp-python. Loads one model at a time per instance.
       Multiple LlamaCppBackend instances may coexist if VRAM allows."""

    def __init__(self, model_path: Path, model_meta: BackendModel, gpu_layers: int = -1):
        ...

# hearthnet/services/llm/backends/ollama.py
class OllamaBackend(LlmBackend):
    """HTTP-based Ollama at http://localhost:11434 (or remote)."""

    def __init__(self, base_url: str = "http://localhost:11434", models: list[str] | None = None):
        """If models is None, discover via GET /api/tags."""

# hearthnet/services/llm/backends/lmstudio.py
class LmStudioBackend(LlmBackend):
    """OpenAI-compatible HTTP at http://host:1234.
       Used in Christof's home setup at 192.168.188.25:1234."""

    def __init__(self, base_url: str, default_model: str): ...

# hearthnet/services/llm/backends/hf_api.py
class HfApiBackend(LlmBackend):
    """HuggingFace Inference API. Requires HF_TOKEN env var (declared in config.llm.backends[].api_key_env)."""

    def __init__(self, model: str, token_env: str = "HF_TOKEN"): ...

# hearthnet/services/llm/backends/anthropic_api.py
class AnthropicApiBackend(LlmBackend):
    """Anthropic Messages API. Phase 1.5; useful when internet up."""

    def __init__(self, model: str = "claude-sonnet-4-6", token_env: str = "ANTHROPIC_API_KEY"): ...
```

### 3.3 `tokenizers.py`

```python
# hearthnet/services/llm/tokenizers.py
def count_tokens_approx(model_family: str, text: str) -> int:
    """Fast heuristic: chars / 3.5 for Latin scripts, /2 for CJK.
       Used when no real tokenizer is available."""

def model_family(model_name: str) -> str:
    """'qwen2.5-7b-instruct' → 'qwen', 'llama-3-8b' → 'llama', etc."""
```
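
The heuristic in the docstrings above could be sketched as follows. This is a minimal illustration, not the specified implementation: the Unicode ranges used to detect CJK text, the `_KNOWN_FAMILIES` tuple, the `"unknown"` fallback, and rounding up are all assumptions.

```python
import math

# Hypothetical family list; the spec only names 'qwen' and 'llama' as examples.
_KNOWN_FAMILIES = ("qwen", "llama", "mistral", "gemma", "phi", "claude")


def _is_cjk(ch: str) -> bool:
    """True for characters in a few common CJK blocks (assumed ranges)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def count_tokens_approx(model_family: str, text: str) -> int:
    """chars / 3.5 for non-CJK text, chars / 2 for CJK, rounded up.

    model_family is accepted for signature parity but unused in this sketch.
    """
    cjk = sum(1 for ch in text if _is_cjk(ch))
    other = len(text) - cjk
    return math.ceil(other / 3.5 + cjk / 2)


def model_family(model_name: str) -> str:
    """Return the first known family that prefixes the name, else 'unknown'."""
    lowered = model_name.lower()
    for family in _KNOWN_FAMILIES:
        if lowered.startswith(family):
            return family
    return "unknown"
```

Counting CJK and non-CJK characters separately lets mixed-script text use both divisors at once; a per-family divisor table would be a natural refinement if the single pair of ratios proves too coarse.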

### 3.4 `service.py`

```python
# hearthnet/services/llm/service.py
class LlmService:
    name    = "llm"
    version = "1.0"

    def __init__(self, config: LlmConfig):
        self._backends: list[LlmBackend] = self._build_backends(config)

    def _build_backends(self, config: LlmConfig) -> list[LlmBackend]:
        """Instantiate each declared backend; skip backends that fail to initialise (with warning)."""

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """Emits one (descriptor, handler, predicate) per (backend, model, capability_kind) combo:
        - For each backend × each model: one llm.chat entry and one llm.complete entry.
        - Each descriptor's params include model, quant, ctx, backend."""

    async def start(self) -> None:
        """Warm one backend (the first listed) to avoid cold-start lag on first call."""

    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    # --- handlers ---

    async def handle_chat(self, req: RouteRequest) -> AsyncIterator[dict]:
        """Streams SSE frames per CONTRACT §4.1.
        Picks the backend from req.body['params']['model'] (matched at routing).
        Maps backend Token → SSE 'token' frames; emits 'done' with meta."""

    async def handle_complete(self, req: RouteRequest) -> AsyncIterator[dict]:
        """Same shape as chat but for CONTRACT §4.2."""
```
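
The fan-out described in the `capabilities()` docstring can be illustrated with a small sketch. The `Model` and `Backend` classes and the plain-dict entries are hypothetical stand-ins for `BackendModel`, `LlmBackend`, and `CapabilityDescriptor`; the real method also returns a handler and a predicate per entry.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Model:            # stand-in for BackendModel (subset of fields)
    name: str
    quant: str
    ctx_max: int


@dataclass
class Backend:          # stand-in for an LlmBackend implementation
    name: str
    models: list[Model] = field(default_factory=list)


def enumerate_capabilities(backends: list[Backend]) -> list[dict]:
    """One entry per backend x model x capability kind (chat, complete)."""
    entries = []
    for backend in backends:
        for model in backend.models:
            for kind in ("llm.chat", "llm.complete"):
                entries.append({
                    "name": kind,
                    "params": {
                        "model": model.name,
                        "quant": model.quant,
                        "ctx": model.ctx_max,
                        "backend": backend.name,
                    },
                })
    return entries
```

With two backends serving one and two models respectively, this yields six entries: three `llm.chat` and three `llm.complete`.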

### 3.5 Capability descriptors

For each `(backend, model)` pair, the service registers:

```python
# llm.chat instance
CapabilityDescriptor(
    name="llm.chat",
    version=(1, 0),
    stability="stable",
    request_schema={...},        # CONTRACT §4.1 schema
    response_schema={...},       # for non-stream fallback
    stream_schema={
        "oneOf": [
            {"type": "object", "required": ["text"], "properties": {"text": {"type": "string"}, "logprob": {"type": ["number", "null"]}}},
            {"type": "object", "required": ["tokens_out", "stop_reason", "ms"]}     # done frame
        ]
    },
    params={
        "model":   "<model.name>",
        "quant":   "<model.quant>",
        "ctx":     model.ctx_max,
        "backend": "<backend.name>",
        "modalities": model.modalities,
    },
    max_concurrent=backend.max_concurrent(model.name),
    trust_required="member",
    timeout_seconds=LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS,
    idempotent=False,
)
```

### 3.6 `params_compatible` predicate

```python
def params_compatible(offered: dict, requested: dict) -> bool:
    # Required match: model.
    # Optional match: ctx (caller's must be <= offered).
    if requested.get("model") != offered.get("model"):
        return False
    if "ctx" in requested and requested["ctx"] > offered["ctx"]:
        return False
    return True
```
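
A short usage sketch may help show how the predicate filters offers. The `offers` list and the `matching_backends` helper are hypothetical, and the 32768 context size is invented for illustration; the predicate is repeated so the example runs on its own.

```python
def params_compatible(offered: dict, requested: dict) -> bool:
    # Repeated from the block above so this example is self-contained.
    if requested.get("model") != offered.get("model"):
        return False
    if "ctx" in requested and requested["ctx"] > offered["ctx"]:
        return False
    return True


# Hypothetical offers, shaped like the descriptor params in §3.5.
offers = [
    {"model": "qwen2.5-7b-instruct", "ctx": 8192, "backend": "llama_cpp"},
    {"model": "qwen2.5-7b-instruct", "ctx": 32768, "backend": "lmstudio"},
    {"model": "qwen2.5-1.5b-instruct", "ctx": 8192, "backend": "llama_cpp"},
]


def matching_backends(requested: dict) -> list[str]:
    """Names of the backends whose offer satisfies the request."""
    return [o["backend"] for o in offers if params_compatible(o, requested)]
```

A request that names only the model matches both 7B offers, while adding `"ctx": 16000` narrows the match to the offer with the larger context window.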

---

## 4. Behaviour

### 4.1 Multi-backend selection

Multiple backends may serve the same model name (e.g. llama_cpp local + LM Studio remote both offer `qwen2.5-7b-instruct`). They register as separate capability entries. The bus router picks among them by latency and load; the service has no internal preference logic.

### 4.2 Streaming and cancellation

- `handle_chat` is an async generator
- Each backend `Token` becomes one SSE `token` frame
- On client disconnect, the generator is cancelled: the backend's `chat()` async iterator receives `GeneratorExit` and propagates the cancellation to the underlying library (llama.cpp: set the abort flag; HTTP backends: close the connection)
- Cleanup must complete within 200 ms

### 4.3 Internet-dependent backends

`HfApiBackend` and `AnthropicApiBackend` set `requires_internet=True` on their `BackendModel`. The service still registers them, but the [M09](M09-emergency.md) detector triggers deregistration from the local bus when offline. On restore, they are re-registered.
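
One way to picture the deregister and re-register cycle is the sketch below. How M09 actually signals connectivity changes is not specified here, so the `Registry` class and its `on_connectivity_change` hook are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:                 # stand-in for BackendModel (subset of fields)
    name: str
    requires_internet: bool


class Registry:
    """Hypothetical stand-in for the local bus registration table."""

    def __init__(self, models: list[ModelInfo]) -> None:
        self._all = list(models)
        self.active: set[str] = {m.name for m in models}

    def on_connectivity_change(self, online: bool) -> None:
        """Drop internet-dependent models when offline; restore them when online."""
        for m in self._all:
            if m.requires_internet and not online:
                self.active.discard(m.name)
            else:
                self.active.add(m.name)
```

Local models stay registered throughout; only entries flagged `requires_internet` move in and out of the active set.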

### 4.4 Tool calls (Phase 2)

The `tool_call_delta` stream frame in [CONTRACT §4.1](../CAPABILITY_CONTRACT.md) is reserved. Backends that support tool calls (Anthropic, OpenAI, OpenAI-compatible) will emit these in a future version. MVP: ignored / empty.

### 4.5 Deterministic mode

If `seed` is present in the request, backends that support seeded sampling apply it. `llama_cpp` does; HTTP APIs vary. A backend without seed support still serves the request but does NOT promise determinism.

### 4.6 Token counting

Token counts in `meta.tokens_in` / `meta.tokens_out`:
- `llama_cpp`: exact from the model
- HTTP backends with usage in response: exact
- Others: approximate via `tokenizers.count_tokens_approx`

---

## 5. Errors

| Condition | Wire code |
|-----------|-----------|
| Unknown model | `not_found` |
| Backend HTTP 5xx | `internal_error` |
| Backend HTTP rate limit | `rate_limited` (forwarded; `retry_after_ms` if available) |
| Empty messages array | `bad_request` |
| Context exceeded | `bad_request` (with message indicating size) |
| Generation timed out | `timeout` |
| Backend crashed mid-stream | emit `error` frame, then close |

---

## 6. Configuration

From [X04 §3](../cross-cutting/X04-config.md):

```toml
[[llm.backends]]
name  = "lmstudio"
url   = "http://192.168.188.25:1234"
model = "qwen2.5-7b-instruct"

[[llm.backends]]
name  = "llama_cpp"
url   = ""                # local path; see backend
model = "qwen2.5-1.5b-instruct-q4_k_m.gguf"

[[llm.backends]]
name        = "anthropic_api"
model       = "claude-sonnet-4-6"
api_key_env = "ANTHROPIC_API_KEY"
```

Constant: `LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS = 120`.

---

## 7. Tests

### Unit
- `test_capabilities_one_entry_per_model_per_backend`
- `test_handler_chat_emits_token_then_done`
- `test_handler_chat_cancellation_within_200ms`
- `test_params_compatible_model_must_match`
- `test_params_compatible_ctx_upper_bound`
- `test_internet_dependent_backend_deregistered_on_offline`

### Integration
- `test_lmstudio_backend_streams_real_tokens` (requires LM Studio at the configured address; skip otherwise)
- `test_three_node_llm_load_balance`
- `test_remote_call_through_bus_returns_full_response`

---

## 8. Cross-references

| What | Where |
|------|-------|
| `llm.chat@1.0` wire | [CONTRACT §4.1](../CAPABILITY_CONTRACT.md) |
| `llm.complete@1.0` wire | [CONTRACT §4.2](../CAPABILITY_CONTRACT.md) |
| Service protocol | [M03 §4](M03-bus.md) |
| Streaming format | [CONTRACT §5.3](../CAPABILITY_CONTRACT.md), [X01 §6](../cross-cutting/X01-transport.md) |
| Used by RAG | [M05 §5](M05-rag.md) |
| Emergency mode deregistration | [M09 §5](M09-emergency.md) |

---

## 9. Open questions

1. **Vision models:** Phase 2; reserved `modalities: ['text','vision']`. Request schema gains `messages[].content[].type='image_url'`.
2. **Tool calls:** Phase 2; reserved frame `tool_call_delta`. Will integrate Anthropic + OpenAI styles.
3. **Local model autodiscovery:** should the `llama_cpp` backend scan a models directory? Useful but easy to defer.
4. **Per-model preset profiles:** Phase 2: bind a `system_prompt_template` to a model. Not yet.