# M04: LLM Service
**Spec version:** v1.0
**Depends on:** M03 (bus), X04 (config), X03 (observability), backend libs (llama-cpp-python, ollama HTTP, httpx for HTTP backends)
**Depended on by:** M05 (RAG uses llm.complete internally), M08 (UI passes user queries through llm.chat)
---
## 1. Responsibility
Provide `llm.chat@1.0` and `llm.complete@1.0`. Wrap multiple inference backends (llama.cpp, Ollama, LM Studio, HF Inference API, Anthropic API, OpenAI-compatible HTTP). Register one capability instance per (backend, model, quant) tuple so the bus can see them as separate routable providers.
---
## 2. File layout
```
hearthnet/services/llm/
├── __init__.py
├── service.py            # LlmService
├── tokenizers.py         # rough token counting per family
└── backends/
    ├── __init__.py
    ├── base.py           # LlmBackend Protocol
    ├── llama_cpp.py      # llama-cpp-python in-process
    ├── ollama.py         # Ollama HTTP at http://localhost:11434
    ├── lmstudio.py       # LM Studio HTTP (OpenAI-compatible)
    ├── hf_api.py         # HuggingFace Inference API
    └── anthropic_api.py  # Anthropic Messages API
```
---
## 3. Public API
### 3.1 `backends/base.py`
```python
# hearthnet/services/llm/backends/base.py
from dataclasses import dataclass
from typing import AsyncIterator, Protocol


@dataclass(frozen=True)
class Token:
    text: str
    logprob: float | None
    stop: bool


@dataclass(frozen=True)
class ChatResult:
    text: str
    tokens_in: int
    tokens_out: int
    stop_reason: str  # "end" | "max_tokens" | "stop_sequence" | "cancelled"
    ms: int


@dataclass(frozen=True)
class BackendModel:
    """One model an LlmBackend can serve."""
    name: str                # "qwen2.5-7b-instruct"
    quant: str               # "q4_k_m", "q8_0", "fp16", "api"
    ctx_max: int             # 8192
    modalities: list[str]    # ["text"] or ["text", "vision"]
    requires_internet: bool  # API backends -> True; local -> False


class LlmBackend(Protocol):
    """Abstract backend. Implementations cover one provider."""
    name: str  # "llama_cpp" | "ollama" | ...
    models: list[BackendModel]

    async def warm(self, model: str) -> None: ...

    async def close(self) -> None: ...

    async def chat(
        self,
        *,
        model: str,
        messages: list[dict],
        max_tokens: int = 1024,
        temperature: float = 0.7,
        top_p: float = 0.95,
        stop: list[str] | None = None,
        seed: int | None = None,
        stream: bool = True,
    ) -> AsyncIterator[Token]:
        """Yields Tokens. The final Token has stop=True."""

    async def complete(
        self,
        *,
        model: str,
        prompt: str,
        max_tokens: int = 256,
        temperature: float = 0.7,
        top_p: float = 0.95,
        stop: list[str] | None = None,
        seed: int | None = None,
        stream: bool = True,
    ) -> AsyncIterator[Token]: ...

    def count_tokens(self, model: str, text: str) -> int:
        """Approximate token count; uses a per-model tokenizer if available."""

    def max_concurrent(self, model: str) -> int:
        """Backend-specific concurrency limit. Used in capability descriptor."""

    def health(self) -> dict: ...
```
### 3.2 Concrete backends
```python
# hearthnet/services/llm/backends/llama_cpp.py
from pathlib import Path

from .base import BackendModel, LlmBackend


class LlamaCppBackend(LlmBackend):
    """In-process llama-cpp-python. Loads one model at a time per instance.
    Multiple LlamaCppBackend instances may coexist if VRAM allows."""
    def __init__(self, model_path: Path, model_meta: BackendModel, gpu_layers: int = -1):
        ...


# hearthnet/services/llm/backends/ollama.py
class OllamaBackend(LlmBackend):
    """HTTP-based Ollama at http://localhost:11434 (or remote)."""
    def __init__(self, base_url: str = "http://localhost:11434", models: list[str] | None = None):
        """If models is None, discover via GET /api/tags."""


# hearthnet/services/llm/backends/lmstudio.py
class LmStudioBackend(LlmBackend):
    """OpenAI-compatible HTTP at http://host:1234.
    Used in Christof's home setup at 192.168.188.25:1234."""
    def __init__(self, base_url: str, default_model: str): ...


# hearthnet/services/llm/backends/hf_api.py
class HfApiBackend(LlmBackend):
    """HuggingFace Inference API. Requires HF_TOKEN env var (declared in config.llm.backends[].api_key_env)."""
    def __init__(self, model: str, token_env: str = "HF_TOKEN"): ...


# hearthnet/services/llm/backends/anthropic_api.py
class AnthropicApiBackend(LlmBackend):
    """Anthropic Messages API. Phase 1.5; useful when the internet is up."""
    def __init__(self, model: str = "claude-sonnet-4-6", token_env: str = "ANTHROPIC_API_KEY"): ...
```
### 3.3 `tokenizers.py`
```python
# hearthnet/services/llm/tokenizers.py
def count_tokens_approx(model_family: str, text: str) -> int:
    """Fast heuristic: chars / 3.5 for Latin scripts, /2 for CJK.
    Used when no real tokenizer is available."""


def model_family(model_name: str) -> str:
    """'qwen2.5-7b-instruct' -> 'qwen', 'llama-3-8b' -> 'llama', etc."""
```
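One possible reading of these two helpers is sketched below. The family prefix list, the CJK character ranges, the `"unknown"` fallback and rounding up are all assumptions made for illustration; the spec only fixes the divisors (3.5 and 2) and the two example mappings.

```python
import math
import re

# Hypothetical prefix list; the spec does not enumerate model families.
_KNOWN_FAMILIES = ("qwen", "llama", "mistral", "gemma", "phi", "claude")

# Assumed CJK coverage: CJK Unified Ideographs, Hiragana/Katakana, Hangul syllables.
_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def model_family(model_name: str) -> str:
    """Return the first known family prefix, or 'unknown' (assumed fallback)."""
    lowered = model_name.lower()
    for family in _KNOWN_FAMILIES:
        if lowered.startswith(family):
            return family
    return "unknown"


def count_tokens_approx(model_family: str, text: str) -> int:
    """chars / 3.5 for non-CJK text plus chars / 2 for CJK, rounded up."""
    # model_family is kept for signature parity; this sketch does not use it.
    cjk_chars = len(_CJK.findall(text))
    other_chars = len(text) - cjk_chars
    return math.ceil(other_chars / 3.5 + cjk_chars / 2)


family = model_family("qwen2.5-7b-instruct")
estimate = count_tokens_approx(family, "How many tokens is this sentence?")
```

Rounding up keeps the estimate conservative for context-limit checks, which seems the safer direction when no real tokenizer is available.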
### 3.4 `service.py`
```python
# hearthnet/services/llm/service.py
from typing import AsyncIterator, Callable

# LlmConfig, CapabilityDescriptor, ParamsPredicate and RouteRequest are
# defined elsewhere in hearthnet; their import paths are not given in this spec.


class LlmService:
    name = "llm"
    version = "1.0"

    def __init__(self, config: LlmConfig):
        self._backends: list[LlmBackend] = self._build_backends(config)

    def _build_backends(self, config: LlmConfig) -> list[LlmBackend]:
        """Instantiate each declared backend; skip backends that fail to initialise (with warning)."""

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """Emits one (descriptor, handler, predicate) per (backend, model, capability_kind) combo:
        - For each backend × each model: one llm.chat entry and one llm.complete entry.
        - Each descriptor's params include model, quant, ctx, backend."""

    async def start(self) -> None:
        """Warm one backend (the first listed) to avoid cold-start lag on first call."""

    async def stop(self) -> None: ...

    def health(self) -> dict: ...

    # --- handlers ---
    async def handle_chat(self, req: RouteRequest) -> AsyncIterator[dict]:
        """Streams SSE frames per CONTRACT §4.1.
        Picks the backend from req.body['params']['model'] (matched at routing).
        Maps backend Token -> SSE 'token' frames; emits 'done' with meta."""

    async def handle_complete(self, req: RouteRequest) -> AsyncIterator[dict]:
        """Same shape as chat but for CONTRACT §4.2."""
```
### 3.5 Capability descriptors
For each `(backend, model)` pair, the service registers:
```python
# llm.chat instance
CapabilityDescriptor(
    name="llm.chat",
    version=(1, 0),
    stability="stable",
    request_schema={...},   # CONTRACT §4.1 schema
    response_schema={...},  # for non-stream fallback
    stream_schema={
        "oneOf": [
            {"type": "object", "required": ["text"], "properties": {"text": {"type": "string"}, "logprob": {"type": ["number", "null"]}}},
            {"type": "object", "required": ["tokens_out", "stop_reason", "ms"]}  # done frame
        ]
    },
    params={
        "model": "<model.name>",
        "quant": "<model.quant>",
        "ctx": model.ctx_max,
        "backend": "<backend.name>",
        "modalities": model.modalities,
    },
    max_concurrent=backend.max_concurrent(model.name),
    trust_required="member",
    timeout_seconds=LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS,
    idempotent=False,
)
```
### 3.6 `params_compatible` predicate
```python
def params_compatible(offered: dict, requested: dict) -> bool:
    # Required match: model.
    # Optional match: ctx (caller's must be <= offered).
    if requested.get("model") != offered.get("model"):
        return False
    if "ctx" in requested and requested["ctx"] > offered["ctx"]:
        return False
    return True
```
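The predicate's behaviour can be checked against a few offered/requested pairs. The function body is restated from section 3.6 so the snippet runs on its own; the dictionaries are illustrative values, not real registrations.

```python
def params_compatible(offered: dict, requested: dict) -> bool:
    # Restated from section 3.6.
    if requested.get("model") != offered.get("model"):
        return False
    if "ctx" in requested and requested["ctx"] > offered["ctx"]:
        return False
    return True


offered = {
    "model": "qwen2.5-7b-instruct",
    "quant": "q4_k_m",
    "ctx": 8192,
    "backend": "llama_cpp",
}

results = {
    "same model, no ctx": params_compatible(offered, {"model": "qwen2.5-7b-instruct"}),
    "same model, ctx fits": params_compatible(offered, {"model": "qwen2.5-7b-instruct", "ctx": 4096}),
    "same model, ctx too large": params_compatible(offered, {"model": "qwen2.5-7b-instruct", "ctx": 32768}),
    "different model": params_compatible(offered, {"model": "llama-3-8b"}),
    "no model requested": params_compatible(offered, {"ctx": 1024}),
}
```

Note that a request without a `model` key never matches, because `None` is compared against the offered model name. Other offered params such as `quant` and `backend` are not consulted.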
---
## 4. Behaviour
### 4.1 Multi-backend selection
Multiple backends may serve the same model name (e.g. llama_cpp local + LM Studio remote both offer `qwen2.5-7b-instruct`). They register as separate capability entries. The bus router picks among them by latency/load; the service itself has no internal preference logic.
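As a rough sketch of what "separate capability entries" means, the snippet below expands a backend-to-models mapping into one `llm.chat` and one `llm.complete` entry per pair. `Entry` and `build_entries` are hypothetical stand-ins for the real descriptor tuples returned by `LlmService.capabilities()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:  # hypothetical stand-in for a registered capability entry
    capability: str
    backend: str
    model: str


def build_entries(backends: dict[str, list[str]]) -> list[Entry]:
    """One llm.chat and one llm.complete entry per (backend, model) pair."""
    return [
        Entry(capability, backend, model)
        for backend, models in backends.items()
        for model in models
        for capability in ("llm.chat", "llm.complete")
    ]


entries = build_entries({
    "llama_cpp": ["qwen2.5-7b-instruct"],
    "lmstudio": ["qwen2.5-7b-instruct"],
})

# Both backends show up as independent providers of the same model.
chat_providers = [
    e.backend for e in entries
    if e.capability == "llm.chat" and e.model == "qwen2.5-7b-instruct"
]
```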
### 4.2 Streaming and cancellation
- `handle_chat` is an async generator
- Each backend `Token` becomes one SSE `token` frame
- On client disconnect, the generator is cancelled; the backend's `chat()` async iterator receives `GeneratorExit` and propagates the cancellation to the underlying library (llama.cpp: set the abort flag; HTTP backends: close the connection)
- Cleanup must complete within 200 ms
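The cancellation path above can be sketched with plain `asyncio`. `fake_backend_stream` and `handle_chat_sketch` are hypothetical; the `break` stands in for a client disconnect, and the 0.2 s timeout mirrors the 200 ms cleanup budget. Closing an async generator with `aclose()` raises `GeneratorExit` at its suspended `yield`, so the `finally` block is where a real backend would set the abort flag or close its HTTP connection.

```python
import asyncio
from typing import AsyncIterator

cleanup_log: list[str] = []


async def fake_backend_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for a backend chat() iterator."""
    try:
        for i in range(1000):
            await asyncio.sleep(0)
            yield f"tok{i}"
    finally:
        # Runs on exhaustion and on aclose(); release backend resources here.
        cleanup_log.append("backend closed")


async def handle_chat_sketch(max_frames: int) -> list[dict]:
    frames: list[dict] = []
    stream = fake_backend_stream()
    try:
        async for text in stream:
            frames.append({"text": text})
            if len(frames) == max_frames:  # simulate the client going away
                break
    finally:
        # Enforce the cleanup budget from the list above.
        await asyncio.wait_for(stream.aclose(), timeout=0.2)
    return frames


frames = asyncio.run(handle_chat_sketch(3))
```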
### 4.3 Internet-dependent backends
`HfApiBackend` and `AnthropicApiBackend` set `requires_internet=True` on their `BackendModel`. The service still registers them, but the [M09](M09-emergency.md) detector triggers deregistration from the local bus when offline. On restore, they are re-registered.
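The effect of the `requires_internet` flag can be illustrated with a small filter. This is only a sketch: `registrable` is a hypothetical helper, `BackendModel` is trimmed to two fields, and the actual deregistration mechanism belongs to M09 and the bus, neither of which is specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendModel:  # trimmed restatement of the dataclass in base.py
    name: str
    requires_internet: bool


def registrable(models: list[BackendModel], online: bool) -> list[BackendModel]:
    """Models that should stay registered for the given connectivity state."""
    return [m for m in models if online or not m.requires_internet]


models = [
    BackendModel("qwen2.5-7b-instruct", requires_internet=False),
    BackendModel("claude-sonnet-4-6", requires_internet=True),
]

offline_names = [m.name for m in registrable(models, online=False)]
online_names = [m.name for m in registrable(models, online=True)]
```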
### 4.4 Tool calls (Phase 2)
The `tool_call_delta` stream frame in [CONTRACT §4.1](../CAPABILITY_CONTRACT.md) is reserved. Backends that support tool calls (Anthropic, OpenAI, OpenAI-compatible) will emit these frames in a future version. In the MVP the frame is ignored or left empty.
### 4.5 Deterministic mode
If `seed` is present in the request, backends that support seeded sampling apply it. `llama_cpp` does; support among HTTP APIs varies. When seeding is unsupported, the backend still serves the request but does NOT promise determinism.
### 4.6 Token counting
Token counts in `meta.tokens_in` / `meta.tokens_out`:
- `llama_cpp`: exact from the model
- HTTP backends with usage in response: exact
- Others: approximate via `tokenizers.count_tokens_approx`
---
## 5. Errors
| Condition | Wire code |
|-----------|-----------|
| Unknown model | `not_found` |
| Backend HTTP 5xx | `internal_error` |
| Backend HTTP rate limit | `rate_limited` (forwarded; `retry_after_ms` if available) |
| Empty messages array | `bad_request` |
| Context exceeded | `bad_request` (with message indicating size) |
| Generation timed out | `timeout` |
| Backend crashed mid-stream | emit `error` frame, then close |
---
## 6. Configuration
From [X04 §3](../cross-cutting/X04-config.md):
```toml
[[llm.backends]]
name = "lmstudio"
url = "http://192.168.188.25:1234"
model = "qwen2.5-7b-instruct"
[[llm.backends]]
name = "llama_cpp"
url = "" # local path; see backend
model = "qwen2.5-1.5b-instruct-q4_k_m.gguf"
[[llm.backends]]
name = "anthropic_api"
model = "claude-sonnet-4-6"
api_key_env = "ANTHROPIC_API_KEY"
```
Constant: `LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS = 120`.
---
## 7. Tests
### Unit
- `test_capabilities_one_entry_per_model_per_backend`
- `test_handler_chat_emits_token_then_done`
- `test_handler_chat_cancellation_within_200ms`
- `test_params_compatible_model_must_match`
- `test_params_compatible_ctx_upper_bound`
- `test_internet_dependent_backend_deregistered_on_offline`
### Integration
- `test_lmstudio_backend_streams_real_tokens` (requires LM Studio at the configured address; skip otherwise)
- `test_three_node_llm_load_balance`
- `test_remote_call_through_bus_returns_full_response`
---
## 8. Cross-references
| What | Where |
|------|-------|
| `llm.chat@1.0` wire | [CONTRACT §4.1](../CAPABILITY_CONTRACT.md) |
| `llm.complete@1.0` wire | [CONTRACT §4.2](../CAPABILITY_CONTRACT.md) |
| Service protocol | [M03 §4](M03-bus.md) |
| Streaming format | [CONTRACT §5.3](../CAPABILITY_CONTRACT.md), [X01 §6](../cross-cutting/X01-transport.md) |
| Used by RAG | [M05 §5](M05-rag.md) |
| Emergency mode deregistration | [M09 §5](M09-emergency.md) |
---
## 9. Open questions
1. **Vision models**: Phase 2; reserved `modalities: ['text','vision']`. Request schema gains `messages[].content[].type='image_url'`.
2. **Tool calls**: Phase 2; reserved frame `tool_call_delta`. Will integrate Anthropic + OpenAI styles.
3. **Local model autodiscovery**: should the `llama_cpp` backend scan a models directory? Useful but easy to defer.
4. **Per-model preset profiles**: Phase 2: bind a `system_prompt_template` to a model. Not yet.