Spaces:
Running on Zero
Running on Zero
| # M04 — LLM Service | |
| **Spec version:** v1.0 | |
| **Depends on:** M03 (bus), X04 (config), X03 (observability), backend libs (llama-cpp-python, ollama HTTP, httpx for HTTP backends) | |
| **Depended on by:** M05 (RAG uses llm.complete internally), M08 (UI passes user queries through llm.chat) | |
| --- | |
| ## 1. Responsibility | |
| Provide `llm.chat@1.0` and `llm.complete@1.0`. Wrap multiple inference backends (llama.cpp, Ollama, LM Studio, HF Inference API, Anthropic API, OpenAI-compatible HTTP). Register one capability instance per (backend, model, quant) tuple so the bus can see them as separate routable providers. | |
| --- | |
| ## 2. File layout | |
| ``` | |
| hearthnet/services/llm/ | |
| ├── __init__.py | |
| ├── service.py # LlmService | |
| ├── tokenizers.py # rough token counting per family | |
| └── backends/ | |
| ├── __init__.py | |
| ├── base.py # LlmBackend Protocol | |
| ├── llama_cpp.py # llama-cpp-python in-process | |
| ├── ollama.py # Ollama HTTP at http://localhost:11434 | |
| ├── lmstudio.py # LM Studio HTTP (OpenAI-compatible) | |
| ├── hf_api.py # HuggingFace Inference API | |
| └── anthropic_api.py # Anthropic Messages API | |
| ``` | |
| --- | |
| ## 3. Public API | |
| ### 3.1 `backends/base.py` | |
| ```python | |
| # hearthnet/services/llm/backends/base.py | |
| from dataclasses import dataclass | |
| from typing import AsyncIterator, Protocol | |
| @dataclass(frozen=True) | |
| class Token: | |
| text: str | |
| logprob: float | None | |
| stop: bool | |
| @dataclass(frozen=True) | |
| class ChatResult: | |
| text: str | |
| tokens_in: int | |
| tokens_out: int | |
| stop_reason: str # "end" | "max_tokens" | "stop_sequence" | "cancelled" | |
| ms: int | |
| @dataclass(frozen=True) | |
| class BackendModel: | |
| """One model an LlmBackend can serve.""" | |
| name: str # "qwen2.5-7b-instruct" | |
| quant: str # "q4_k_m", "q8_0", "fp16", "api" | |
| ctx_max: int # 8192 | |
| modalities: list[str] # ["text"] or ["text", "vision"] | |
| requires_internet: bool # API backends → True; local → False | |
| class LlmBackend(Protocol): | |
| """Abstract backend. Implementations cover one provider.""" | |
| name: str # "llama_cpp" | "ollama" | ... | |
| models: list[BackendModel] | |
| async def warm(self, model: str) -> None: ... | |
| async def close(self) -> None: ... | |
| async def chat( | |
| self, | |
| *, | |
| model: str, | |
| messages: list[dict], | |
| max_tokens: int = 1024, | |
| temperature: float = 0.7, | |
| top_p: float = 0.95, | |
| stop: list[str] | None = None, | |
| seed: int | None = None, | |
| stream: bool = True, | |
| ) -> AsyncIterator[Token]: | |
| """Yields Tokens. The final Token has stop=True.""" | |
| async def complete( | |
| self, | |
| *, | |
| model: str, | |
| prompt: str, | |
| max_tokens: int = 256, | |
| temperature: float = 0.7, | |
| top_p: float = 0.95, | |
| stop: list[str] | None = None, | |
| seed: int | None = None, | |
| stream: bool = True, | |
| ) -> AsyncIterator[Token]: ... | |
| def count_tokens(self, model: str, text: str) -> int: | |
| """Approximate token count; uses a per-model tokenizer if available.""" | |
| def max_concurrent(self, model: str) -> int: | |
| """Backend-specific concurrency limit. Used in capability descriptor.""" | |
| def health(self) -> dict: ... | |
| ``` | |
| ### 3.2 Concrete backends | |
| ```python | |
| # hearthnet/services/llm/backends/llama_cpp.py | |
| class LlamaCppBackend(LlmBackend): | |
| """In-process llama-cpp-python. Loads one model at a time per instance. | |
| Multiple LlamaCppBackend instances may coexist if VRAM allows.""" | |
| def __init__(self, model_path: Path, model_meta: BackendModel, gpu_layers: int = -1): | |
| ... | |
| # hearthnet/services/llm/backends/ollama.py | |
| class OllamaBackend(LlmBackend): | |
| """HTTP-based Ollama at http://localhost:11434 (or remote).""" | |
| def __init__(self, base_url: str = "http://localhost:11434", models: list[str] | None = None): | |
| """If models is None, discover via GET /api/tags.""" | |
| # hearthnet/services/llm/backends/lmstudio.py | |
| class LmStudioBackend(LlmBackend): | |
| """OpenAI-compatible HTTP at http://host:1234. | |
| Used in Christof's home setup at 192.168.188.25:1234.""" | |
| def __init__(self, base_url: str, default_model: str): ... | |
| # hearthnet/services/llm/backends/hf_api.py | |
| class HfApiBackend(LlmBackend): | |
| """HuggingFace Inference API. Requires HF_TOKEN env var (declared in config.llm.backends[].api_key_env).""" | |
| def __init__(self, model: str, token_env: str = "HF_TOKEN"): ... | |
| # hearthnet/services/llm/backends/anthropic_api.py | |
| class AnthropicApiBackend(LlmBackend): | |
| """Anthropic Messages API. Phase 1.5; useful when internet up.""" | |
| def __init__(self, model: str = "claude-sonnet-4-6", token_env: str = "ANTHROPIC_API_KEY"): ... | |
| ``` | |
| ### 3.3 `tokenizers.py` | |
| ```python | |
| # hearthnet/services/llm/tokenizers.py | |
| def count_tokens_approx(model_family: str, text: str) -> int: | |
| """Fast heuristic: chars / 3.5 for Latin scripts, /2 for CJK. | |
| Used when no real tokenizer is available.""" | |
| def model_family(model_name: str) -> str: | |
| """'qwen2.5-7b-instruct' → 'qwen', 'llama-3-8b' → 'llama', etc.""" | |
| ``` | |
| ### 3.4 `service.py` | |
| ```python | |
| # hearthnet/services/llm/service.py | |
| class LlmService: | |
| name = "llm" | |
| version = "1.0" | |
| def __init__(self, config: LlmConfig): | |
| self._backends: list[LlmBackend] = self._build_backends(config) | |
| def _build_backends(self, config: LlmConfig) -> list[LlmBackend]: | |
| """Instantiate each declared backend; skip backends that fail to initialise (with warning).""" | |
| def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]: | |
| """Emits one (descriptor, handler, predicate) per (backend, model, capability_kind) combo: | |
| - For each backend × each model: one llm.chat entry and one llm.complete entry. | |
| - Each descriptor's params include model, quant, ctx, backend.""" | |
| async def start(self) -> None: | |
| """Warm one backend (the first listed) to avoid cold-start lag on first call.""" | |
| async def stop(self) -> None: ... | |
| def health(self) -> dict: ... | |
| # --- handlers --- | |
| async def handle_chat(self, req: RouteRequest) -> AsyncIterator[dict]: | |
| """Streams SSE frames per CONTRACT §4.1. | |
| Picks the backend from req.body['params']['model'] (matched at routing). | |
| Maps backend Token → SSE 'token' frames; emits 'done' with meta.""" | |
| async def handle_complete(self, req: RouteRequest) -> AsyncIterator[dict]: | |
| """Same shape as chat but for CONTRACT §4.2.""" | |
| ``` | |
| ### 3.5 Capability descriptors | |
| For each `(backend, model)` pair, the service registers: | |
| ```python | |
| # llm.chat instance | |
| CapabilityDescriptor( | |
| name="llm.chat", | |
| version=(1, 0), | |
| stability="stable", | |
| request_schema={...}, # CONTRACT §4.1 schema | |
| response_schema={...}, # for non-stream fallback | |
| stream_schema={ | |
| "oneOf": [ | |
| {"type": "object", "required": ["text"], "properties": {"text": {"type": "string"}, "logprob": {"type": ["number", "null"]}}}, | |
| {"type": "object", "required": ["tokens_out", "stop_reason", "ms"]} # done frame | |
| ] | |
| }, | |
| params={ | |
| "model": "<model.name>", | |
| "quant": "<model.quant>", | |
| "ctx": model.ctx_max, | |
| "backend": "<backend.name>", | |
| "modalities": model.modalities, | |
| }, | |
| max_concurrent=backend.max_concurrent(model.name), | |
| trust_required="member", | |
| timeout_seconds=LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS, | |
| idempotent=False, | |
| ) | |
| ``` | |
| ### 3.6 `params_compatible` predicate | |
| ```python | |
| def params_compatible(offered: dict, requested: dict) -> bool: | |
| # Required match: model. | |
| # Optional match: ctx (caller's must be <= offered). | |
| if requested.get("model") != offered.get("model"): | |
| return False | |
| if "ctx" in requested and requested["ctx"] > offered["ctx"]: | |
| return False | |
| return True | |
| ``` | |
| --- | |
| ## 4. Behaviour | |
| ### 4.1 Multi-backend selection | |
| Multiple backends may serve the same model name (e.g. llama_cpp local + LM Studio remote both offer `qwen2.5-7b-instruct`). They register as separate capability entries. The bus router picks among them by latency/load — no service-internal preference logic. | |
| ### 4.2 Streaming and cancellation | |
| - `handle_chat` is an async generator | |
| - Each backend `Token` becomes one SSE `token` frame | |
| - On client disconnect, the generator is cancelled; the backend's `chat()` async iterator receives `GeneratorExit`, propagates cancellation to the underlying library (llama.cpp: set abort flag; HTTP backends: close connection) | |
| - Cleanup must complete within 200 ms | |
| ### 4.3 Internet-dependent backends | |
| `HfApiBackend` and `AnthropicApiBackend` set `requires_internet=True` on their `BackendModel`. The service still registers them, but the [M09](M09-emergency.md) detector triggers deregistration from the local bus when offline. On restore, they are re-registered. | |
| ### 4.4 Tool calls (Phase 2) | |
| The `tool_call_delta` stream frame in [CONTRACT §4.1](../CAPABILITY_CONTRACT.md) is reserved. Backends that support tool calls (Anthropic, OpenAI, OpenAI-compatible) will emit these in a future version. MVP: ignored / empty. | |
| ### 4.5 Deterministic mode | |
| If `seed` is present in request, backends that support seeded sampling apply it. `llama_cpp` does; HTTP APIs vary. When unsupported, backend still serves but does NOT promise determinism. | |
| ### 4.6 Token counting | |
| Token counts in `meta.tokens_in` / `meta.tokens_out`: | |
| - `llama_cpp`: exact from the model | |
| - HTTP backends with usage in response: exact | |
| - Others: approximate via `tokenizers.count_tokens_approx` | |
| --- | |
| ## 5. Errors | |
| | Condition | Wire code | | |
| |-----------|-----------| | |
| | Unknown model | `not_found` | | |
| | Backend HTTP 5xx | `internal_error` | | |
| | Backend HTTP rate limit | `rate_limited` (forwarded; `retry_after_ms` if available) | | |
| | Empty messages array | `bad_request` | | |
| | Context exceeded | `bad_request` (with message indicating size) | | |
| | Generation timed out | `timeout` | | |
| | Backend crashed mid-stream | emit `error` frame, then close | | |
| --- | |
| ## 6. Configuration | |
| From [X04 §3](../cross-cutting/X04-config.md): | |
| ```toml | |
| [[llm.backends]] | |
| name = "lmstudio" | |
| url = "http://192.168.188.25:1234" | |
| model = "qwen2.5-7b-instruct" | |
| [[llm.backends]] | |
| name = "llama_cpp" | |
| url = "" # local path; see backend | |
| model = "qwen2.5-1.5b-instruct-q4_k_m.gguf" | |
| [[llm.backends]] | |
| name = "anthropic_api" | |
| model = "claude-sonnet-4-6" | |
| api_key_env = "ANTHROPIC_API_KEY" | |
| ``` | |
| Constant: `LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS = 120`. | |
| --- | |
| ## 7. Tests | |
| ### Unit | |
| - `test_capabilities_one_entry_per_model_per_backend` | |
| - `test_handler_chat_emits_token_then_done` | |
| - `test_handler_chat_cancellation_within_200ms` | |
| - `test_params_compatible_model_must_match` | |
| - `test_params_compatible_ctx_upper_bound` | |
| - `test_internet_dependent_backend_deregistered_on_offline` | |
| ### Integration | |
| - `test_lmstudio_backend_streams_real_tokens` (requires LM Studio at the configured address; skip otherwise) | |
| - `test_three_node_llm_load_balance` | |
| - `test_remote_call_through_bus_returns_full_response` | |
| --- | |
| ## 8. Cross-references | |
| | What | Where | | |
| |------|-------| | |
| | `llm.chat@1.0` wire | [CONTRACT §4.1](../CAPABILITY_CONTRACT.md) | | |
| | `llm.complete@1.0` wire | [CONTRACT §4.2](../CAPABILITY_CONTRACT.md) | | |
| | Service protocol | [M03 §4](M03-bus.md) | | |
| | Streaming format | [CONTRACT §5.3](../CAPABILITY_CONTRACT.md), [X01 §6](../cross-cutting/X01-transport.md) | | |
| | Used by RAG | [M05 §5](M05-rag.md) | | |
| | Emergency mode deregistration | [M09 §5](M09-emergency.md) | | |
| --- | |
| ## 9. Open questions | |
| 1. **Vision models** — Phase 2; reserved `modalities: ['text','vision']`. Request schema gains `messages[].content[].type='image_url'`. | |
| 2. **Tool calls** — Phase 2; reserved frame `tool_call_delta`. Will integrate Anthropic + OpenAI styles. | |
| 3. **Local model autodiscovery** — should `llama_cpp` backend scan a models directory? Useful but easy to defer. | |
| 4. **Per-model preset profiles** — Phase 2: bind a `system_prompt_template` to a model. Not yet. | |