M04: LLM Service
Spec version: v1.0
Depends on: M03 (bus), X04 (config), X03 (observability), backend libs (llama-cpp-python, ollama HTTP, httpx for HTTP backends)
Depended on by: M05 (RAG uses llm.complete internally), M08 (UI passes user queries through llm.chat)
1. Responsibility
Provide llm.chat@1.0 and llm.complete@1.0. Wrap multiple inference backends (llama.cpp, Ollama, LM Studio, HF Inference API, Anthropic API, OpenAI-compatible HTTP). Register one capability instance per (backend, model, quant) tuple so the bus can see them as separate routable providers.
2. File layout
hearthnet/services/llm/
├── __init__.py
├── service.py              # LlmService
├── tokenizers.py           # rough token counting per family
└── backends/
    ├── __init__.py
    ├── base.py             # LlmBackend Protocol
    ├── llama_cpp.py        # llama-cpp-python in-process
    ├── ollama.py           # Ollama HTTP at http://localhost:11434
    ├── lmstudio.py         # LM Studio HTTP (OpenAI-compatible)
    ├── hf_api.py           # HuggingFace Inference API
    └── anthropic_api.py    # Anthropic Messages API
3. Public API
3.1 backends/base.py
# hearthnet/services/llm/backends/base.py
from dataclasses import dataclass
from typing import AsyncIterator, Protocol


@dataclass(frozen=True)
class Token:
    text: str
    logprob: float | None
    stop: bool


@dataclass(frozen=True)
class ChatResult:
    text: str
    tokens_in: int
    tokens_out: int
    stop_reason: str  # "end" | "max_tokens" | "stop_sequence" | "cancelled"
    ms: int


@dataclass(frozen=True)
class BackendModel:
    """One model an LlmBackend can serve."""
    name: str                # "qwen2.5-7b-instruct"
    quant: str               # "q4_k_m", "q8_0", "fp16", "api"
    ctx_max: int             # 8192
    modalities: list[str]    # ["text"] or ["text", "vision"]
    requires_internet: bool  # API backends → True; local → False


class LlmBackend(Protocol):
    """Abstract backend. Implementations cover one provider."""
    name: str  # "llama_cpp" | "ollama" | ...
    models: list[BackendModel]

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

    async def chat(
        self,
        *,
        model: str,
        messages: list[dict],
        max_tokens: int = 1024,
        temperature: float = 0.7,
        top_p: float = 0.95,
        stop: list[str] | None = None,
        seed: int | None = None,
        stream: bool = True,
    ) -> AsyncIterator[Token]:
        """Yields Tokens. The final Token has stop=True."""

    async def complete(
        self,
        *,
        model: str,
        prompt: str,
        max_tokens: int = 256,
        temperature: float = 0.7,
        top_p: float = 0.95,
        stop: list[str] | None = None,
        seed: int | None = None,
        stream: bool = True,
    ) -> AsyncIterator[Token]: ...

    def count_tokens(self, model: str, text: str) -> int:
        """Approximate token count; uses a per-model tokenizer if available."""

    def max_concurrent(self, model: str) -> int:
        """Backend-specific concurrency limit. Used in capability descriptor."""

    def health(self) -> dict: ...
3.2 Concrete backends
# hearthnet/services/llm/backends/llama_cpp.py
from pathlib import Path

from .base import BackendModel, LlmBackend


class LlamaCppBackend(LlmBackend):
    """In-process llama-cpp-python. Loads one model at a time per instance.
    Multiple LlamaCppBackend instances may coexist if VRAM allows."""

    def __init__(self, model_path: Path, model_meta: BackendModel, gpu_layers: int = -1):
        ...


# hearthnet/services/llm/backends/ollama.py
class OllamaBackend(LlmBackend):
    """HTTP-based Ollama at http://localhost:11434 (or remote)."""

    def __init__(self, base_url: str = "http://localhost:11434", models: list[str] | None = None):
        """If models is None, discover via GET /api/tags."""


# hearthnet/services/llm/backends/lmstudio.py
class LmStudioBackend(LlmBackend):
    """OpenAI-compatible HTTP at http://host:1234.
    Used in Christof's home setup at 192.168.188.25:1234."""

    def __init__(self, base_url: str, default_model: str): ...


# hearthnet/services/llm/backends/hf_api.py
class HfApiBackend(LlmBackend):
    """HuggingFace Inference API. Requires HF_TOKEN env var (declared in config.llm.backends[].api_key_env)."""

    def __init__(self, model: str, token_env: str = "HF_TOKEN"): ...


# hearthnet/services/llm/backends/anthropic_api.py
class AnthropicApiBackend(LlmBackend):
    """Anthropic Messages API. Phase 1.5; useful when internet up."""

    def __init__(self, model: str = "claude-sonnet-4-6", token_env: str = "ANTHROPIC_API_KEY"): ...
3.3 tokenizers.py
# hearthnet/services/llm/tokenizers.py
def count_tokens_approx(model_family: str, text: str) -> int:
    """Fast heuristic: chars / 3.5 for Latin scripts, /2 for CJK.
    Used when no real tokenizer is available."""


def model_family(model_name: str) -> str:
    """'qwen2.5-7b-instruct' → 'qwen', 'llama-3-8b' → 'llama', etc."""
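The two docstrings above describe simple heuristics, and the following is one possible sketch of them. Several details are assumptions and not stated in the spec: which Unicode ranges count as CJK, rounding up with math.ceil, ignoring model_family inside the counter, and deriving the family from the leading run of letters in the model name.

```python
import math
import re


def _is_cjk(ch: str) -> bool:
    # Assumption: these three ranges are "CJK enough" for a rough estimate.
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def count_tokens_approx(model_family: str, text: str) -> int:
    # model_family is accepted for signature parity but unused in this sketch.
    cjk = sum(1 for ch in text if _is_cjk(ch))
    other = len(text) - cjk
    # chars / 3.5 for non-CJK text, chars / 2 for CJK, rounded up.
    return math.ceil(other / 3.5 + cjk / 2)


def model_family(model_name: str) -> str:
    # Assumption: the family is the leading alphabetic run of the name.
    match = re.match(r"[a-z]+", model_name.lower())
    return match.group(0) if match else "unknown"
```

Rounding up errs on the side of overestimating, which seems the safer direction when the count is used to guard a context limit.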
3.4 service.py
# hearthnet/services/llm/service.py
class LlmService:
    name = "llm"
    version = "1.0"

    def __init__(self, config: LlmConfig):
        self._backends: list[LlmBackend] = self._build_backends(config)

    def _build_backends(self, config: LlmConfig) -> list[LlmBackend]:
        """Instantiate each declared backend; skip backends that fail to initialise (with warning)."""

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """Emits one (descriptor, handler, predicate) per (backend, model, capability_kind) combo:
        - For each backend × each model: one llm.chat entry and one llm.complete entry.
        - Each descriptor's params include model, quant, ctx, backend."""

    async def start(self) -> None:
        """Warm one backend (the first listed) to avoid cold-start lag on first call."""

    async def stop(self) -> None: ...

    def health(self) -> dict: ...

    # --- handlers ---
    async def handle_chat(self, req: RouteRequest) -> AsyncIterator[dict]:
        """Streams SSE frames per CONTRACT §4.1.
        Picks the backend from req.body['params']['model'] (matched at routing).
        Maps backend Token → SSE 'token' frames; emits 'done' with meta."""

    async def handle_complete(self, req: RouteRequest) -> AsyncIterator[dict]:
        """Same shape as chat but for CONTRACT §4.2."""
3.5 Capability descriptors
For each (backend, model) pair, the service registers:
# llm.chat instance
CapabilityDescriptor(
    name="llm.chat",
    version=(1, 0),
    stability="stable",
    request_schema={...},   # CONTRACT §4.1 schema
    response_schema={...},  # for non-stream fallback
    stream_schema={
        "oneOf": [
            {"type": "object", "required": ["text"], "properties": {"text": {"type": "string"}, "logprob": {"type": ["number", "null"]}}},
            {"type": "object", "required": ["tokens_out", "stop_reason", "ms"]}  # done frame
        ]
    },
    params={
        "model": "<model.name>",
        "quant": "<model.quant>",
        "ctx": model.ctx_max,
        "backend": "<backend.name>",
        "modalities": model.modalities,
    },
    max_concurrent=backend.max_concurrent(model.name),
    trust_required="member",
    timeout_seconds=LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS,
    idempotent=False,
)
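The registration rule (one llm.chat and one llm.complete entry per backend and model) can be illustrated with a small enumeration sketch. Model, Backend and enumerate_capabilities are hypothetical stand-ins for BackendModel, LlmBackend and LlmService.capabilities(), and plain dicts replace CapabilityDescriptor so the block is self-contained.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Model:  # hypothetical stand-in for BackendModel
    name: str
    quant: str
    ctx_max: int


@dataclass(frozen=True)
class Backend:  # hypothetical stand-in for an LlmBackend
    name: str
    models: tuple[Model, ...]


def enumerate_capabilities(backends: list[Backend]) -> list[dict]:
    """Return one entry per (backend, model, capability kind)."""
    entries = []
    for backend in backends:
        for model, kind in product(backend.models, ("llm.chat", "llm.complete")):
            entries.append({
                "name": kind,
                "params": {
                    "model": model.name,
                    "quant": model.quant,
                    "ctx": model.ctx_max,
                    "backend": backend.name,
                },
            })
    return entries


qwen = Model("qwen2.5-7b-instruct", "q4_k_m", 8192)
backends = [Backend("llama_cpp", (qwen,)), Backend("lmstudio", (qwen,))]
entries = enumerate_capabilities(backends)
```

Two backends offering the same model yield four entries here, which is what lets the bus treat them as separate routable providers.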
3.6 params_compatible predicate
def params_compatible(offered: dict, requested: dict) -> bool:
    # Required match: model.
    # Optional match: ctx (caller's must be <= offered).
    if requested.get("model") != offered.get("model"):
        return False
    if "ctx" in requested and requested["ctx"] > offered["ctx"]:
        return False
    return True
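As a usage illustration, the predicate can be applied to a handful of offers to see which providers remain routable for a given request. The offers list, the 32768 context figure and the matching_backends helper are made up for this example; the predicate itself is copied from above.

```python
def params_compatible(offered: dict, requested: dict) -> bool:
    # Required match: model. Optional match: ctx (caller's must be <= offered).
    if requested.get("model") != offered.get("model"):
        return False
    if "ctx" in requested and requested["ctx"] > offered["ctx"]:
        return False
    return True


# Illustrative offers only; the context sizes are not taken from the spec.
offers = [
    {"model": "qwen2.5-7b-instruct", "ctx": 8192, "backend": "llama_cpp"},
    {"model": "qwen2.5-7b-instruct", "ctx": 32768, "backend": "lmstudio"},
    {"model": "qwen2.5-1.5b-instruct", "ctx": 8192, "backend": "llama_cpp"},
]


def matching_backends(requested: dict) -> list[str]:
    """Hypothetical helper: names of backends whose offer satisfies the request."""
    return [o["backend"] for o in offers if params_compatible(o, requested)]


any_ctx = matching_backends({"model": "qwen2.5-7b-instruct"})
large_ctx = matching_backends({"model": "qwen2.5-7b-instruct", "ctx": 16000})
unknown = matching_backends({"model": "no-such-model"})
```

A request without ctx matches every provider of the model, while a request for 16000 tokens of context leaves only the provider that offers at least that much.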
4. Behaviour
4.1 Multi-backend selection
Multiple backends may serve the same model name (e.g. llama_cpp local + LM Studio remote both offer qwen2.5-7b-instruct). They register as separate capability entries. The bus router picks among them by latency/load; there is no service-internal preference logic.
4.2 Streaming and cancellation
- handle_chat is an async generator
- Each backend Token becomes one SSE token frame
- On client disconnect, the generator is cancelled; the backend's chat() async iterator receives GeneratorExit and propagates cancellation to the underlying library (llama.cpp: set abort flag; HTTP backends: close connection)
- Cleanup must complete within 200 ms
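The cancellation path in the list above can be sketched with two nested async generators. Everything here is illustrative: fake_backend_chat stands in for a backend's chat(), the frame dicts are a guessed shape and not the CONTRACT wire format, and the disconnect is simulated by calling aclose() on the handler.

```python
import asyncio
import time
from typing import AsyncIterator


async def fake_backend_chat(state: dict) -> AsyncIterator[dict]:
    """Hypothetical backend stream that records when its cleanup runs."""
    try:
        for word in ("one", "two", "three"):
            await asyncio.sleep(0)
            yield {"text": word}
    finally:
        # Stands in for "set abort flag" or "close connection".
        state["cleaned_up"] = True


async def handle_chat(state: dict) -> AsyncIterator[dict]:
    tokens_out = 0
    stream = fake_backend_chat(state)
    try:
        async for tok in stream:
            tokens_out += 1
            yield {"event": "token", "data": tok}
        yield {"event": "done", "data": {"tokens_out": tokens_out, "stop_reason": "end"}}
    finally:
        # Close the backend iterator explicitly so its cleanup runs right away
        # instead of waiting for garbage collection.
        await stream.aclose()


async def client_disconnects_after_first_frame() -> tuple[list[dict], dict, float]:
    state = {"cleaned_up": False}
    gen = handle_chat(state)
    frames = [await gen.__anext__()]   # client reads one frame...
    started = time.monotonic()
    await gen.aclose()                 # ...then goes away
    return frames, state, time.monotonic() - started


frames, state, cleanup_seconds = asyncio.run(client_disconnects_after_first_frame())
```

Closing the handler raises GeneratorExit at its pending yield; the finally block then closes the backend stream, whose own finally block performs the abort. Measuring the elapsed time around aclose() is one way a test could check the 200 ms budget.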
4.3 Internet-dependent backends
HfApiBackend and AnthropicApiBackend set requires_internet=True on their BackendModel. The service still registers them, but the M09 detector triggers deregistration from the local bus when offline. On restore, they are re-registered.
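A rough sketch of the park-and-restore behaviour follows. LocalBus, Entry and on_connectivity_change are hypothetical names; the real registration API belongs to M03 and the offline signal to M09, and neither is specified here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:  # hypothetical capability entry
    backend: str
    model: str
    requires_internet: bool


@dataclass
class LocalBus:  # hypothetical stand-in for the local bus registry
    active: set[Entry] = field(default_factory=set)
    parked: set[Entry] = field(default_factory=set)

    def register(self, entry: Entry) -> None:
        self.active.add(entry)

    def on_connectivity_change(self, online: bool) -> None:
        if online:
            # Re-register everything that was parked while offline.
            self.active |= self.parked
            self.parked.clear()
        else:
            # Deregister only the entries that need the internet.
            needs_net = {e for e in self.active if e.requires_internet}
            self.active -= needs_net
            self.parked |= needs_net


bus = LocalBus()
local = Entry("llama_cpp", "qwen2.5-1.5b-instruct", requires_internet=False)
remote = Entry("anthropic_api", "claude-sonnet-4-6", requires_internet=True)
bus.register(local)
bus.register(remote)

bus.on_connectivity_change(online=False)
offline_active = set(bus.active)
bus.on_connectivity_change(online=True)
online_active = set(bus.active)
```

Keeping the parked entries around, instead of rebuilding the backends, means restoration does not need to repeat backend initialisation.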
4.4 Tool calls (Phase 2)
The tool_call_delta stream frame in CONTRACT §4.1 is reserved. Backends that support tool calls (Anthropic, OpenAI, OpenAI-compatible) will emit these in a future version. MVP: ignored / empty.
4.5 Deterministic mode
If seed is present in the request, backends that support seeded sampling apply it. llama_cpp does; HTTP APIs vary. When seeding is unsupported, the backend still serves the request but does NOT promise determinism.
4.6 Token counting
Token counts in meta.tokens_in / meta.tokens_out:
- llama_cpp: exact from the model
- HTTP backends with usage in response: exact
- Others: approximate via tokenizers.count_tokens_approx
5. Errors
| Condition | Wire code |
|---|---|
| Unknown model | not_found |
| Backend HTTP 5xx | internal_error |
| Backend HTTP rate limit | rate_limited (forwarded; retry_after_ms if available) |
| Empty messages array | bad_request |
| Context exceeded | bad_request (with message indicating size) |
| Generation timed out | timeout |
| Backend crashed mid-stream | emit error frame, then close |
6. Configuration
From X04 §3:
[[llm.backends]]
name = "lmstudio"
url = "http://192.168.188.25:1234"
model = "qwen2.5-7b-instruct"
[[llm.backends]]
name = "llama_cpp"
url = "" # local path; see backend
model = "qwen2.5-1.5b-instruct-q4_k_m.gguf"
[[llm.backends]]
name = "anthropic_api"
model = "claude-sonnet-4-6"
api_key_env = "ANTHROPIC_API_KEY"
Constant: LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS = 120.
7. Tests
Unit
- test_capabilities_one_entry_per_model_per_backend
- test_handler_chat_emits_token_then_done
- test_handler_chat_cancellation_within_200ms
- test_params_compatible_model_must_match
- test_params_compatible_ctx_upper_bound
- test_internet_dependent_backend_deregistered_on_offline
Integration
- test_lmstudio_backend_streams_real_tokens (requires LM Studio at the configured address; skip otherwise)
- test_three_node_llm_load_balance
- test_remote_call_through_bus_returns_full_response
8. Cross-references
| What | Where |
|---|---|
| llm.chat@1.0 wire | CONTRACT §4.1 |
| llm.complete@1.0 wire | CONTRACT §4.2 |
| Service protocol | M03 §4 |
| Streaming format | CONTRACT §5.3, X01 §6 |
| Used by RAG | M05 §5 |
| Emergency mode deregistration | M09 §5 |
9. Open questions
- Vision models: Phase 2; reserved modalities: ['text','vision']. Request schema gains messages[].content[].type='image_url'.
- Tool calls: Phase 2; reserved frame tool_call_delta. Will integrate Anthropic + OpenAI styles.
- Local model autodiscovery: should the llama_cpp backend scan a models directory? Useful but easy to defer.
- Per-model preset profiles: Phase 2; bind a system_prompt_template to a model. Not yet.