HearthNet-Nemotron

Running on Zero

App Files Files Community

HearthNet-Nemotron / docs /M04-llm.md

Chris4K

prd splitted + contracts

6f9a5fd 19 days ago

preview code

Raw

History Blame

12.2 kB

	# M04 — LLM Service

	Spec version: v1.0
	Depends on: M03 (bus), X04 (config), X03 (observability), backend libs (llama-cpp-python, ollama HTTP, httpx for HTTP backends)
	Depended on by: M05 (RAG uses llm.complete internally), M08 (UI passes user queries through llm.chat)

	---

	## 1. Responsibility

	Provide `llm.chat@1.0` and `llm.complete@1.0`. Wrap multiple inference backends (llama.cpp, Ollama, LM Studio, HF Inference API, Anthropic API, OpenAI-compatible HTTP). Register one capability instance per (backend, model, quant) tuple so the bus can see them as separate routable providers.

	---

	## 2. File layout

	```
	hearthnet/services/llm/
	├── __init__.py
	├── service.py # LlmService
	├── tokenizers.py # rough token counting per family
	└── backends/
	├── __init__.py
	├── base.py # LlmBackend Protocol
	├── llama_cpp.py # llama-cpp-python in-process
	├── ollama.py # Ollama HTTP at http://localhost:11434
	├── lmstudio.py # LM Studio HTTP (OpenAI-compatible)
	├── hf_api.py # HuggingFace Inference API
	└── anthropic_api.py # Anthropic Messages API
	```

	---

	## 3. Public API

	### 3.1 `backends/base.py`

	```python
	# hearthnet/services/llm/backends/base.py
	from dataclasses import dataclass
	from typing import AsyncIterator, Protocol

	@dataclass(frozen=True)
	class Token:
	text: str
	logprob: float \| None
	stop: bool

	@dataclass(frozen=True)
	class ChatResult:
	text: str
	tokens_in: int
	tokens_out: int
	stop_reason: str # "end" \| "max_tokens" \| "stop_sequence" \| "cancelled"
	ms: int

	@dataclass(frozen=True)
	class BackendModel:
	"""One model an LlmBackend can serve."""
	name: str # "qwen2.5-7b-instruct"
	quant: str # "q4_k_m", "q8_0", "fp16", "api"
	ctx_max: int # 8192
	modalities: list[str] # ["text"] or ["text", "vision"]
	requires_internet: bool # API backends → True; local → False

	class LlmBackend(Protocol):
	"""Abstract backend. Implementations cover one provider."""

	name: str # "llama_cpp" \| "ollama" \| ...
	models: list[BackendModel]

	async def warm(self, model: str) -> None: ...
	async def close(self) -> None: ...

	async def chat(
	self,
	*,
	model: str,
	messages: list[dict],
	max_tokens: int = 1024,
	temperature: float = 0.7,
	top_p: float = 0.95,
	stop: list[str] \| None = None,
	seed: int \| None = None,
	stream: bool = True,
	) -> AsyncIterator[Token]:
	"""Yields Tokens. The final Token has stop=True."""

	async def complete(
	self,
	*,
	model: str,
	prompt: str,
	max_tokens: int = 256,
	temperature: float = 0.7,
	top_p: float = 0.95,
	stop: list[str] \| None = None,
	seed: int \| None = None,
	stream: bool = True,
	) -> AsyncIterator[Token]: ...

	def count_tokens(self, model: str, text: str) -> int:
	"""Approximate token count; uses a per-model tokenizer if available."""

	def max_concurrent(self, model: str) -> int:
	"""Backend-specific concurrency limit. Used in capability descriptor."""

	def health(self) -> dict: ...
	```

	### 3.2 Concrete backends

	```python
	# hearthnet/services/llm/backends/llama_cpp.py
	class LlamaCppBackend(LlmBackend):
	"""In-process llama-cpp-python. Loads one model at a time per instance.
	Multiple LlamaCppBackend instances may coexist if VRAM allows."""

	def __init__(self, model_path: Path, model_meta: BackendModel, gpu_layers: int = -1):
	...

	# hearthnet/services/llm/backends/ollama.py
	class OllamaBackend(LlmBackend):
	"""HTTP-based Ollama at http://localhost:11434 (or remote)."""

	def __init__(self, base_url: str = "http://localhost:11434", models: list[str] \| None = None):
	"""If models is None, discover via GET /api/tags."""

	# hearthnet/services/llm/backends/lmstudio.py
	class LmStudioBackend(LlmBackend):
	"""OpenAI-compatible HTTP at http://host:1234.
	Used in Christof's home setup at 192.168.188.25:1234."""

	def __init__(self, base_url: str, default_model: str): ...

	# hearthnet/services/llm/backends/hf_api.py
	class HfApiBackend(LlmBackend):
	"""HuggingFace Inference API. Requires HF_TOKEN env var (declared in config.llm.backends[].api_key_env)."""

	def __init__(self, model: str, token_env: str = "HF_TOKEN"): ...

	# hearthnet/services/llm/backends/anthropic_api.py
	class AnthropicApiBackend(LlmBackend):
	"""Anthropic Messages API. Phase 1.5; useful when internet up."""

	def __init__(self, model: str = "claude-sonnet-4-6", token_env: str = "ANTHROPIC_API_KEY"): ...
	```

	### 3.3 `tokenizers.py`

	```python
	# hearthnet/services/llm/tokenizers.py
	def count_tokens_approx(model_family: str, text: str) -> int:
	"""Fast heuristic: chars / 3.5 for Latin scripts, /2 for CJK.
	Used when no real tokenizer is available."""

	def model_family(model_name: str) -> str:
	"""'qwen2.5-7b-instruct' → 'qwen', 'llama-3-8b' → 'llama', etc."""
	```

	### 3.4 `service.py`

	```python
	# hearthnet/services/llm/service.py
	class LlmService:
	name = "llm"
	version = "1.0"

	def __init__(self, config: LlmConfig):
	self._backends: list[LlmBackend] = self._build_backends(config)

	def _build_backends(self, config: LlmConfig) -> list[LlmBackend]:
	"""Instantiate each declared backend; skip backends that fail to initialise (with warning)."""

	def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
	"""Emits one (descriptor, handler, predicate) per (backend, model, capability_kind) combo:
	- For each backend × each model: one llm.chat entry and one llm.complete entry.
	- Each descriptor's params include model, quant, ctx, backend."""

	async def start(self) -> None:
	"""Warm one backend (the first listed) to avoid cold-start lag on first call."""

	async def stop(self) -> None: ...
	def health(self) -> dict: ...

	# --- handlers ---

	async def handle_chat(self, req: RouteRequest) -> AsyncIterator[dict]:
	"""Streams SSE frames per CONTRACT §4.1.
	Picks the backend from req.body['params']['model'] (matched at routing).
	Maps backend Token → SSE 'token' frames; emits 'done' with meta."""

	async def handle_complete(self, req: RouteRequest) -> AsyncIterator[dict]:
	"""Same shape as chat but for CONTRACT §4.2."""
	```

	### 3.5 Capability descriptors

	For each `(backend, model)` pair, the service registers:

	```python
	# llm.chat instance
	CapabilityDescriptor(
	name="llm.chat",
	version=(1, 0),
	stability="stable",
	request_schema={...}, # CONTRACT §4.1 schema
	response_schema={...}, # for non-stream fallback
	stream_schema={
	"oneOf": [
	{"type": "object", "required": ["text"], "properties": {"text": {"type": "string"}, "logprob": {"type": ["number", "null"]}}},
	{"type": "object", "required": ["tokens_out", "stop_reason", "ms"]} # done frame
	]
	},
	params={
	"model": "<model.name>",
	"quant": "<model.quant>",
	"ctx": model.ctx_max,
	"backend": "<backend.name>",
	"modalities": model.modalities,
	},
	max_concurrent=backend.max_concurrent(model.name),
	trust_required="member",
	timeout_seconds=LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS,
	idempotent=False,
	)
	```

	### 3.6 `params_compatible` predicate

	```python
	def params_compatible(offered: dict, requested: dict) -> bool:
	# Required match: model.
	# Optional match: ctx (caller's must be <= offered).
	if requested.get("model") != offered.get("model"):
	return False
	if "ctx" in requested and requested["ctx"] > offered["ctx"]:
	return False
	return True
	```

	---

	## 4. Behaviour

	### 4.1 Multi-backend selection

	Multiple backends may serve the same model name (e.g. llama_cpp local + LM Studio remote both offer `qwen2.5-7b-instruct`). They register as separate capability entries. The bus router picks among them by latency/load — no service-internal preference logic.

	### 4.2 Streaming and cancellation

	- `handle_chat` is an async generator
	- Each backend `Token` becomes one SSE `token` frame
	- On client disconnect, the generator is cancelled; the backend's `chat()` async iterator receives `GeneratorExit`, propagates cancellation to the underlying library (llama.cpp: set abort flag; HTTP backends: close connection)
	- Cleanup must complete within 200 ms

	### 4.3 Internet-dependent backends

	`HfApiBackend` and `AnthropicApiBackend` set `requires_internet=True` on their `BackendModel`. The service still registers them, but the [M09](M09-emergency.md) detector triggers deregistration from the local bus when offline. On restore, they are re-registered.

	### 4.4 Tool calls (Phase 2)

	The `tool_call_delta` stream frame in [CONTRACT §4.1](../CAPABILITY_CONTRACT.md) is reserved. Backends that support tool calls (Anthropic, OpenAI, OpenAI-compatible) will emit these in a future version. MVP: ignored / empty.

	### 4.5 Deterministic mode

	If `seed` is present in request, backends that support seeded sampling apply it. `llama_cpp` does; HTTP APIs vary. When unsupported, backend still serves but does NOT promise determinism.

	### 4.6 Token counting

	Token counts in `meta.tokens_in` / `meta.tokens_out`:
	- `llama_cpp`: exact from the model
	- HTTP backends with usage in response: exact
	- Others: approximate via `tokenizers.count_tokens_approx`

	---

	## 5. Errors

	\| Condition \| Wire code \|
	\|-----------\|-----------\|
	\| Unknown model \| `not_found` \|
	\| Backend HTTP 5xx \| `internal_error` \|
	\| Backend HTTP rate limit \| `rate_limited` (forwarded; `retry_after_ms` if available) \|
	\| Empty messages array \| `bad_request` \|
	\| Context exceeded \| `bad_request` (with message indicating size) \|
	\| Generation timed out \| `timeout` \|
	\| Backend crashed mid-stream \| emit `error` frame, then close \|

	---

	## 6. Configuration

	From [X04 §3](../cross-cutting/X04-config.md):

	```toml
	[[llm.backends]]
	name = "lmstudio"
	url = "http://192.168.188.25:1234"
	model = "qwen2.5-7b-instruct"

	[[llm.backends]]
	name = "llama_cpp"
	url = "" # local path; see backend
	model = "qwen2.5-1.5b-instruct-q4_k_m.gguf"

	[[llm.backends]]
	name = "anthropic_api"
	model = "claude-sonnet-4-6"
	api_key_env = "ANTHROPIC_API_KEY"
	```

	Constant: `LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS = 120`.

	---

	## 7. Tests

	### Unit
	- `test_capabilities_one_entry_per_model_per_backend`
	- `test_handler_chat_emits_token_then_done`
	- `test_handler_chat_cancellation_within_200ms`
	- `test_params_compatible_model_must_match`
	- `test_params_compatible_ctx_upper_bound`
	- `test_internet_dependent_backend_deregistered_on_offline`

	### Integration
	- `test_lmstudio_backend_streams_real_tokens` (requires LM Studio at the configured address; skip otherwise)
	- `test_three_node_llm_load_balance`
	- `test_remote_call_through_bus_returns_full_response`

	---

	## 8. Cross-references

	\| What \| Where \|
	\|------\|-------\|
	\| `llm.chat@1.0` wire \| [CONTRACT §4.1](../CAPABILITY_CONTRACT.md) \|
	\| `llm.complete@1.0` wire \| [CONTRACT §4.2](../CAPABILITY_CONTRACT.md) \|
	\| Service protocol \| [M03 §4](M03-bus.md) \|
	\| Streaming format \| [CONTRACT §5.3](../CAPABILITY_CONTRACT.md), [X01 §6](../cross-cutting/X01-transport.md) \|
	\| Used by RAG \| [M05 §5](M05-rag.md) \|
	\| Emergency mode deregistration \| [M09 §5](M09-emergency.md) \|

	---

	## 9. Open questions

	1. Vision models — Phase 2; reserved `modalities: ['text','vision']`. Request schema gains `messages[].content[].type='image_url'`.
	2. Tool calls — Phase 2; reserved frame `tool_call_delta`. Will integrate Anthropic + OpenAI styles.
	3. Local model autodiscovery — should `llama_cpp` backend scan a models directory? Useful but easy to defer.
	4. Per-model preset profiles — Phase 2: bind a `system_prompt_template` to a model. Not yet.