
M04 - LLM Service

Spec version: v1.0
Depends on: M03 (bus), X04 (config), X03 (observability), backend libs (llama-cpp-python, ollama HTTP, httpx for HTTP backends)
Depended on by: M05 (RAG uses llm.complete internally), M08 (UI passes user queries through llm.chat)


1. Responsibility

Provide llm.chat@1.0 and llm.complete@1.0. Wrap multiple inference backends (llama.cpp, Ollama, LM Studio, HF Inference API, Anthropic API, OpenAI-compatible HTTP). Register one capability instance per (backend, model, quant) tuple so the bus can see them as separate routable providers.


2. File layout

hearthnet/services/llm/
├── __init__.py
├── service.py                 # LlmService
├── tokenizers.py              # rough token counting per family
└── backends/
    ├── __init__.py
    ├── base.py                # LlmBackend Protocol
    ├── llama_cpp.py           # llama-cpp-python in-process
    ├── ollama.py              # Ollama HTTP at http://localhost:11434
    ├── lmstudio.py            # LM Studio HTTP (OpenAI-compatible)
    ├── hf_api.py              # HuggingFace Inference API
    └── anthropic_api.py       # Anthropic Messages API

3. Public API

3.1 backends/base.py

# hearthnet/services/llm/backends/base.py
from dataclasses import dataclass
from typing import AsyncIterator, Protocol

@dataclass(frozen=True)
class Token:
    text:    str
    logprob: float | None
    stop:    bool

@dataclass(frozen=True)
class ChatResult:
    text:       str
    tokens_in:  int
    tokens_out: int
    stop_reason: str    # "end" | "max_tokens" | "stop_sequence" | "cancelled"
    ms:         int

@dataclass(frozen=True)
class BackendModel:
    """One model an LlmBackend can serve."""
    name:           str       # "qwen2.5-7b-instruct"
    quant:          str       # "q4_k_m", "q8_0", "fp16", "api"
    ctx_max:        int       # 8192
    modalities:     list[str] # ["text"] or ["text", "vision"]
    requires_internet: bool   # API backends → True; local → False

class LlmBackend(Protocol):
    """Abstract backend. Implementations cover one provider."""

    name:       str           # "llama_cpp" | "ollama" | ...
    models:     list[BackendModel]

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

    async def chat(
        self,
        *,
        model: str,
        messages: list[dict],
        max_tokens: int = 1024,
        temperature: float = 0.7,
        top_p: float = 0.95,
        stop: list[str] | None = None,
        seed: int | None = None,
        stream: bool = True,
    ) -> AsyncIterator[Token]:
        """Yields Tokens. The final Token has stop=True."""

    async def complete(
        self,
        *,
        model: str,
        prompt: str,
        max_tokens: int = 256,
        temperature: float = 0.7,
        top_p: float = 0.95,
        stop: list[str] | None = None,
        seed: int | None = None,
        stream: bool = True,
    ) -> AsyncIterator[Token]: ...

    def count_tokens(self, model: str, text: str) -> int:
        """Approximate token count; uses a per-model tokenizer if available."""

    def max_concurrent(self, model: str) -> int:
        """Backend-specific concurrency limit. Used in capability descriptor."""

    def health(self) -> dict: ...

3.2 Concrete backends

# hearthnet/services/llm/backends/llama_cpp.py
from pathlib import Path

class LlamaCppBackend(LlmBackend):
    """In-process llama-cpp-python. Loads one model at a time per instance.
       Multiple LlamaCppBackend instances may coexist if VRAM allows."""

    def __init__(self, model_path: Path, model_meta: BackendModel, gpu_layers: int = -1):
        ...

# hearthnet/services/llm/backends/ollama.py
class OllamaBackend(LlmBackend):
    """HTTP-based Ollama at http://localhost:11434 (or remote)."""

    def __init__(self, base_url: str = "http://localhost:11434", models: list[str] | None = None):
        """If models is None, discover via GET /api/tags."""

# hearthnet/services/llm/backends/lmstudio.py
class LmStudioBackend(LlmBackend):
    """OpenAI-compatible HTTP at http://host:1234.
       Used in Christof's home setup at 192.168.188.25:1234."""

    def __init__(self, base_url: str, default_model: str): ...

# hearthnet/services/llm/backends/hf_api.py
class HfApiBackend(LlmBackend):
    """HuggingFace Inference API. Requires HF_TOKEN env var (declared in config.llm.backends[].api_key_env)."""

    def __init__(self, model: str, token_env: str = "HF_TOKEN"): ...

# hearthnet/services/llm/backends/anthropic_api.py
class AnthropicApiBackend(LlmBackend):
    """Anthropic Messages API. Phase 1.5; useful when internet up."""

    def __init__(self, model: str = "claude-sonnet-4-6", token_env: str = "ANTHROPIC_API_KEY"): ...

3.3 tokenizers.py

# hearthnet/services/llm/tokenizers.py
def count_tokens_approx(model_family: str, text: str) -> int:
    """Fast heuristic: chars / 3.5 for Latin scripts, /2 for CJK.
       Used when no real tokenizer is available."""

def model_family(model_name: str) -> str:
    """'qwen2.5-7b-instruct' → 'qwen', 'llama-3-8b' → 'llama', etc."""

3.4 service.py

# hearthnet/services/llm/service.py
from typing import AsyncIterator, Callable

class LlmService:
    name    = "llm"
    version = "1.0"

    def __init__(self, config: LlmConfig):
        self._backends: list[LlmBackend] = self._build_backends(config)

    def _build_backends(self, config: LlmConfig) -> list[LlmBackend]:
        """Instantiate each declared backend; skip backends that fail to initialise (with warning)."""

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """Emits one (descriptor, handler, predicate) per (backend, model, capability_kind) combo:
        - For each backend × each model: one llm.chat entry and one llm.complete entry.
        - Each descriptor's params include model, quant, ctx, backend."""

    async def start(self) -> None:
        """Warm one backend (the first listed) to avoid cold-start lag on first call."""

    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    # --- handlers ---

    async def handle_chat(self, req: RouteRequest) -> AsyncIterator[dict]:
        """Streams SSE frames per CONTRACT §4.1.
        Picks the backend from req.body['params']['model'] (matched at routing).
        Maps backend Token β†’ SSE 'token' frames; emits 'done' with meta."""

    async def handle_complete(self, req: RouteRequest) -> AsyncIterator[dict]:
        """Same shape as chat but for CONTRACT §4.2."""

3.5 Capability descriptors

For each (backend, model) pair, the service registers:

# llm.chat instance
CapabilityDescriptor(
    name="llm.chat",
    version=(1, 0),
    stability="stable",
    request_schema={...},        # CONTRACT §4.1 schema
    response_schema={...},       # for non-stream fallback
    stream_schema={
        "oneOf": [
            {"type": "object", "required": ["text"], "properties": {"text": {"type": "string"}, "logprob": {"type": ["number", "null"]}}},
            {"type": "object", "required": ["tokens_out", "stop_reason", "ms"]}     # done frame
        ]
    },
    params={
        "model":   "<model.name>",
        "quant":   "<model.quant>",
        "ctx":     model.ctx_max,
        "backend": "<backend.name>",
        "modalities": model.modalities,
    },
    max_concurrent=backend.max_concurrent(model.name),
    trust_required="member",
    timeout_seconds=LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS,
    idempotent=False,
)
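
The per-pair registration can be sketched as a nested loop over backends, models and capability names. This is an illustration under stated assumptions: StubBackend and enumerate_capabilities are hypothetical, BackendModel is reduced to the fields the loop reads, plain dicts stand in for CapabilityDescriptor, and the quant and ctx values are examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendModel:
    name: str
    quant: str
    ctx_max: int
    modalities: tuple[str, ...] = ("text",)


@dataclass(frozen=True)
class StubBackend:
    """Hypothetical stand-in for an LlmBackend; only what the loop needs."""
    name: str
    models: tuple[BackendModel, ...]

    def max_concurrent(self, model: str) -> int:
        return 1  # assumed limit for the sketch


def enumerate_capabilities(backends: list[StubBackend]) -> list[dict]:
    """One llm.chat and one llm.complete entry per (backend, model) pair."""
    entries = []
    for backend in backends:
        for model in backend.models:
            for capability in ("llm.chat", "llm.complete"):
                entries.append({
                    "name": capability,
                    "version": (1, 0),
                    "params": {
                        "model": model.name,
                        "quant": model.quant,
                        "ctx": model.ctx_max,
                        "backend": backend.name,
                        "modalities": list(model.modalities),
                    },
                    "max_concurrent": backend.max_concurrent(model.name),
                })
    return entries


backends = [
    StubBackend("llama_cpp", (BackendModel("qwen2.5-1.5b-instruct", "q4_k_m", 8192),)),
    StubBackend("lmstudio", (BackendModel("qwen2.5-7b-instruct", "q4_k_m", 8192),)),
]
entries = enumerate_capabilities(backends)
```

With two backends serving one model each, the sketch yields four entries, which is the property test_capabilities_one_entry_per_model_per_backend would presumably check.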

3.6 params_compatible predicate

def params_compatible(offered: dict, requested: dict) -> bool:
    # Required match: model.
    # Optional match: ctx (caller's must be <= offered).
    if requested.get("model") != offered.get("model"):
        return False
    if "ctx" in requested and requested["ctx"] > offered["ctx"]:
        return False
    return True
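
As a usage illustration, the predicate can filter a list of offered params down to the providers able to serve a request. The offers below are made-up examples (the ctx values in particular are assumptions), and the filtering loop is only a sketch of what a router might do, not the M03 routing logic.

```python
def params_compatible(offered: dict, requested: dict) -> bool:
    # Same predicate as in the spec: model must match, ctx is an upper bound.
    if requested.get("model") != offered.get("model"):
        return False
    if "ctx" in requested and requested["ctx"] > offered["ctx"]:
        return False
    return True


# Hypothetical offers from three registered capability instances.
offers = [
    {"model": "qwen2.5-7b-instruct", "ctx": 8192, "backend": "llama_cpp"},
    {"model": "qwen2.5-7b-instruct", "ctx": 32768, "backend": "lmstudio"},
    {"model": "qwen2.5-1.5b-instruct", "ctx": 8192, "backend": "llama_cpp"},
]

request = {"model": "qwen2.5-7b-instruct", "ctx": 16000}
candidates = [o["backend"] for o in offers if params_compatible(o, request)]

# Without a ctx requirement, both 7B providers qualify.
any_ctx = [o["backend"] for o in offers
           if params_compatible(o, {"model": "qwen2.5-7b-instruct"})]
```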

4. Behaviour

4.1 Multi-backend selection

Multiple backends may serve the same model name (e.g. llama_cpp local + LM Studio remote both offer qwen2.5-7b-instruct). They register as separate capability entries. The bus router picks among them by latency and load; the service applies no internal preference logic.

4.2 Streaming and cancellation

  • handle_chat is an async generator
  • Each backend Token becomes one SSE token frame
  • On client disconnect, the generator is cancelled: the backend's chat() async iterator receives GeneratorExit (when closed via aclose()) or asyncio.CancelledError (when the consuming task is cancelled) and propagates cancellation to the underlying library (llama.cpp: set abort flag; HTTP backends: close connection)
  • Cleanup must complete within 200 ms
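
The cancellation path above can be sketched with two nested async generators. This is a toy model, not the service code: backend_stream stands in for a backend's chat(), the frame envelope ("event"/"data") is an assumed shape, and the events list exists only to make the cleanup order observable.

```python
import asyncio
from typing import AsyncIterator

events: list[str] = []


async def backend_stream() -> AsyncIterator[str]:
    """Stand-in for a backend's chat(): yields until it is closed."""
    try:
        n = 0
        while True:
            await asyncio.sleep(0.001)
            n += 1
            yield f"tok{n}"
    finally:
        # Runs on GeneratorExit (aclose) and on task cancellation alike.
        # A real backend would set its abort flag or close its connection here.
        events.append("backend released")


async def handle_chat() -> AsyncIterator[dict]:
    stream = backend_stream()
    try:
        async for text in stream:
            yield {"event": "token", "data": {"text": text}}
    finally:
        # Bound backend cleanup to the 200 ms budget.
        await asyncio.wait_for(stream.aclose(), timeout=0.2)
        events.append("handler closed")


async def main() -> int:
    gen = handle_chat()
    received = 0
    async for _frame in gen:
        received += 1
        if received == 3:  # simulate the client disconnecting
            break
    await gen.aclose()     # what the transport layer would do on disconnect
    return received


received = asyncio.run(main())
```

Closing the inner stream explicitly in the handler's finally block matters here: breaking out of an async for loop does not close the generator being iterated.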

4.3 Internet-dependent backends

HfApiBackend and AnthropicApiBackend set requires_internet=True on their BackendModel. The service still registers them, but the M09 detector triggers deregistration from the local bus when offline. On restore, they are re-registered.
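
A minimal sketch of the register/deregister cycle, assuming the M09 detector reports connectivity as a boolean. LocalRegistry, Entry and on_connectivity are hypothetical names chosen for illustration and do not reflect the actual bus or M09 API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    backend: str
    model: str
    requires_internet: bool


class LocalRegistry:
    """Hypothetical stand-in for the local bus registry."""

    def __init__(self, entries: list[Entry]):
        self._all = list(entries)
        self.active = set(self._all)

    def on_connectivity(self, online: bool) -> None:
        """Drop internet-dependent entries when offline; restore them when online."""
        for entry in self._all:
            if not entry.requires_internet:
                continue  # local backends stay registered either way
            if online:
                self.active.add(entry)
            else:
                self.active.discard(entry)


registry = LocalRegistry([
    Entry("llama_cpp", "qwen2.5-1.5b-instruct", requires_internet=False),
    Entry("anthropic_api", "claude-sonnet-4-6", requires_internet=True),
])

registry.on_connectivity(False)
offline_backends = sorted(e.backend for e in registry.active)

registry.on_connectivity(True)
online_backends = sorted(e.backend for e in registry.active)
```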

4.4 Tool calls (Phase 2)

The tool_call_delta stream frame in CONTRACT §4.1 is reserved. Backends that support tool calls (Anthropic, OpenAI, OpenAI-compatible) will emit these in a future version. MVP: ignored / empty.

4.5 Deterministic mode

If seed is present in the request, backends that support seeded sampling apply it. llama_cpp does; HTTP APIs vary. When seeding is unsupported, the backend still serves the request but does NOT promise determinism.

4.6 Token counting

Token counts in meta.tokens_in / meta.tokens_out:

  • llama_cpp: exact from the model
  • HTTP backends with usage in response: exact
  • Others: approximate via tokenizers.count_tokens_approx

5. Errors

| Condition | Wire code |
|---|---|
| Unknown model | not_found |
| Backend HTTP 5xx | internal_error |
| Backend HTTP rate limit | rate_limited (forwarded; retry_after_ms if available) |
| Empty messages array | bad_request |
| Context exceeded | bad_request (with message indicating size) |
| Generation timed out | timeout |
| Backend crashed mid-stream | emit error frame, then close |

6. Configuration

From X04 §3:

[[llm.backends]]
name  = "lmstudio"
url   = "http://192.168.188.25:1234"
model = "qwen2.5-7b-instruct"

[[llm.backends]]
name  = "llama_cpp"
url   = ""                # local path; see backend
model = "qwen2.5-1.5b-instruct-q4_k_m.gguf"

[[llm.backends]]
name        = "anthropic_api"
model       = "claude-sonnet-4-6"
api_key_env = "ANTHROPIC_API_KEY"

Constant: LLM_GENERATION_DEFAULT_TIMEOUT_SECONDS = 120.


7. Tests

Unit

  • test_capabilities_one_entry_per_model_per_backend
  • test_handler_chat_emits_token_then_done
  • test_handler_chat_cancellation_within_200ms
  • test_params_compatible_model_must_match
  • test_params_compatible_ctx_upper_bound
  • test_internet_dependent_backend_deregistered_on_offline

Integration

  • test_lmstudio_backend_streams_real_tokens (requires LM Studio at the configured address; skip otherwise)
  • test_three_node_llm_load_balance
  • test_remote_call_through_bus_returns_full_response

8. Cross-references

| What | Where |
|---|---|
| llm.chat@1.0 wire | CONTRACT §4.1 |
| llm.complete@1.0 wire | CONTRACT §4.2 |
| Service protocol | M03 §4 |
| Streaming format | CONTRACT §5.3, X01 §6 |
| Used by RAG | M05 §5 |
| Emergency mode deregistration | M09 §5 |

9. Open questions

  1. Vision models - Phase 2; reserved modalities: ['text','vision']. Request schema gains messages[].content[].type='image_url'.
  2. Tool calls - Phase 2; reserved frame tool_call_delta. Will integrate Anthropic + OpenAI styles.
  3. Local model autodiscovery - should llama_cpp backend scan a models directory? Useful but easy to defer.
  4. Per-model preset profiles - Phase 2: bind a system_prompt_template to a model. Not yet.