HearthNet-Nemotron

Running on Zero

File size: 12,035 Bytes

70650b7

# M20 — Vision Services

**Spec version:** v1.0 (Phase 2)
**Depends on:** M03 (bus), M07 (blobs), M04 (LLM, extended), X04 (config), X03 (observability)
**Depended on by:** M08 UI (image describe in ask tab; generate in tools), M21 tools (vision used by tool-augmented LLM), M17 OCR (Florence-2 OCR mode shared)

---

## 1. Responsibility

Two capability families:

- `img.describe@1.0` — given an image CID, produce a caption, tags, object list, or OCR
- `img.generate@1.0` — given a prompt, generate an image

Plus: extend the LLM `llm.chat@2.0` request schema with **multimodal content** (text + image_cid in messages). The multimodal path goes through M04 backends that declare `modalities: ["text", "vision"]`. M20 is responsible for providing the vision backends those LLMs depend on.

Christof's existing pipelines: Florence-2 (describe), FLUX.1-dev with LoRAs (generate), MiniCPM-V (multimodal). All wired in.

---

## 2. File layout

```
hearthnet/services/image/
├── __init__.py
├── describe_service.py
├── generate_service.py
└── backends/
    ├── __init__.py
    ├── base.py             # ImageDescribeBackend, ImageGenerateBackend
    ├── florence2.py        # Microsoft Florence-2 large
    ├── minicpm_v.py        # OpenBMB MiniCPM-V (also usable for chat-with-vision)
    ├── flux.py             # black-forest-labs FLUX.1-dev with LoRA support
    └── stable_diffusion.py # Optional SD-XL fallback
```

---

## 3. Public API — describe

### 3.1 `backends/base.py`

```python
@dataclass(frozen=True)
class ImageDescription:
    caption:    str
    detailed_caption: str | None
    tags:       list[str]
    objects:    list[dict]                  # [{label, bbox, confidence}]
    ocr_text:   str | None
    language:   str
    ms:         int

class ImageDescribeBackend(Protocol):
    name:               str
    tasks_supported:    list[str]           # subset of {"caption","detailed_caption","ocr","objects","tags"}
    languages:          list[str]
    max_pixels:         int

    async def warm(self) -> None: ...
    async def close(self) -> None: ...

    async def describe(
        self,
        image_bytes: bytes,
        *,
        task: str,
        language: str = "en",
    ) -> ImageDescription: ...

    def health(self) -> dict: ...
```

### 3.2 `describe_service.py`

```python
class ImageDescribeService:
    name    = "image.describe"
    version = "1.0"

    def __init__(self, config: VisionConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One img.describe per backend. Params include backend name and tasks_supported."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_describe(self, req: RouteRequest) -> dict:
        """CAP2 §4.13."""
```

### 3.3 Concrete describe backends

```python
class Florence2Backend(ImageDescribeBackend):
    def __init__(self, model: str = "microsoft/Florence-2-large", device: str = "auto"):
        ...

class MinicpmVBackend(ImageDescribeBackend):
    """Used both standalone (img.describe) and as an LLM vision backend (M04 extension)."""
    def __init__(self, model: str = "openbmb/MiniCPM-V-2_6", device: str = "auto"):
        ...
```

---

## 4. Public API — generate

### 4.1 `backends/base.py`

```python
@dataclass(frozen=True)
class GenerationResult:
    image_bytes:    bytes
    width:          int
    height:         int
    format:         str             # "png" | "webp" | "jpg"
    seed:           int
    ms:             int

class ImageGenerateBackend(Protocol):
    name:               str
    models:             list[str]
    loras_available:    list[str]
    max_resolution:     tuple[int, int]
    min_resolution:     tuple[int, int]
    supports_negative_prompt: bool

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

    async def generate(
        self,
        prompt: str,
        *,
        model: str,
        lora: str | None,
        negative_prompt: str | None,
        width: int,
        height: int,
        steps: int,
        seed: int | None,
        progress_cb: Callable[[int, int], None] | None = None,
    ) -> GenerationResult: ...

    def health(self) -> dict: ...
```

### 4.2 `generate_service.py`

```python
class ImageGenerateService:
    name    = "image.generate"
    version = "1.0"

    def __init__(self, config: VisionConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One img.generate per (backend, model) combo. params declare loras_available."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_generate(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.14.
        1. Generate (streaming progress frames)
        2. Store resulting image as blob
        3. Emit done with image_cid"""
```

### 4.3 Concrete generate backends

```python
class FluxBackend(ImageGenerateBackend):
    """FLUX.1-dev with LoRA support; Christof's existing pipeline."""
    def __init__(
        self,
        model: str = "black-forest-labs/FLUX.1-dev",
        device: str = "auto",
        loras_dir: Path = Path("~/.hearthnet/loras"),
    ):
        ...

class StableDiffusionBackend(ImageGenerateBackend):
    """SD-XL fallback for nodes with smaller GPUs."""
    def __init__(self, model: str = "stabilityai/stable-diffusion-xl-base-1.0", device: str = "auto"):
        ...
```

---

## 5. Multimodal LLM extension (M04 hook)

### 5.1 Message content array

In `llm.chat@2.0` (CAP2 §4.23), each `messages[].content` may be a list:

```json
[
  {"type": "text",  "text": "Was ist auf diesem Bild?"},
  {"type": "image", "image_cid": "blake3:..."}
]
```

Backends that declare `modalities: ["text", "vision"]` in their descriptor must handle the array form. Backends that don't either:
- Are skipped by the router (params_compatible returns False when message contains image and `modalities ⊉ {"vision"}`)
- Or fall back: extract text content only, ignore images (worse UX; not recommended)

### 5.2 Vision-capable backends in M04

These M04 backends gain a `modalities: ["text","vision"]` declaration in Phase 2:

| Backend | Vision support |
|---------|----------------|
| `MinicpmVBackend` (M04 entry — same model as M20's describe) | Yes; native multimodal |
| `AnthropicApiBackend` | Yes; Claude vision via API |
| `OpenAiApiBackend` | Yes; GPT-4V |
| `Llava` (new, optional) | Yes; LLaVA via llama.cpp |
| `LlamaCppBackend` | Yes if model is multimodal (LLaVA-format) |
| `OllamaBackend` | Yes for vision models |
| Others | No |

The `M04.LlmService._build_backends` constructs these with their vision flag.

### 5.3 Image preprocessing

For LLM context, images are:
- Loaded from blob store via CID
- Resized to backend's preferred resolution (e.g. 1024×1024 for MiniCPM-V)
- Encoded base64 or sent as bytes per backend's protocol

This is opaque to the caller — the multimodal `messages` array is the contract.

---

## 6. Behaviour

### 6.1 Image describe lifecycle

```
caller → bus.call("img.describe", (1,0), {input:{image_cid:..., task:"detailed_caption"}})
  → ImageDescribeService.handle_describe
       → blob_store.read_blob_bytes(image_cid)
       → backend.describe(bytes, task=...)
       → return ImageDescription serialised
```

### 6.2 Image generate lifecycle

```
caller → bus.stream("img.generate", (1,0), {input:{prompt:"...", steps:20}})
  → ImageGenerateService.handle_generate
       → backend.generate(...) with progress_cb
       → for each step: emit 'progress' frame
       → on completion: blob_store.write_blob(image)
       → emit 'done' frame with image_cid
```

### 6.3 Safety filters

- `img.generate` prompts pass through a configurable safety filter list (regex blocklist + optional LLM-based classifier)
- Generation of identifiable persons is blocked by default (configurable: `config.vision.allow_identifiable_persons`)
- NSFW filter on output (Stable Diffusion has built-in; FLUX needs separate model)
- Failed safety → `bad_request` with `reason: "safety_filter"`

### 6.4 LoRA management

`FluxBackend.loras_available` lists LoRAs found in `loras_dir`. Caller can request a specific LoRA in `params.lora`. Loading a LoRA takes a few seconds on first use; cached thereafter.

Christof's existing LoRAs (local-style, sketches, etc.) drop into the `loras_dir`. The backend auto-discovers them.

### 6.5 GPU pressure

Vision models are heavy. Recommended:

- One Florence-2 instance per node (always-loaded)
- FLUX/SD only loaded on-demand (warm on first request, kept hot for 5 minutes)
- `max_concurrent = 1` for FLUX; `2` for describe backends

These limits are declared in the capability descriptor so the bus throttles correctly.

### 6.6 Multimodal LLM call routing

When a user sends a multimodal message:

```
UI → bus.stream("llm.chat", (2,0), {input:{messages:[{role:"user", content:[{type:"text",text:"..."},{type:"image",image_cid:"..."}]}]}})
  → Router.route filters candidates to those with modalities ⊇ {"vision"}
  → picks best (e.g. MinicpmV local, fall back to Anthropic API)
  → backend handles base64 / image-token-injection internally
```

If no vision-capable backend is online, the call returns `not_found` with a helpful `alt_capabilities` hint pointing to describe-then-text-only fallback (UI can offer this).

---

## 7. Errors

| Condition | Wire code |
|-----------|-----------|
| Unknown task | `bad_request` |
| Image too large | `bad_request` |
| Prompt safety violation | `bad_request` (reason=safety_filter) |
| LoRA not found | `not_found` |
| GPU OOM | `capacity_exceeded` |
| Backend missing for requested task | `not_implemented` |

---

## 8. Configuration

```python
config.vision.enabled                 = True
config.vision.describe_backends = [
    DescribeBackendConfig(name="florence2", model="microsoft/Florence-2-large", device="auto"),
    DescribeBackendConfig(name="minicpm_v", model="openbmb/MiniCPM-V-2_6", device="auto"),
]
config.vision.generate_backends = [
    GenerateBackendConfig(name="flux", model="black-forest-labs/FLUX.1-dev",
                          loras_dir=Path("~/.hearthnet/loras"), device="auto"),
]
config.vision.allow_identifiable_persons = False
config.vision.safety_blocklist_file       = None   # optional regex file
```

---

## 9. Tests

### Unit
- `test_describe_descriptor_per_backend`
- `test_safety_filter_blocks_known_pattern`
- `test_lora_discovery`
- `test_oom_returns_capacity_exceeded`

### Integration
- `test_florence2_caption_sample` (test image)
- `test_flux_generate_with_lora_progress_frames`
- `test_multimodal_llm_routes_to_vision_backend`
- `test_describe_then_text_fallback_when_no_vision_llm`

---

## 10. Cross-references

| What | Where |
|------|-------|
| `img.*` wire | [CAP2 §4.13–4.14](../CAPABILITY_CONTRACT_v2.md) |
| Multimodal `llm.chat@2.0` | [CAP2 §4.23](../CAPABILITY_CONTRACT_v2.md) |
| LLM service extension | M04 (extended in Phase 2 — see [00-OVERVIEW §1](../00-OVERVIEW.md)) |
| OCR overlap | [M17 §3.2 florence_ocr](M17-ocr.md) |
| Christof's pipelines | external, this is the integration |

---

## 11. Open questions

1. **Video** — Phase 3 considers `video.describe` and `video.generate` (LTX-Video). Not in Phase 2.
2. **Image editing (inpainting)** — Phase 2.5: `img.edit@1.0` capability. Reserved.
3. **Control nets (depth, edge, pose)** — Phase 2.5.
4. **3D generation** — Phase 3 with TripoSR or similar.
5. **Safety filter quality** — regex blocklist is weak. An LLM-as-judge classifier is better but adds latency. Configurable; default off.
6. **LoRA stacking** — caller specifies multiple `loras: ["...","..."]`. Implementable but adds attack surface (prompt-LoRA combos). Defer.