# M20: Vision Services

**Spec version:** v1.0 (Phase 2)
**Depends on:** M03 (bus), M07 (blobs), M04 (LLM, extended), X04 (config), X03 (observability)
**Depended on by:** M08 UI (image describe in ask tab; generate in tools), M21 tools (vision used by tool-augmented LLM), M17 OCR (Florence-2 OCR mode shared)

---

## 1. Responsibility

Two capability families:

- `img.describe@1.0`: given an image CID, produce a caption, tags, object list, or OCR
- `img.generate@1.0`: given a prompt, generate an image

Plus: extend the LLM `llm.chat@2.0` request schema with **multimodal content** (text + image_cid in messages). The multimodal path goes through M04 backends that declare `modalities: ["text", "vision"]`. M20 is responsible for providing the vision backends those LLMs depend on.

Christof's existing pipelines are all wired in: Florence-2 (describe), FLUX.1-dev with LoRAs (generate), and MiniCPM-V (multimodal).

---

## 2. File layout

```
hearthnet/services/image/
├── __init__.py
├── describe_service.py
├── generate_service.py
└── backends/
    ├── __init__.py
    ├── base.py             # ImageDescribeBackend, ImageGenerateBackend
    ├── florence2.py        # Microsoft Florence-2 large
    ├── minicpm_v.py        # OpenBMB MiniCPM-V (also usable for chat-with-vision)
    ├── flux.py             # black-forest-labs FLUX.1-dev with LoRA support
    └── stable_diffusion.py # Optional SD-XL fallback
```

---

## 3. Public API: describe

### 3.1 `backends/base.py`

```python
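from dataclasses import dataclass
from typing import Callable, Protocol  # Callable is used by the generate protocol in this file

@dataclass(frozen=True)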
class ImageDescription:
    caption:    str
    detailed_caption: str | None
    tags:       list[str]
    objects:    list[dict]                  # [{label, bbox, confidence}]
    ocr_text:   str | None
    language:   str
    ms:         int

class ImageDescribeBackend(Protocol):
    name:               str
    tasks_supported:    list[str]           # subset of {"caption","detailed_caption","ocr","objects","tags"}
    languages:          list[str]
    max_pixels:         int

    async def warm(self) -> None: ...
    async def close(self) -> None: ...

    async def describe(
        self,
        image_bytes: bytes,
        *,
        task: str,
        language: str = "en",
    ) -> ImageDescription: ...

    def health(self) -> dict: ...
```

### 3.2 `describe_service.py`

```python
class ImageDescribeService:
    name    = "image.describe"
    version = "1.0"

    def __init__(self, config: VisionConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One img.describe per backend. Params include backend name and tasks_supported."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_describe(self, req: RouteRequest) -> dict:
        """CAP2 §4.13."""
```

### 3.3 Concrete describe backends

```python
class Florence2Backend(ImageDescribeBackend):
    def __init__(self, model: str = "microsoft/Florence-2-large", device: str = "auto"):
        ...

class MinicpmVBackend(ImageDescribeBackend):
    """Used both standalone (img.describe) and as an LLM vision backend (M04 extension)."""
    def __init__(self, model: str = "openbmb/MiniCPM-V-2_6", device: str = "auto"):
        ...
```

---

## 4. Public API: generate

### 4.1 `backends/base.py`

```python
@dataclass(frozen=True)
class GenerationResult:
    image_bytes:    bytes
    width:          int
    height:         int
    format:         str             # "png" | "webp" | "jpg"
    seed:           int
    ms:             int

class ImageGenerateBackend(Protocol):
    name:               str
    models:             list[str]
    loras_available:    list[str]
    max_resolution:     tuple[int, int]
    min_resolution:     tuple[int, int]
    supports_negative_prompt: bool

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

    async def generate(
        self,
        prompt: str,
        *,
        model: str,
        lora: str | None,
        negative_prompt: str | None,
        width: int,
        height: int,
        steps: int,
        seed: int | None,
        progress_cb: Callable[[int, int], None] | None = None,
    ) -> GenerationResult: ...

    def health(self) -> dict: ...
```

### 4.2 `generate_service.py`

```python
class ImageGenerateService:
    name    = "image.generate"
    version = "1.0"

    def __init__(self, config: VisionConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One img.generate per (backend, model) combo. params declare loras_available."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_generate(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.14.
        1. Generate (streaming progress frames)
        2. Store resulting image as blob
        3. Emit done with image_cid"""
```

### 4.3 Concrete generate backends

```python
class FluxBackend(ImageGenerateBackend):
    """FLUX.1-dev with LoRA support; Christof's existing pipeline."""
    def __init__(
        self,
        model: str = "black-forest-labs/FLUX.1-dev",
        device: str = "auto",
        loras_dir: Path = Path("~/.hearthnet/loras"),
    ):
        ...

class StableDiffusionBackend(ImageGenerateBackend):
    """SD-XL fallback for nodes with smaller GPUs."""
    def __init__(self, model: str = "stabilityai/stable-diffusion-xl-base-1.0", device: str = "auto"):
        ...
```

---

## 5. Multimodal LLM extension (M04 hook)

### 5.1 Message content array

In `llm.chat@2.0` (CAP2 §4.23), each `messages[].content` may be a list:

```json
[
  {"type": "text",  "text": "Was ist auf diesem Bild?"},
  {"type": "image", "image_cid": "blake3:..."}
]
```

Backends that declare `modalities: ["text", "vision"]` in their descriptor must handle the array form. A backend without the `vision` modality is handled in one of two ways:
- It is skipped by the router (`params_compatible` returns `False` when a message contains an image and `modalities ⊉ {"vision"}`)
- Or it falls back to extracting the text content only and ignoring images (worse UX; not recommended)
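
The router-side rule can be sketched as a small predicate. This is a minimal illustration, not the M03 implementation: the function signatures, the `descriptor_params` and `request_input` argument names, and the default of text-only when `modalities` is absent are all assumptions.

```python
from typing import Any


def message_has_image(messages: list[dict[str, Any]]) -> bool:
    """Return True if any message uses the array form and carries an image part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            if any(isinstance(part, dict) and part.get("type") == "image" for part in content):
                return True
    return False


def params_compatible(descriptor_params: dict[str, Any], request_input: dict[str, Any]) -> bool:
    """Reject a backend when the request needs vision and the backend lacks it.

    Assumption: a descriptor without a `modalities` entry is treated as text-only.
    """
    modalities = set(descriptor_params.get("modalities", ["text"]))
    if message_has_image(request_input.get("messages", [])):
        return "vision" in modalities
    return True
```

Plain string content never triggers the vision requirement, so text-only backends stay eligible for ordinary chat requests.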

### 5.2 Vision-capable backends in M04

These M04 backends gain a `modalities: ["text","vision"]` declaration in Phase 2:

| Backend | Vision support |
|---------|----------------|
| `MinicpmVBackend` (M04 entry; same model as M20's describe) | Yes; native multimodal |
| `AnthropicApiBackend` | Yes; Claude vision via API |
| `OpenAiApiBackend` | Yes; GPT-4V |
| `Llava` (new, optional) | Yes; LLaVA via llama.cpp |
| `LlamaCppBackend` | Yes if model is multimodal (LLaVA-format) |
| `OllamaBackend` | Yes for vision models |
| Others | No |

The `M04.LlmService._build_backends` constructs these with their vision flag.

### 5.3 Image preprocessing

For LLM context, images are:
- Loaded from blob store via CID
- Resized to the backend's preferred resolution (e.g. 1024×1024 for MiniCPM-V)
- Encoded base64 or sent as bytes per backend's protocol

This is opaque to the caller: the multimodal `messages` array is the contract.

---

## 6. Behaviour

### 6.1 Image describe lifecycle

```
caller → bus.call("img.describe", (1,0), {input:{image_cid:..., task:"detailed_caption"}})
  → ImageDescribeService.handle_describe
       → blob_store.read_blob_bytes(image_cid)
       → backend.describe(bytes, task=...)
       → return ImageDescription serialised
```

### 6.2 Image generate lifecycle

```
caller → bus.stream("img.generate", (1,0), {input:{prompt:"...", steps:20}})
  → ImageGenerateService.handle_generate
       → backend.generate(...) with progress_cb
       → for each step: emit 'progress' frame
       → on completion: blob_store.write_blob(image)
       → emit 'done' frame with image_cid
```
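
One way to bridge the synchronous `progress_cb` to a streamed response is an `asyncio.Queue` drained by the handler. The sketch below assumes that approach; `fake_generate` stands in for a real backend, `write_blob` is passed in as a hypothetical coroutine, and the frame dictionaries are illustrative rather than the CAP2 wire format.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Optional

ProgressCb = Callable[[int, int], None]


async def fake_generate(prompt: str, *, steps: int, progress_cb: Optional[ProgressCb] = None) -> bytes:
    """Stand-in for backend.generate: reports each step, then returns image bytes."""
    for step in range(1, steps + 1):
        await asyncio.sleep(0)  # yield to the event loop between steps
        if progress_cb is not None:
            progress_cb(step, steps)
    return prompt.encode("utf-8")


async def handle_generate(
    request_input: dict,
    write_blob: Callable[[bytes], Awaitable[str]],
) -> AsyncIterator[dict]:
    """Yield one progress frame per step, store the image, then yield a done frame."""
    queue: asyncio.Queue = asyncio.Queue()

    def on_progress(step: int, total: int) -> None:
        queue.put_nowait({"type": "progress", "step": step, "total": total})

    task = asyncio.create_task(
        fake_generate(request_input["prompt"], steps=request_input.get("steps", 20), progress_cb=on_progress)
    )
    # A None sentinel is queued once the backend task finishes, after all progress frames.
    task.add_done_callback(lambda _finished: queue.put_nowait(None))

    while (frame := await queue.get()) is not None:
        yield frame

    image_cid = await write_blob(task.result())  # re-raises if generation failed
    yield {"type": "done", "image_cid": image_cid}
```

Because the blob is written only after the backend task completes, a failed generation never leaves a partial image in the store.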

### 6.3 Safety filters

- `img.generate` prompts pass through a configurable list of safety filters (regex blocklist plus an optional LLM-based classifier)
- Generation of identifiable persons is blocked by default (configurable: `config.vision.allow_identifiable_persons`)
- NSFW filter on output (Stable Diffusion has one built in; FLUX needs a separate model)
- Failed safety check → `bad_request` with `reason: "safety_filter"`

### 6.4 LoRA management

`FluxBackend.loras_available` lists the LoRAs found in `loras_dir`. A caller can request a specific LoRA in `params.lora`. Loading a LoRA takes a few seconds on first use; it is cached thereafter.

Christof's existing LoRAs (local-style, sketches, etc.) drop into the `loras_dir`. The backend auto-discovers them.
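
Auto-discovery can be a plain directory scan. The sketch below assumes LoRAs are stored as `.safetensors` files and are named by file stem; both are assumptions, as is the `discover_loras` helper itself. Note that a default such as `Path("~/.hearthnet/loras")` needs `expanduser()` before use.

```python
from pathlib import Path


def discover_loras(loras_dir: Path, suffixes: tuple[str, ...] = (".safetensors",)) -> list[str]:
    """Return the sorted stems of LoRA files in loras_dir, or [] if it is missing."""
    directory = loras_dir.expanduser()  # resolve a leading "~" to the home directory
    if not directory.is_dir():
        return []
    return sorted(
        path.stem
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

Returning an empty list for a missing directory lets a node without any LoRAs still advertise the base model.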

### 6.5 GPU pressure

Vision models are heavy. Recommended:

- One Florence-2 instance per node (always-loaded)
- FLUX/SD only loaded on-demand (warm on first request, kept hot for 5 minutes)
- `max_concurrent = 1` for FLUX; `2` for describe backends

These limits are declared in the capability descriptor so the bus throttles correctly.
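
On the backend side, the on-demand loading and concurrency limits could be enforced with a semaphore and an idle check, as in this sketch. `OnDemandModel` is a hypothetical helper, the model load is a placeholder, and the injectable `clock` exists only to make the idle timeout testable.

```python
import asyncio
import time


class OnDemandModel:
    """Loads lazily, limits concurrent jobs, and unloads after an idle period."""

    def __init__(self, max_concurrent: int = 1, idle_seconds: float = 300.0, clock=time.monotonic):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._load_lock = asyncio.Lock()
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._last_used = 0.0
        self._active = 0
        self.loaded = False
        self.load_count = 0

    async def run(self, job):
        """Run an async job under the concurrency limit, loading the model if needed."""
        async with self._slots:
            async with self._load_lock:  # only one caller performs the load
                if not self.loaded:
                    await asyncio.sleep(0)  # placeholder for the real model load
                    self.loaded = True
                    self.load_count += 1
            self._active += 1
            try:
                return await job()
            finally:
                self._active -= 1
                self._last_used = self._clock()

    def unload_if_idle(self) -> bool:
        """Unload when no job is running and the idle timeout has elapsed."""
        idle_for = self._clock() - self._last_used
        if self.loaded and self._active == 0 and idle_for >= self._idle_seconds:
            self.loaded = False  # a real backend would free GPU memory here
            return True
        return False
```

A periodic task would call `unload_if_idle()`; with `idle_seconds=300` this matches the five-minute keep-hot window recommended above.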

### 6.6 Multimodal LLM call routing

When a user sends a multimodal message:

```
UI → bus.stream("llm.chat", (2,0), {input:{messages:[{role:"user", content:[{type:"text",text:"..."},{type:"image",image_cid:"..."}]}]}})
  → Router.route filters candidates to those with modalities ⊇ {"vision"}
  → picks best (e.g. MinicpmV local, fall back to Anthropic API)
  → backend handles base64 / image-token-injection internally
```

If no vision-capable backend is online, the call returns `not_found` with a helpful `alt_capabilities` hint pointing to describe-then-text-only fallback (UI can offer this).
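
The filter, pick and fallback steps can be sketched as follows. The `Candidate` type, the local-first ranking, and the exact shape of the `alt_capabilities` hint are assumptions for illustration; the real router scoring lives in M03.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    modalities: frozenset[str]
    local: bool
    online: bool = True


def route_chat(candidates: list[Candidate], needs_vision: bool) -> dict:
    """Pick a backend, or return not_found with a describe-then-text hint."""
    eligible = [
        c for c in candidates
        if c.online and (not needs_vision or "vision" in c.modalities)
    ]
    if not eligible:
        # Assumed hint shape: the UI can caption the image, then ask a text-only LLM.
        return {"code": "not_found", "alt_capabilities": ["img.describe@1.0"]}
    # Assumed ranking: local backends first, then by name for determinism.
    best = min(eligible, key=lambda c: (not c.local, c.name))
    return {"backend": best.name}
```

With a local MiniCPM-V and a remote API backend both online, the local one wins; if the local node goes offline, the same call falls through to the API backend.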

---

## 7. Errors

| Condition | Wire code |
|-----------|-----------|
| Unknown task | `bad_request` |
| Image too large | `bad_request` |
| Prompt safety violation | `bad_request` (reason=safety_filter) |
| LoRA not found | `not_found` |
| GPU OOM | `capacity_exceeded` |
| Backend missing for requested task | `not_implemented` |

---

## 8. Configuration

```python
config.vision.enabled                 = True
config.vision.describe_backends = [
    DescribeBackendConfig(name="florence2", model="microsoft/Florence-2-large", device="auto"),
    DescribeBackendConfig(name="minicpm_v", model="openbmb/MiniCPM-V-2_6", device="auto"),
]
config.vision.generate_backends = [
    GenerateBackendConfig(name="flux", model="black-forest-labs/FLUX.1-dev",
                          loras_dir=Path("~/.hearthnet/loras"), device="auto"),
]
config.vision.allow_identifiable_persons = False
config.vision.safety_blocklist_file       = None   # optional regex file
```

---

## 9. Tests

### Unit
- `test_describe_descriptor_per_backend`
- `test_safety_filter_blocks_known_pattern`
- `test_lora_discovery`
- `test_oom_returns_capacity_exceeded`

### Integration
- `test_florence2_caption_sample` (test image)
- `test_flux_generate_with_lora_progress_frames`
- `test_multimodal_llm_routes_to_vision_backend`
- `test_describe_then_text_fallback_when_no_vision_llm`

---

## 10. Cross-references

| What | Where |
|------|-------|
| `img.*` wire | [CAP2 §4.13–4.14](../CAPABILITY_CONTRACT_v2.md) |
| Multimodal `llm.chat@2.0` | [CAP2 §4.23](../CAPABILITY_CONTRACT_v2.md) |
| LLM service extension | M04 (extended in Phase 2; see [00-OVERVIEW §1](../00-OVERVIEW.md)) |
| OCR overlap | [M17 §3.2 florence_ocr](M17-ocr.md) |
| Christof's pipelines | external, this is the integration |

---

## 11. Open questions

1. **Video**: Phase 3 considers `video.describe` and `video.generate` (LTX-Video). Not in Phase 2.
2. **Image editing (inpainting)**: Phase 2.5: `img.edit@1.0` capability. Reserved.
3. **Control nets (depth, edge, pose)**: Phase 2.5.
4. **3D generation**: Phase 3 with TripoSR or similar.
5. **Safety filter quality**: regex blocklist is weak. An LLM-as-judge classifier is better but adds latency. Configurable; default off.
6. **LoRA stacking**: caller specifies multiple `loras: ["...","..."]`. Implementable but adds attack surface (prompt-LoRA combos). Defer.