M20: Vision Services
Spec version: v1.0 (Phase 2)
Depends on: M03 (bus), M07 (blobs), M04 (LLM, extended), X04 (config), X03 (observability)
Depended on by: M08 UI (image describe in ask tab; generate in tools), M21 tools (vision used by tool-augmented LLM), M17 OCR (Florence-2 OCR mode shared)
1. Responsibility
Two capability families:
- img.describe@1.0: given an image CID, produce a caption, tags, object list, or OCR
- img.generate@1.0: given a prompt, generate an image
Plus: extend the LLM llm.chat@2.0 request schema with multimodal content (text + image_cid in messages). The multimodal path goes through M04 backends that declare modalities: ["text", "vision"]. M20 is responsible for providing the vision backends those LLMs depend on.
Christof's existing pipelines: Florence-2 (describe), FLUX.1-dev with LoRAs (generate), MiniCPM-V (multimodal). All wired in.
2. File layout
hearthnet/services/image/
├── __init__.py
├── describe_service.py
├── generate_service.py
└── backends/
    ├── __init__.py
    ├── base.py               # ImageDescribeBackend, ImageGenerateBackend
    ├── florence2.py          # Microsoft Florence-2 large
    ├── minicpm_v.py          # OpenBMB MiniCPM-V (also usable for chat-with-vision)
    ├── flux.py               # black-forest-labs FLUX.1-dev with LoRA support
    └── stable_diffusion.py   # Optional SD-XL fallback
3. Public API: describe
3.1 backends/base.py
@dataclass(frozen=True)
class ImageDescription:
    caption: str
    detailed_caption: str | None
    tags: list[str]
    objects: list[dict]              # [{label, bbox, confidence}]
    ocr_text: str | None
    language: str
    ms: int

class ImageDescribeBackend(Protocol):
    name: str
    tasks_supported: list[str]       # subset of {"caption","detailed_caption","ocr","objects","tags"}
    languages: list[str]
    max_pixels: int

    async def warm(self) -> None: ...
    async def close(self) -> None: ...
    async def describe(
        self,
        image_bytes: bytes,
        *,
        task: str,
        language: str = "en",
    ) -> ImageDescription: ...
    def health(self) -> dict: ...
3.2 describe_service.py
class ImageDescribeService:
    name = "image.describe"
    version = "1.0"

    def __init__(self, config: VisionConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One img.describe per backend. Params include backend name and tasks_supported."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_describe(self, req: RouteRequest) -> dict:
        """CAP2 §4.13."""
3.3 Concrete describe backends
class Florence2Backend(ImageDescribeBackend):
    def __init__(self, model: str = "microsoft/Florence-2-large", device: str = "auto"):
        ...

class MinicpmVBackend(ImageDescribeBackend):
    """Used both standalone (img.describe) and as an LLM vision backend (M04 extension)."""
    def __init__(self, model: str = "openbmb/MiniCPM-V-2_6", device: str = "auto"):
        ...
4. Public API: generate
4.1 backends/base.py
@dataclass(frozen=True)
class GenerationResult:
    image_bytes: bytes
    width: int
    height: int
    format: str                      # "png" | "webp" | "jpg"
    seed: int
    ms: int

class ImageGenerateBackend(Protocol):
    name: str
    models: list[str]
    loras_available: list[str]
    max_resolution: tuple[int, int]
    min_resolution: tuple[int, int]
    supports_negative_prompt: bool

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...
    async def generate(
        self,
        prompt: str,
        *,
        model: str,
        lora: str | None,
        negative_prompt: str | None,
        width: int,
        height: int,
        steps: int,
        seed: int | None,
        progress_cb: Callable[[int, int], None] | None = None,
    ) -> GenerationResult: ...
    def health(self) -> dict: ...
4.2 generate_service.py
class ImageGenerateService:
    name = "image.generate"
    version = "1.0"

    def __init__(self, config: VisionConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One img.generate per (backend, model) combo. params declare loras_available."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_generate(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.14.

        1. Generate (streaming progress frames)
        2. Store resulting image as blob
        3. Emit done with image_cid"""
4.3 Concrete generate backends
class FluxBackend(ImageGenerateBackend):
    """FLUX.1-dev with LoRA support; Christof's existing pipeline."""
    def __init__(
        self,
        model: str = "black-forest-labs/FLUX.1-dev",
        device: str = "auto",
        loras_dir: Path = Path("~/.hearthnet/loras"),
    ):
        ...

class StableDiffusionBackend(ImageGenerateBackend):
    """SD-XL fallback for nodes with smaller GPUs."""
    def __init__(self, model: str = "stabilityai/stable-diffusion-xl-base-1.0", device: str = "auto"):
        ...
5. Multimodal LLM extension (M04 hook)
5.1 Message content array
In llm.chat@2.0 (CAP2 §4.23), each messages[].content may be a list:
[
  {"type": "text", "text": "Was ist auf diesem Bild?"},
  {"type": "image", "image_cid": "blake3:..."}
]
Backends that declare modalities: ["text", "vision"] in their descriptor must handle the array form. Backends that don't are handled in one of two ways:
- They are skipped by the router (params_compatible returns False when a message contains an image and modalities ⊉ {"vision"})
- Or they fall back: extract the text content only and ignore images (worse UX; not recommended)
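The router-side check described above can be sketched as a small predicate. This is a sketch under assumptions, not the M03 contract: the actual `ParamsPredicate` signature is not given in this spec, so the two-argument form and the `message_has_image` helper are hypothetical.

```python
def message_has_image(message: dict) -> bool:
    """True if the message uses the array form and contains an image part."""
    content = message.get("content")
    if isinstance(content, str) or content is None:
        return False  # plain-text message
    return any(part.get("type") == "image" for part in content)


def params_compatible(descriptor_params: dict, request_input: dict) -> bool:
    """Hypothetical predicate: reject image-bearing requests on text-only backends."""
    needs_vision = any(
        message_has_image(m) for m in request_input.get("messages", [])
    )
    if not needs_vision:
        return True
    # Assumption: a descriptor without a modalities entry is text-only.
    return "vision" in descriptor_params.get("modalities", ["text"])
```

Text-only requests stay routable to every backend, so adding vision support does not narrow the candidate set for existing callers.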
5.2 Vision-capable backends in M04
These M04 backends gain a modalities: ["text","vision"] declaration in Phase 2:
| Backend | Vision support |
|---|---|
| MinicpmVBackend (M04 entry; same model as M20's describe) | Yes; native multimodal |
| AnthropicApiBackend | Yes; Claude vision via API |
| OpenAiApiBackend | Yes; GPT-4V |
| Llava (new, optional) | Yes; LLaVA via llama.cpp |
| LlamaCppBackend | Yes if model is multimodal (LLaVA-format) |
| OllamaBackend | Yes for vision models |
| Others | No |
The M04.LlmService._build_backends constructs these with their vision flag.
5.3 Image preprocessing
For LLM context, images are:
- Loaded from blob store via CID
- Resized to backend's preferred resolution (e.g. 1024×1024 for MiniCPM-V)
- Encoded base64 or sent as bytes per backend's protocol
This is opaque to the caller: the multimodal messages array is the contract.
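The resize and encode steps above can be sketched with the standard library alone. The assumptions are loud here: the spec does not say whether aspect ratio is preserved or whether small images are upscaled (this sketch preserves the ratio and never upscales), and `fit_within`, `to_base64_part`, and the payload shape are hypothetical. The actual pixel resampling would be done by an imaging library and is left out.

```python
import base64


def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale (width, height) down to fit a max_side square, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    scale = min(1.0, max_side / max(width, height))  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_base64_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Hypothetical payload shape; each real backend defines its own."""
    return {"mime": mime, "data": base64.b64encode(image_bytes).decode("ascii")}
```

For example, a 2048×1024 image targeted at a 1024-pixel backend becomes 1024×512, while an 800×600 image is passed through unchanged.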
6. Behaviour
6.1 Image describe lifecycle
caller → bus.call("img.describe", (1,0), {input:{image_cid:..., task:"detailed_caption"}})
  → ImageDescribeService.handle_describe
  → blob_store.read_blob_bytes(image_cid)
  → backend.describe(bytes, task=...)
  → return ImageDescription serialised
6.2 Image generate lifecycle
caller → bus.stream("img.generate", (1,0), {input:{prompt:"...", steps:20}})
  → ImageGenerateService.handle_generate
  → backend.generate(...) with progress_cb
  → for each step: emit 'progress' frame
  → on completion: blob_store.write_blob(image)
  → emit 'done' frame with image_cid
6.3 Safety filters
- img.generate prompts pass through a configurable safety filter list (regex blocklist + optional LLM-based classifier)
- Generation of identifiable persons is blocked by default (configurable: config.vision.allow_identifiable_persons)
- NSFW filter on output (Stable Diffusion has built-in; FLUX needs separate model)
- Failed safety → bad_request with reason: "safety_filter"
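The regex blocklist stage of the filter can be sketched as follows. The file format (one pattern per line, `#` comments, case-insensitive matching) is an assumption, as are the names `SafetyViolation`, `load_blocklist`, and `check_prompt`; the optional LLM-based classifier and the output NSFW filter are not covered.

```python
import re


class SafetyViolation(Exception):
    """Hypothetical exception; the service would map it to bad_request."""


def load_blocklist(lines: list[str]) -> list[re.Pattern]:
    """Compile one pattern per line, skipping blank lines and '#' comments."""
    return [
        re.compile(line.strip(), re.IGNORECASE)
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]


def check_prompt(prompt: str, blocklist: list[re.Pattern]) -> None:
    """Raise SafetyViolation on the first pattern that matches the prompt."""
    for pattern in blocklist:
        if pattern.search(prompt):
            raise SafetyViolation(pattern.pattern)
```

Running the check before any model is warmed keeps rejected prompts from consuming GPU time.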
6.4 LoRA management
FluxBackend.loras_available lists LoRAs found in loras_dir. Caller can request a specific LoRA in params.lora. Loading a LoRA takes a few seconds on first use; cached thereafter.
Christof's existing LoRAs (local-style, sketches, etc.) drop into the loras_dir. The backend auto-discovers them.
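Auto-discovery could look roughly like the sketch below. It assumes LoRAs are stored as `.safetensors` files directly inside loras_dir and are named by file stem; neither detail is stated in this spec, and `discover_loras` is a hypothetical helper name.

```python
from pathlib import Path


def discover_loras(loras_dir: Path, suffix: str = ".safetensors") -> list[str]:
    """Return sorted LoRA names (file stems) found directly in loras_dir."""
    root = loras_dir.expanduser()  # Path("~/...") is not expanded automatically
    if not root.is_dir():
        return []  # a missing directory simply means no LoRAs
    return sorted(
        p.stem for p in root.iterdir() if p.is_file() and p.suffix == suffix
    )
```

The explicit `expanduser()` matters because the default `Path("~/.hearthnet/loras")` would otherwise be treated as a relative path starting with a literal `~` directory.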
6.5 GPU pressure
Vision models are heavy. Recommended:
- One Florence-2 instance per node (always-loaded)
- FLUX/SD only loaded on-demand (warm on first request, kept hot for 5 minutes)
- max_concurrent = 1 for FLUX; 2 for describe backends
These limits are declared in the capability descriptor so the bus throttles correctly.
6.6 Multimodal LLM call routing
When a user sends a multimodal message:
UI → bus.stream("llm.chat", (2,0), {input:{messages:[{role:"user", content:[{type:"text",text:"..."},{type:"image",image_cid:"..."}]}]}})
  → Router.route filters candidates to those with modalities ⊇ {"vision"}
  → picks best (e.g. MinicpmV local, fall back to Anthropic API)
  → backend handles base64 / image-token-injection internally
If no vision-capable backend is online, the call returns not_found with a helpful alt_capabilities hint pointing to describe-then-text-only fallback (UI can offer this).
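The describe-then-text-only fallback mentioned above could be implemented on the caller side roughly as follows. This is a sketch under assumptions: `flatten_to_text` is a hypothetical helper, `describe` stands for any async callable mapping an image CID to a caption (in practice a bus call to img.describe), and the `[Image: ...]` inlining format is illustrative.

```python
import asyncio


async def flatten_to_text(messages: list[dict], describe) -> list[dict]:
    """Replace image parts with captions so a text-only backend can answer."""
    flattened = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            flattened.append(message)  # already text-only
            continue
        parts = []
        for part in content:
            if part["type"] == "text":
                parts.append(part["text"])
            elif part["type"] == "image":
                caption = await describe(part["image_cid"])
                parts.append(f"[Image: {caption}]")
        flattened.append({**message, "content": "\n".join(parts)})
    return flattened
```

The flattened messages can then be sent to llm.chat as an ordinary text request, at the cost of the LLM seeing only the caption rather than the image.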
7. Errors
| Condition | Wire code |
|---|---|
| Unknown task | bad_request |
| Image too large | bad_request |
| Prompt safety violation | bad_request (reason=safety_filter) |
| LoRA not found | not_found |
| GPU OOM | capacity_exceeded |
| Backend missing for requested task | not_implemented |
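The table above can be kept in one place as a lookup, so handlers raise a condition and a single function produces the wire error. The condition keys, the `VisionError` class, and the error body shape are all hypothetical; only the wire codes and the `safety_filter` reason come from the table.

```python
WIRE_CODES = {
    "unknown_task": "bad_request",
    "image_too_large": "bad_request",
    "safety_filter": "bad_request",
    "lora_not_found": "not_found",
    "gpu_oom": "capacity_exceeded",
    "backend_missing": "not_implemented",
}


class VisionError(Exception):
    """Hypothetical service exception carrying one of the conditions above."""

    def __init__(self, condition: str, message: str = "") -> None:
        if condition not in WIRE_CODES:
            raise KeyError(condition)
        super().__init__(message or condition)
        self.condition = condition


def to_wire(exc: VisionError) -> dict:
    """Build an illustrative error body from a VisionError."""
    body = {"code": WIRE_CODES[exc.condition], "message": str(exc)}
    if exc.condition == "safety_filter":
        body["reason"] = "safety_filter"
    return body
```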
8. Configuration
config.vision.enabled = True
config.vision.describe_backends = [
    DescribeBackendConfig(name="florence2", model="microsoft/Florence-2-large", device="auto"),
    DescribeBackendConfig(name="minicpm_v", model="openbmb/MiniCPM-V-2_6", device="auto"),
]
config.vision.generate_backends = [
    GenerateBackendConfig(name="flux", model="black-forest-labs/FLUX.1-dev",
                          loras_dir=Path("~/.hearthnet/loras"), device="auto"),
]
config.vision.allow_identifiable_persons = False
config.vision.safety_blocklist_file = None   # optional regex file
9. Tests
Unit
- test_describe_descriptor_per_backend
- test_safety_filter_blocks_known_pattern
- test_lora_discovery
- test_oom_returns_capacity_exceeded
Integration
- test_florence2_caption_sample (test image)
- test_flux_generate_with_lora_progress_frames
- test_multimodal_llm_routes_to_vision_backend
- test_describe_then_text_fallback_when_no_vision_llm
10. Cross-references
| What | Where |
|---|---|
| img.* wire | CAP2 §4.13–4.14 |
| Multimodal llm.chat@2.0 | CAP2 §4.23 |
| LLM service extension | M04 (extended in Phase 2; see 00-OVERVIEW §1) |
| OCR overlap | M17 §3.2 florence_ocr |
| Christof's pipelines | external, this is the integration |
11. Open questions
- Video: Phase 3 considers video.describe and video.generate (LTX-Video). Not in Phase 2.
- Image editing (inpainting): Phase 2.5, as an img.edit@1.0 capability. Reserved.
- Control nets (depth, edge, pose): Phase 2.5.
- 3D generation: Phase 3 with TripoSR or similar.
- Safety filter quality: regex blocklist is weak. An LLM-as-judge classifier is better but adds latency. Configurable; default off.
- LoRA stacking: caller specifies multiple loras: ["...","..."]. Implementable but adds attack surface (prompt-LoRA combos). Defer.