HearthNet-Nemotron

Running on Zero

App Files Files Community

HearthNet-Nemotron / docs /p2_p3 /M20-vision.md

Chris4K

p2, p3

70650b7 18 days ago

preview code

Raw

History Blame

12 kB

M20 — Vision Services

Spec version: v1.0 (Phase 2) Depends on: M03 (bus), M07 (blobs), M04 (LLM, extended), X04 (config), X03 (observability) Depended on by: M08 UI (image describe in ask tab; generate in tools), M21 tools (vision used by tool-augmented LLM), M17 OCR (Florence-2 OCR mode shared)

1. Responsibility

Two capability families:

img.describe@1.0 — given an image CID, produce a caption, tags, object list, or OCR
img.generate@1.0 — given a prompt, generate an image

Plus: extend the LLM llm.chat@2.0 request schema with multimodal content (text + image_cid in messages). The multimodal path goes through M04 backends that declare modalities: ["text", "vision"]. M20 is responsible for providing the vision backends those LLMs depend on.

Christof's existing pipelines: Florence-2 (describe), FLUX.1-dev with LoRAs (generate), MiniCPM-V (multimodal). All wired in.

2. File layout

hearthnet/services/image/
├── __init__.py
├── describe_service.py
├── generate_service.py
└── backends/
    ├── __init__.py
    ├── base.py             # ImageDescribeBackend, ImageGenerateBackend
    ├── florence2.py        # Microsoft Florence-2 large
    ├── minicpm_v.py        # OpenBMB MiniCPM-V (also usable for chat-with-vision)
    ├── flux.py             # black-forest-labs FLUX.1-dev with LoRA support
    └── stable_diffusion.py # Optional SD-XL fallback

3. Public API — describe

3.1 `backends/base.py`

@dataclass(frozen=True)
class ImageDescription:
    caption:    str
    detailed_caption: str | None
    tags:       list[str]
    objects:    list[dict]                  # [{label, bbox, confidence}]
    ocr_text:   str | None
    language:   str
    ms:         int

class ImageDescribeBackend(Protocol):
    name:               str
    tasks_supported:    list[str]           # subset of {"caption","detailed_caption","ocr","objects","tags"}
    languages:          list[str]
    max_pixels:         int

    async def warm(self) -> None: ...
    async def close(self) -> None: ...

    async def describe(
        self,
        image_bytes: bytes,
        *,
        task: str,
        language: str = "en",
    ) -> ImageDescription: ...

    def health(self) -> dict: ...

3.2 `describe_service.py`

class ImageDescribeService:
    name    = "image.describe"
    version = "1.0"

    def __init__(self, config: VisionConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One img.describe per backend. Params include backend name and tasks_supported."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_describe(self, req: RouteRequest) -> dict:
        """CAP2 §4.13."""

3.3 Concrete describe backends

class Florence2Backend(ImageDescribeBackend):
    def __init__(self, model: str = "microsoft/Florence-2-large", device: str = "auto"):
        ...

class MinicpmVBackend(ImageDescribeBackend):
    """Used both standalone (img.describe) and as an LLM vision backend (M04 extension)."""
    def __init__(self, model: str = "openbmb/MiniCPM-V-2_6", device: str = "auto"):
        ...

4. Public API — generate

4.1 `backends/base.py`

@dataclass(frozen=True)
class GenerationResult:
    image_bytes:    bytes
    width:          int
    height:         int
    format:         str             # "png" | "webp" | "jpg"
    seed:           int
    ms:             int

class ImageGenerateBackend(Protocol):
    name:               str
    models:             list[str]
    loras_available:    list[str]
    max_resolution:     tuple[int, int]
    min_resolution:     tuple[int, int]
    supports_negative_prompt: bool

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

    async def generate(
        self,
        prompt: str,
        *,
        model: str,
        lora: str | None,
        negative_prompt: str | None,
        width: int,
        height: int,
        steps: int,
        seed: int | None,
        progress_cb: Callable[[int, int], None] | None = None,
    ) -> GenerationResult: ...

    def health(self) -> dict: ...

4.2 `generate_service.py`

class ImageGenerateService:
    name    = "image.generate"
    version = "1.0"

    def __init__(self, config: VisionConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One img.generate per (backend, model) combo. params declare loras_available."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_generate(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.14.
        1. Generate (streaming progress frames)
        2. Store resulting image as blob
        3. Emit done with image_cid"""

4.3 Concrete generate backends

class FluxBackend(ImageGenerateBackend):
    """FLUX.1-dev with LoRA support; Christof's existing pipeline."""
    def __init__(
        self,
        model: str = "black-forest-labs/FLUX.1-dev",
        device: str = "auto",
        loras_dir: Path = Path("~/.hearthnet/loras"),
    ):
        ...

class StableDiffusionBackend(ImageGenerateBackend):
    """SD-XL fallback for nodes with smaller GPUs."""
    def __init__(self, model: str = "stabilityai/stable-diffusion-xl-base-1.0", device: str = "auto"):
        ...

5. Multimodal LLM extension (M04 hook)

5.1 Message content array

In llm.chat@2.0 (CAP2 §4.23), each messages[].content may be a list:

[
  {"type": "text",  "text": "Was ist auf diesem Bild?"},
  {"type": "image", "image_cid": "blake3:..."}
]

Backends that declare modalities: ["text", "vision"] in their descriptor must handle the array form. Backends that don't either:

Are skipped by the router (params_compatible returns False when message contains image and modalities ⊉ {"vision"})
Or fall back: extract text content only, ignore images (worse UX; not recommended)

5.2 Vision-capable backends in M04

These M04 backends gain a modalities: ["text","vision"] declaration in Phase 2:

Backend	Vision support
`MinicpmVBackend` (M04 entry — same model as M20's describe)	Yes; native multimodal
`AnthropicApiBackend`	Yes; Claude vision via API
`OpenAiApiBackend`	Yes; GPT-4V
`Llava` (new, optional)	Yes; LLaVA via llama.cpp
`LlamaCppBackend`	Yes if model is multimodal (LLaVA-format)
`OllamaBackend`	Yes for vision models
Others	No

The M04.LlmService._build_backends constructs these with their vision flag.

5.3 Image preprocessing

For LLM context, images are:

Loaded from blob store via CID
Resized to backend's preferred resolution (e.g. 1024×1024 for MiniCPM-V)
Encoded base64 or sent as bytes per backend's protocol

This is opaque to the caller — the multimodal messages array is the contract.

6. Behaviour

6.1 Image describe lifecycle

caller → bus.call("img.describe", (1,0), {input:{image_cid:..., task:"detailed_caption"}})
  → ImageDescribeService.handle_describe
       → blob_store.read_blob_bytes(image_cid)
       → backend.describe(bytes, task=...)
       → return ImageDescription serialised

6.2 Image generate lifecycle

caller → bus.stream("img.generate", (1,0), {input:{prompt:"...", steps:20}})
  → ImageGenerateService.handle_generate
       → backend.generate(...) with progress_cb
       → for each step: emit 'progress' frame
       → on completion: blob_store.write_blob(image)
       → emit 'done' frame with image_cid

6.3 Safety filters

img.generate prompts pass through a configurable safety filter list (regex blocklist + optional LLM-based classifier)
Generation of identifiable persons is blocked by default (configurable: config.vision.allow_identifiable_persons)
NSFW filter on output (Stable Diffusion has built-in; FLUX needs separate model)
Failed safety → bad_request with reason: "safety_filter"

6.4 LoRA management

FluxBackend.loras_available lists LoRAs found in loras_dir. Caller can request a specific LoRA in params.lora. Loading a LoRA takes a few seconds on first use; cached thereafter.

Christof's existing LoRAs (local-style, sketches, etc.) drop into the loras_dir. The backend auto-discovers them.

6.5 GPU pressure

Vision models are heavy. Recommended:

One Florence-2 instance per node (always-loaded)
FLUX/SD only loaded on-demand (warm on first request, kept hot for 5 minutes)
max_concurrent = 1 for FLUX; 2 for describe backends

These limits are declared in the capability descriptor so the bus throttles correctly.

6.6 Multimodal LLM call routing

When a user sends a multimodal message:

UI → bus.stream("llm.chat", (2,0), {input:{messages:[{role:"user", content:[{type:"text",text:"..."},{type:"image",image_cid:"..."}]}]}})
  → Router.route filters candidates to those with modalities ⊇ {"vision"}
  → picks best (e.g. MinicpmV local, fall back to Anthropic API)
  → backend handles base64 / image-token-injection internally

If no vision-capable backend is online, the call returns not_found with a helpful alt_capabilities hint pointing to describe-then-text-only fallback (UI can offer this).

7. Errors

Condition	Wire code
Unknown task	`bad_request`
Image too large	`bad_request`
Prompt safety violation	`bad_request` (reason=safety_filter)
LoRA not found	`not_found`
GPU OOM	`capacity_exceeded`
Backend missing for requested task	`not_implemented`

8. Configuration

config.vision.enabled                 = True
config.vision.describe_backends = [
    DescribeBackendConfig(name="florence2", model="microsoft/Florence-2-large", device="auto"),
    DescribeBackendConfig(name="minicpm_v", model="openbmb/MiniCPM-V-2_6", device="auto"),
]
config.vision.generate_backends = [
    GenerateBackendConfig(name="flux", model="black-forest-labs/FLUX.1-dev",
                          loras_dir=Path("~/.hearthnet/loras"), device="auto"),
]
config.vision.allow_identifiable_persons = False
config.vision.safety_blocklist_file       = None   # optional regex file

9. Tests

Unit

test_describe_descriptor_per_backend
test_safety_filter_blocks_known_pattern
test_lora_discovery
test_oom_returns_capacity_exceeded

Integration

test_florence2_caption_sample (test image)
test_flux_generate_with_lora_progress_frames
test_multimodal_llm_routes_to_vision_backend
test_describe_then_text_fallback_when_no_vision_llm

10. Cross-references

What	Where
`img.*` wire	CAP2 §4.13–4.14
Multimodal `llm.chat@2.0`	CAP2 §4.23
LLM service extension	M04 (extended in Phase 2 — see 00-OVERVIEW §1)
OCR overlap	M17 §3.2 florence_ocr
Christof's pipelines	external, this is the integration

11. Open questions

Video — Phase 3 considers video.describe and video.generate (LTX-Video). Not in Phase 2.
Image editing (inpainting) — Phase 2.5: img.edit@1.0 capability. Reserved.
Control nets (depth, edge, pose) — Phase 2.5.
3D generation — Phase 3 with TripoSR or similar.
Safety filter quality — regex blocklist is weak. An LLM-as-judge classifier is better but adds latency. Configurable; default off.
LoRA stacking — caller specifies multiple loras: ["...","..."]. Implementable but adds attack surface (prompt-LoRA combos). Defer.