# M17 — OCR Service

**Spec version:** v1.0 (Phase 2)
**Depends on:** M03 (bus), M07 (blobs, for reading image/PDF inputs and storing extracted text), M11 (embedding, when integrating with M05 RAG), X04 (config), X03 (observability)
**Depended on by:** M05 RAG (ingest of scanned PDFs), M20 vision (img.describe can fall back to OCR for text-heavy images)

---

## 1. Responsibility

Provide `ocr.image@1.0` and `ocr.pdf@1.0`. Wrap several OCR backends so the bus can route between them by document type and language. The service is engineered to handle:

- Modern German printed text (Tesseract)
- Handwriting (Microsoft TrOCR / Florence-2 in OCR mode)
- Historical scripts — Sütterlin, Kurrent, Latin, Arabic, Cyrillic (Christof's multilingual harness)
- Mixed-language documents (auto-detection)

---

## 2. File layout

```
hearthnet/services/ocr/
├── __init__.py
├── service.py            # OcrService
└── backends/
    ├── __init__.py
    ├── base.py           # OcrBackend Protocol
    ├── tesseract.py      # Tesseract via pytesseract
    ├── trocr.py          # Microsoft TrOCR via transformers
    ├── multilingual.py   # Christof's self-improving harness (CHURRO, olmOCR-2)
    └── florence_ocr.py   # Florence-2 OCR mode (overlap with M20)
```

---

## 3. Public API

### 3.1 `backends/base.py`

```python
# hearthnet/services/ocr/backends/base.py
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class OcrBlock:
    text:        str
    bbox:        tuple[int, int, int, int]   # (x, y, w, h) in pixel coords
    confidence:  float                       # 0..1
    language:    str | None

@dataclass(frozen=True)
class OcrPageResult:
    page:             int                    # 1-indexed
    text:             str                    # concatenated, reading order
    blocks:           list[OcrBlock]
    languages:        list[str]              # detected, ordered by prevalence
    confidence_mean:  float
    ms:               int

class OcrBackend(Protocol):
    name:               str       # "tesseract" | "trocr" | "multilingual" | "florence_ocr"
    languages_supported: list[str] # ISO 639-2 codes: "deu","eng","lat","ara","rus", ...
    supports_handwriting: bool
    max_image_pixels:   int

    async def warm(self) -> None: ...
    async def close(self) -> None: ...

    async def ocr_image(
        self,
        image_bytes: bytes,
        *,
        languages: list[str] | None,        # None → auto-detect
        preprocess: dict | None = None,     # options per §4.2 (deskew, denoise, binarize, dpi, contrast_normalise)
    ) -> OcrPageResult: ...

    async def ocr_pdf_page(
        self,
        pdf_bytes: bytes,
        *,
        page: int,
        languages: list[str] | None,
        preprocess: dict | None = None,
    ) -> OcrPageResult: ...

    def health(self) -> dict: ...
```

### 3.2 Concrete backends

| File | Class | Notes |
|------|-------|-------|
| `backends/tesseract.py` | `TesseractBackend(min_confidence: float = 0.5)` | Languages: any installed traineddata. Subprocess via pytesseract. |
| `backends/trocr.py` | `TrocrBackend(model: str = "microsoft/trocr-large-handwritten", device: str = "auto")` | Handwriting; CUDA preferred. |
| `backends/multilingual.py` | `MultilingualHarnessBackend(model: str = "self-improving-ocr-v1", device: str = "auto", harness_dir: Path)` | Christof's harness (CHURRO, olmOCR-2, retrieval-augmented correction, Kurrent/Sütterlin/Latin/Arabic/Cyrillic). Configured via `harness_dir`. |
| `backends/florence_ocr.py` | `FlorenceOcrBackend(model: str = "microsoft/Florence-2-large")` | Reuses M20 vision backend in OCR mode. |

`MultilingualHarnessBackend` is the headline integration for Christof's existing work. It exposes the same `OcrBackend` interface and lets the harness's internal page-level VLMs do the heavy lifting.

### 3.3 `service.py`

```python
# hearthnet/services/ocr/service.py
from collections.abc import AsyncIterator, Callable

from .backends.base import OcrBackend
class OcrService:
    name    = "ocr"
    version = "1.0"

    def __init__(self, config: OcrConfig, blob_store: BlobStore, event_log: EventLog):
        self._backends: dict[str, OcrBackend] = self._build_backends(config)

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One ocr.image entry per backend; one ocr.pdf entry per backend.
        params include backend name and supported languages."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    # --- handlers ---

    async def handle_image(self, req: RouteRequest) -> dict:
        """CAP2 §4.8.
        1. Resolve image_cid via blob_store
        2. Pick backend from params.backend
        3. Run; build response"""

    async def handle_pdf(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.9.
        1. Resolve doc_cid
        2. For each page in page_range:
           emit 'progress' frame
           emit 'page' frame
        3. If store_text:true, write concatenated text as new blob; emit done with text_cid"""
```
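The frame sequence described in the `handle_pdf` docstring can be sketched as an async generator. This is a minimal illustration, not the service code: the frame field names (`type`, `page`, `done`, `total`, `text`, `text_cid`) and the `ocr_page` / `store_text` callables are assumptions made for the example; the authoritative frame shapes are defined in CAP2 §4.9.

```python
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable
from typing import Optional


async def stream_pdf_frames(
    page_range: tuple[int, int],
    ocr_page: Callable[[int], Awaitable[str]],                   # hypothetical: OCR one page, return its text
    store_text: Optional[Callable[[str], Awaitable[str]]] = None,  # hypothetical: store a blob, return its CID
) -> AsyncIterator[dict]:
    """Yield a 'progress' and a 'page' frame per page, then one 'done' frame."""
    first, last = page_range            # assumed inclusive, 1-indexed
    total = last - first + 1
    texts: list[str] = []
    for finished, page in enumerate(range(first, last + 1)):
        # 'progress' is emitted before the page is processed.
        yield {"type": "progress", "page": page, "done": finished, "total": total}
        text = await ocr_page(page)
        texts.append(text)
        yield {"type": "page", "page": page, "text": text}
    final: dict = {"type": "done", "pages": total}
    if store_text is not None:
        # Mirrors store_text:true, where the concatenated text becomes a new blob.
        final["text_cid"] = await store_text("\n\n".join(texts))
    yield final


async def _demo() -> list[dict]:
    async def fake_ocr(page: int) -> str:
        return f"text of page {page}"

    async def fake_store(text: str) -> str:
        return f"cid-{len(text)}"       # stand-in for a real content address

    return [frame async for frame in stream_pdf_frames((1, 2), fake_ocr, fake_store)]


frames = asyncio.run(_demo())
```

Because pages are processed one at a time inside the generator, a caller sees progress as soon as each page finishes instead of waiting for the whole range.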

### 3.4 `params_compatible` predicate

```python
def params_compatible(offered: dict, requested: dict) -> bool:
    # backend must match if specified
    if "backend" in requested and requested["backend"] != offered.get("backend"):
        return False
    # all requested languages must be supported by this backend
    requested_langs = set(requested.get("languages", []))
    offered_langs   = set(offered.get("languages_supported", []))
    return requested_langs.issubset(offered_langs)
```
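A few calls make the matching rules concrete. The sketch below repeats the predicate with one assumed addition: it treats `"auto"` (§4.1) as a sentinel and drops it before the subset check, since a literal `["auto"]` would otherwise never be a subset of any backend's language list. Whether the service strips `"auto"` before matching or inside the predicate is not specified here, so that line is an assumption. The sample descriptor is illustrative.

```python
def params_compatible(offered: dict, requested: dict) -> bool:
    """Predicate from above, plus an assumed special case for "auto"."""
    # Backend must match if the caller named one.
    if "backend" in requested and requested["backend"] != offered.get("backend"):
        return False
    # Assumption: "auto" is a sentinel, not a language code, so ignore it here.
    requested_langs = set(requested.get("languages") or []) - {"auto"}
    offered_langs = set(offered.get("languages_supported", []))
    return requested_langs.issubset(offered_langs)


# Illustrative descriptor params for one backend.
tesseract = {"backend": "tesseract", "languages_supported": ["deu", "eng", "fra", "lat"]}

results = {
    "subset": params_compatible(tesseract, {"languages": ["deu", "eng"]}),
    "unsupported_language": params_compatible(tesseract, {"languages": ["deu", "ara"]}),
    "wrong_backend": params_compatible(tesseract, {"backend": "trocr"}),
    "auto": params_compatible(tesseract, {"languages": ["auto"]}),
    "no_languages": params_compatible(tesseract, {}),
}
```

Only `unsupported_language` and `wrong_backend` evaluate to `False`; a request with no language constraint matches every backend.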

---

## 4. Behaviour

### 4.1 Auto-language detection

If `languages` is omitted or set to `["auto"]`:

1. Sample 3 random pages
2. Run lightweight script detection (Tesseract `osd`)
3. Choose top 2 scripts
4. Re-run with that language set

Backends that don't support `osd` fall back to a fixed default (configured per backend).
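The sampling and voting steps can be sketched as follows. This is a minimal sketch under stated assumptions: `detect_page_script` is a hypothetical callable that returns a script name for one page (for Tesseract it could be parsed from the OSD output), and mapping script names to language codes for the re-run is backend-specific and left out.

```python
import random
from collections import Counter
from collections.abc import Callable
from typing import Optional

SAMPLE_PAGES = 3   # pages to sample, from §4.1
TOP_SCRIPTS = 2    # scripts to keep, from §4.1


def detect_scripts(
    page_count: int,
    detect_page_script: Callable[[int], str],   # hypothetical: 1-indexed page -> script name
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Sample up to SAMPLE_PAGES pages and return the most frequent scripts."""
    rng = rng or random.Random()
    # Short documents are sampled in full rather than raising on k > population.
    k = min(SAMPLE_PAGES, page_count)
    pages = rng.sample(range(1, page_count + 1), k=k)
    counts = Counter(detect_page_script(page) for page in pages)
    return [script for script, _ in counts.most_common(TOP_SCRIPTS)]


# Fake detector standing in for a real OSD call.
fake_scripts = {1: "Latin", 2: "Arabic", 3: "Latin"}
top = detect_scripts(3, fake_scripts.__getitem__)
single = detect_scripts(1, fake_scripts.__getitem__)
```

Passing a seeded `random.Random` keeps the page sample reproducible in tests.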

### 4.2 Preprocessing pipeline

`preprocess` dict supports:
- `deskew: bool` — straighten image
- `denoise: bool` — bilateral filter
- `binarize: bool` — Otsu threshold
- `dpi: int` — target resolution; upscale if lower
- `contrast_normalise: bool`

Default: `{"deskew": true, "denoise": false}`. Heavy preprocessing noticeably slows ingest, so enable the additional steps only for documents that need them.
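One way to apply the default and reject malformed options is sketched below. The function name `resolve_preprocess` and the choice to raise `ValueError` are assumptions for illustration; the option names and default values come from the list above.

```python
from typing import Optional

# Default from §4.2.
DEFAULT_PREPROCESS = {"deskew": True, "denoise": False}

# Option names from §4.2 with the type each one expects.
ALLOWED_OPTIONS = {
    "deskew": bool,
    "denoise": bool,
    "binarize": bool,
    "dpi": int,
    "contrast_normalise": bool,
}


def resolve_preprocess(requested: Optional[dict]) -> dict:
    """Overlay caller-supplied options on the defaults, validating names and types."""
    options = dict(DEFAULT_PREPROCESS)
    for key, value in (requested or {}).items():
        expected = ALLOWED_OPTIONS.get(key)
        if expected is None:
            raise ValueError(f"unknown preprocess option: {key}")
        is_bool = isinstance(value, bool)
        # bool is a subclass of int, so True/False must not pass as a dpi value.
        valid = is_bool if expected is bool else (isinstance(value, int) and not is_bool)
        if not valid:
            raise ValueError(f"preprocess option {key!r} expects {expected.__name__}")
        options[key] = value
    return options


defaults = resolve_preprocess(None)
custom = resolve_preprocess({"dpi": 400, "binarize": True})
```

Validating at the service boundary keeps every backend from having to re-check the same dictionary.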

### 4.3 Quality estimation

Each page result reports `confidence_mean`. Below 0.6, the service emits a `low_quality` warning frame and recommends:
- Trying a different backend (e.g. switch from Tesseract to multilingual harness for historic text)
- Raising DPI
- Re-scanning
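The threshold check can be sketched as below. Two assumptions are made: `confidence_mean` is taken as the unweighted mean of block confidences (the spec does not say whether blocks are weighted by length), and the warning frame's field names are hypothetical.

```python
from statistics import fmean
from typing import Optional

LOW_QUALITY_THRESHOLD = 0.6   # from §4.3

RECOMMENDATIONS = [
    "try a different backend",
    "raise DPI",
    "re-scan the document",
]


def low_quality_frame(page: int, block_confidences: list[float]) -> Optional[dict]:
    """Return a warning frame if the page's mean confidence is below the threshold."""
    # A page with no recognised blocks is treated as zero confidence (assumption).
    mean = fmean(block_confidences) if block_confidences else 0.0
    if mean >= LOW_QUALITY_THRESHOLD:
        return None
    return {
        "type": "low_quality",          # hypothetical frame shape
        "page": page,
        "confidence_mean": mean,
        "recommendations": RECOMMENDATIONS,
    }


good = low_quality_frame(1, [0.75, 1.0])
poor = low_quality_frame(2, [0.5, 0.25])
empty = low_quality_frame(3, [])
```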

### 4.4 Integration with RAG

[M05 §10 open question 4](../../modules/M05-rag.md) is now answered:

```
RagService.handle_ingest receives a scanned PDF (mime_type=image/scanned-pdf or detected)
  → bus.call("ocr.pdf", (1,0), {input:{doc_cid:..., store_text:true}})
  → receive text_cid
  → ingest the text_cid blob (which is now extracted plaintext) as normal
  → emit rag.document.ingested event (with metadata noting ocr_backend used)
```

The OCR text is stored as a separate blob, content-addressed. Re-ingestion is idempotent.

### 4.5 Page-range and parallelism

`page_range: [1, 50]` lets callers process part of a document. Pages are OCR'd serially within one call. For very large PDFs, callers should split the document into ranges and issue the calls concurrently; the bus enforces the per-capability concurrency limit.

`OCR_MAX_PAGES_PER_REQUEST = 50` is the hard ceiling per call.
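A caller-side helper for splitting a large document into compliant ranges might look like this. It is a sketch that assumes `page_range` is inclusive and 1-indexed, as the `[1, 50]` example with a 50-page ceiling suggests; the helper name is hypothetical.

```python
OCR_MAX_PAGES_PER_REQUEST = 50   # hard ceiling from §4.5


def split_page_ranges(
    first: int,
    last: int,
    chunk: int = OCR_MAX_PAGES_PER_REQUEST,
) -> list[tuple[int, int]]:
    """Split the inclusive range [first, last] into ranges of at most `chunk` pages."""
    if first < 1 or last < first:
        raise ValueError("page range must satisfy 1 <= first <= last")
    if not 1 <= chunk <= OCR_MAX_PAGES_PER_REQUEST:
        raise ValueError("chunk must be between 1 and OCR_MAX_PAGES_PER_REQUEST")
    return [
        (start, min(start + chunk - 1, last))
        for start in range(first, last + 1, chunk)
    ]


ranges = split_page_ranges(1, 120)
```

Each resulting tuple can be sent as one `ocr.pdf` call, for example with `asyncio.gather`, and the per-range results concatenated in order afterwards.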

### 4.6 PDF text-layer detection

Before running OCR, the service checks whether the PDF has an extractable text layer (via `pypdf`). If it does and a heuristic judges the layer usable, the service returns the text-layer content directly, which is much cheaper than OCR. Callers can force OCR with `force_ocr: true`.
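The spec leaves the usability heuristic open. One possible shape, shown only as an illustration, requires every page to carry a minimum amount of mostly printable text. Both thresholds and both function names are hypothetical; `extract_text_layer` uses `pypdf`'s `PdfReader` and `extract_text()`, imported lazily so the heuristic can be exercised without the library installed.

```python
import io

MIN_CHARS_PER_PAGE = 50      # hypothetical threshold
MIN_PRINTABLE_RATIO = 0.9    # hypothetical threshold


def text_layer_usable(page_texts: list[str]) -> bool:
    """Accept the text layer only if every page has enough, mostly printable, text."""
    if not page_texts:
        return False
    for text in page_texts:
        stripped = text.strip()
        if len(stripped) < MIN_CHARS_PER_PAGE:
            return False            # empty or near-empty layer, likely a scan
        printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
        if printable / len(stripped) < MIN_PRINTABLE_RATIO:
            return False            # garbage glyphs from a broken embedded font
    return True


def extract_text_layer(pdf_bytes: bytes) -> list[str]:
    """Return the embedded text of each page (empty string where none exists)."""
    from pypdf import PdfReader     # lazy import: only needed for real PDFs

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return [page.extract_text() or "" for page in reader.pages]


born_digital = text_layer_usable(["Lorem ipsum dolor sit amet. " * 5])
scanned = text_layer_usable([""])
garbled = text_layer_usable(["\x00\x01" * 40])
```

Requiring every page to pass is deliberately strict: a PDF that mixes born-digital and scanned pages falls through to OCR instead of silently losing the scanned pages.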

### 4.7 Christof's multilingual harness integration

The `MultilingualHarnessBackend` wraps Christof's existing self-improving OCR pipeline:

- Internal models: CHURRO (page-level VLM), olmOCR-2 (page-level VLM)
- Retrieval-augmented correction over a script-specific corpus
- Kurrent + Sütterlin support for German historical documents
- Latin / Arabic / Cyrillic script recognition

Configuration:

```python
config.ocr.multilingual_harness_dir = Path("/srv/ocr-harness")
config.ocr.multilingual_max_pages_concurrent = 2
```

The harness is GPU-intensive. On CPU-only nodes, it deregisters itself at startup.

---

## 5. Storage and lifecycle

- Input image/PDF: fetched from blob store via CID
- Output text: optionally stored as a new blob (`store_text: true`)
- Side-effect: `ocr.document.indexed` event in the community log (carries text_cid for downstream replication)

OCR backends do NOT cache results inside themselves. Reuse comes from caching at the RAG/blob layer (same `doc_cid` → already-extracted-text blob).

---

## 6. Errors

| Condition | Wire code |
|-----------|-----------|
| Unknown backend | `not_found` |
| Languages not supported by any backend | `bad_request` |
| Image too large (exceeds `max_image_pixels`) | `bad_request` |
| Page range exceeds document | `bad_request` |
| Page count exceeds `OCR_MAX_PAGES_PER_REQUEST` | `bad_request` |
| Backend crash | `internal_error` |
| GPU OOM on multilingual | `capacity_exceeded` (with retry_after) |
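
The table can be mirrored in code with a small exception hierarchy that carries the wire code. The class names, the error dictionary shape, and the unit of `retry_after` (seconds here) are assumptions for this sketch; only the wire codes come from the table.

```python
from typing import Optional


class OcrError(Exception):
    """Base class; subclasses set the wire code from the table above."""
    wire_code = "internal_error"

    def __init__(self, message: str, retry_after: Optional[int] = None):
        super().__init__(message)
        self.retry_after = retry_after   # assumed to be seconds


class UnknownBackend(OcrError):
    wire_code = "not_found"


class BadRequest(OcrError):
    # Unsupported languages, oversized images, bad or oversized page ranges.
    wire_code = "bad_request"


class CapacityExceeded(OcrError):
    # For example, GPU out-of-memory in the multilingual harness.
    wire_code = "capacity_exceeded"


def to_wire_error(exc: Exception) -> dict:
    """Translate an exception into a wire-level error payload."""
    if isinstance(exc, OcrError):
        error = {"code": exc.wire_code, "message": str(exc)}
        if exc.retry_after is not None:
            error["retry_after"] = exc.retry_after
        return error
    # Anything unexpected is reported as a backend crash without leaking details.
    return {"code": "internal_error", "message": "backend crashed"}


not_found = to_wire_error(UnknownBackend("no such backend: foo"))
oom = to_wire_error(CapacityExceeded("GPU out of memory", retry_after=30))
crash = to_wire_error(RuntimeError("boom"))
```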

---

## 7. Configuration

```python
config.ocr.enabled                 = True
config.ocr.backends = [
    OcrBackendConfig(name="tesseract", languages=["deu","eng","fra","lat"]),
    OcrBackendConfig(name="trocr", model="microsoft/trocr-large-handwritten"),
    OcrBackendConfig(name="multilingual", harness_dir=Path("/srv/ocr-harness")),
]
config.ocr.default_dpi             = OCR_DEFAULT_DPI    # 300
config.ocr.max_pages_per_request   = OCR_MAX_PAGES_PER_REQUEST
config.ocr.text_layer_first        = True
```

Constants: `OCR_DEFAULT_DPI` (300), `OCR_MAX_PAGES_PER_REQUEST` (50).

---

## 8. Tests

### Unit
- `test_descriptor_schema_validates_meta_schema`
- `test_params_compatible_language_subset`
- `test_text_layer_short_circuits_when_present`
- `test_force_ocr_bypasses_text_layer`
- `test_low_quality_emits_warning_frame`

### Integration
- `test_tesseract_german_print` (with a known sample)
- `test_trocr_handwriting_sample`
- `test_multilingual_kurrent_sample` (if harness installed)
- `test_rag_ingest_scanned_pdf_end_to_end`
- `test_ocr_pdf_progress_frames`

---

## 9. Cross-references

| What | Where |
|------|-------|
| `ocr.*` wire | [CAP2 §4.8–4.9](../CAPABILITY_CONTRACT_v2.md) |
| Blob store dependency | [M07 §3](../../modules/M07-file-blobs.md) |
| RAG integration | [M05 §10 q4](../../modules/M05-rag.md) — now resolved |
| Vision overlap (Florence-2 OCR mode) | [M20 §4.3](M20-vision.md) |
| `ocr.document.indexed` event | [CAP2 §7.1](../CAPABILITY_CONTRACT_v2.md) |
| Christof's harness | external project; this module is the integration surface |

---

## 10. Open questions

1. **Multilingual harness auto-update.** The harness self-improves; should the model versions be event-logged so we can replay deterministically? Yes — record the harness version hash in each `ocr.document.indexed` event.
2. **Manuscript-quality preprocessing.** Some historic documents need bespoke preprocessing (e.g. ink-bleed removal). Phase 2.5 might add a `preprocess_profile` enum.
3. **Reading order from layout.** Currently we trust the backend's reading order. For multi-column documents, an explicit layout model (LayoutLMv3) might help. Phase 3.
4. **Streaming OCR for very large images.** Currently atomic. Could tile and stream. Defer.