M17: OCR Service

Spec version: v1.0 (Phase 2)
Depends on: M03 (bus), M07 (blobs, for reading image/PDF inputs and storing extracted text), M11 (embedding, when integrating with M05 RAG), X04 (config), X03 (observability)
Depended on by: M05 RAG (ingest of scanned PDFs), M20 vision (img.describe can fall back to OCR for text-heavy images)


1. Responsibility

Provide ocr.image@1.0 and ocr.pdf@1.0. Wrap several OCR backends so the bus can route between them by document type and language. The backends are chosen to cover:

  • Modern German printed text (Tesseract)
  • Handwriting (TrOCR / Florence-2 OCR mode)
  • Historical scripts: Sütterlin, Kurrent, Latin, Arabic, Cyrillic (Christof's multilingual harness)
  • Mixed-language documents (auto-detection)

2. File layout

hearthnet/services/ocr/
├── __init__.py
├── service.py            # OcrService
└── backends/
    ├── __init__.py
    ├── base.py           # OcrBackend Protocol
    ├── tesseract.py      # Tesseract via pytesseract
    ├── trocr.py          # Microsoft TrOCR via transformers
    ├── multilingual.py   # Christof's self-improving harness (CHURRO, olmOCR-2)
    └── florence_ocr.py   # Florence-2 OCR mode (overlap with M20)

3. Public API

3.1 backends/base.py

# hearthnet/services/ocr/backends/base.py
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class OcrBlock:
    text:        str
    bbox:        tuple[int, int, int, int]   # (x, y, w, h) in pixel coords
    confidence:  float                       # 0..1
    language:    str | None

@dataclass(frozen=True)
class OcrPageResult:
    page:             int                    # 1-indexed
    text:             str                    # concatenated, reading order
    blocks:           list[OcrBlock]
    languages:        list[str]              # detected, ordered by prevalence
    confidence_mean:  float
    ms:               int

class OcrBackend(Protocol):
    name:               str       # "tesseract" | "trocr" | "multilingual" | "florence_ocr"
    languages_supported: list[str] # ISO 639-2 codes: "deu","eng","lat","ara","rus", ...
    supports_handwriting: bool
    max_image_pixels:   int

    async def warm(self) -> None: ...
    async def close(self) -> None: ...

    async def ocr_image(
        self,
        image_bytes: bytes,
        *,
        languages: list[str] | None,        # None → auto-detect
        preprocess: dict | None = None,     # {deskew, denoise, dpi}
    ) -> OcrPageResult: ...

    async def ocr_pdf_page(
        self,
        pdf_bytes: bytes,
        *,
        page: int,
        languages: list[str] | None,
        preprocess: dict | None = None,
    ) -> OcrPageResult: ...

    def health(self) -> dict: ...

3.2 Concrete backends

| File | Class | Notes |
| --- | --- | --- |
| backends/tesseract.py | TesseractBackend(min_confidence: float = 0.5) | Languages: any installed traineddata. Subprocess via pytesseract. |
| backends/trocr.py | TrocrBackend(model: str = "microsoft/trocr-large-handwritten", device: str = "auto") | Handwriting; CUDA preferred. |
| backends/multilingual.py | MultilingualHarnessBackend(model: str = "self-improving-ocr-v1", device: str = "auto", harness_dir: Path) | Christof's harness (CHURRO, olmOCR-2, retrieval-augmented correction, Kurrent/Sütterlin/Latin/Arabic/Cyrillic). Configured via harness_dir. |
| backends/florence_ocr.py | FlorenceOcrBackend(model: str = "microsoft/Florence-2-large") | Reuses M20 vision backend in OCR mode. |

MultilingualHarnessBackend is the headline integration for Christof's existing work. It exposes the same OcrBackend interface and lets the harness's internal page-level VLMs do the heavy lifting.

3.3 service.py

# hearthnet/services/ocr/service.py
class OcrService:
    name    = "ocr"
    version = "1.0"

    def __init__(self, config: OcrConfig, blob_store: BlobStore, event_log: EventLog):
        self._backends: dict[str, OcrBackend] = self._build_backends(config)

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One ocr.image entry per backend; one ocr.pdf entry per backend.
        params include backend name and supported languages."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    # --- handlers ---

    async def handle_image(self, req: RouteRequest) -> dict:
        """CAP2 §4.8.
        1. Resolve image_cid via blob_store
        2. Pick backend from params.backend
        3. Run; build response"""

    async def handle_pdf(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.9.
        1. Resolve doc_cid
        2. For each page in page_range:
           emit 'progress' frame
           emit 'page' frame
        3. If store_text:true, write concatenated text as new blob; emit done with text_cid"""
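
The frame sequence described in handle_pdf can be sketched as an async generator. This is an illustration only: the frame field names ("type", "done", "total") are assumptions, since the normative shapes are defined in CAP2 §4.9, and ocr_page and store_text are hypothetical stand-ins for a backend's ocr_pdf_page and the blob store.

```python
from __future__ import annotations

import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_pdf_frames(
    page_range: tuple[int, int],
    ocr_page: Callable[[int], Awaitable[str]],            # hypothetical: wraps ocr_pdf_page
    store_text: Callable[[str], Awaitable[str]] | None = None,  # hypothetical: blob write
) -> AsyncIterator[dict]:
    """Yield a 'progress' and a 'page' frame per page, then one 'done' frame."""
    first, last = page_range                      # assumed inclusive, 1-indexed
    total = last - first + 1
    texts: list[str] = []
    for done, page in enumerate(range(first, last + 1)):
        yield {"type": "progress", "page": page, "done": done, "total": total}
        text = await ocr_page(page)               # pages are processed serially
        texts.append(text)
        yield {"type": "page", "page": page, "text": text}
    final: dict = {"type": "done", "pages": total}
    if store_text is not None:                    # corresponds to store_text: true
        final["text_cid"] = await store_text("\n\n".join(texts))
    yield final


async def _demo() -> list[dict]:
    async def fake_ocr(page: int) -> str:
        return f"text of page {page}"

    async def fake_store(text: str) -> str:
        return "cid-placeholder"

    return [frame async for frame in stream_pdf_frames((1, 2), fake_ocr, fake_store)]


frames = asyncio.run(_demo())
```

A caller would consume the frames with async for and treat the final done frame as the only place text_cid can appear.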

3.4 params_compatible predicate

def params_compatible(offered: dict, requested: dict) -> bool:
    # backend must match if specified
    if "backend" in requested and requested["backend"] != offered.get("backend"):
        return False
    # languages may be omitted, None, or ["auto"] (see 4.1); none of these constrain the backend
    requested_langs = set(requested.get("languages") or []) - {"auto"}
    offered_langs   = set(offered.get("languages_supported", []))
    # all explicitly requested languages must be supported by this backend
    return requested_langs.issubset(offered_langs)

4. Behaviour

4.1 Auto-language detection

If languages is omitted or set to ["auto"]:

  1. Sample 3 random pages
  2. Run lightweight script detection (Tesseract osd)
  3. Choose top 2 scripts
  4. Re-run with that language set

Backends that don't support osd fall back to a fixed default (configured per backend).
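
The detection steps above could look roughly like the sketch below. It is not the service's implementation: detect_script stands in for a per-page Tesseract OSD call, and the script_to_langs mapping from script names to language codes is a hypothetical helper that the spec does not define.

```python
from __future__ import annotations

import random
from collections import Counter
from typing import Callable


def sample_pages(page_count: int, k: int = 3, seed: int | None = None) -> list[int]:
    """Pick up to k distinct 1-indexed pages at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, page_count + 1), min(k, page_count)))


def top_scripts(per_page_scripts: list[str], n: int = 2) -> list[str]:
    """Return the n most frequent scripts; ties keep first-seen order."""
    return [script for script, _ in Counter(per_page_scripts).most_common(n)]


def resolve_languages(
    requested: list[str] | None,
    page_count: int,
    detect_script: Callable[[int], str] | None,   # None models a backend without OSD
    script_to_langs: dict[str, list[str]],        # hypothetical mapping, e.g. "Latin" -> ["deu", "eng"]
    default: list[str],                           # the backend's configured fallback
    seed: int | None = None,
) -> list[str]:
    if requested and requested != ["auto"]:
        return requested                          # caller was explicit, no detection
    if detect_script is None:
        return default                            # no OSD support: fixed default
    scripts = [detect_script(p) for p in sample_pages(page_count, seed=seed)]
    langs: list[str] = []
    for script in top_scripts(scripts):
        for lang in script_to_langs.get(script, []):
            if lang not in langs:
                langs.append(lang)
    return langs or default                       # unknown scripts also fall back
```

The returned list is what the second OCR pass would receive as its languages argument.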

4.2 Preprocessing pipeline

preprocess dict supports:

  • deskew: bool (straighten image)
  • denoise: bool (bilateral filter)
  • binarize: bool (Otsu threshold)
  • dpi: int (target resolution; upscale if lower)
  • contrast_normalise: bool

Default: {"deskew": true, "denoise": false}. Heavy preprocessing slows ingest noticeably, so enable the extra steps only for documents that need them.
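
One way to apply these defaults is to merge the caller's preprocess dict over them and reject unknown keys. The sketch below assumes that unknown or mistyped options should be rejected rather than ignored, which the spec does not state; resolve_preprocess is a hypothetical helper name.

```python
from __future__ import annotations

# Defaults and option types as listed in this section.
DEFAULT_PREPROCESS = {"deskew": True, "denoise": False}
ALLOWED_OPTIONS = {
    "deskew": bool,
    "denoise": bool,
    "binarize": bool,
    "dpi": int,
    "contrast_normalise": bool,
}


def _type_ok(value: object, expected: type) -> bool:
    # bool is a subclass of int, so True/False must not pass as a dpi value.
    if expected is int:
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, expected)


def resolve_preprocess(requested: dict | None) -> dict:
    """Merge caller options over the defaults, validating keys and types."""
    options = dict(DEFAULT_PREPROCESS)
    for key, value in (requested or {}).items():
        expected = ALLOWED_OPTIONS.get(key)
        if expected is None:
            raise ValueError(f"unknown preprocess option: {key}")
        if not _type_ok(value, expected):
            raise ValueError(f"{key} must be {expected.__name__}")
        options[key] = value
    return options
```

Calling resolve_preprocess(None) yields the documented default, while resolve_preprocess({"dpi": 400}) keeps deskew enabled and adds the target resolution.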

4.3 Quality estimation

Each page result reports confidence_mean. Below 0.6, the service emits a low_quality warning frame and recommends:

  • Trying a different backend (e.g. switch from Tesseract to multilingual harness for historic text)
  • Raising DPI
  • Re-scanning
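
A minimal sketch of the threshold check follows. It assumes confidence_mean is the arithmetic mean of block confidences and invents the warning frame's field names for illustration; the real frame shape is defined by the wire spec, not here.

```python
from __future__ import annotations

LOW_QUALITY_THRESHOLD = 0.6   # from this section


def confidence_mean(block_confidences: list[float]) -> float:
    """Assumed definition: plain average of block confidences, 0.0 if empty."""
    if not block_confidences:
        return 0.0
    return sum(block_confidences) / len(block_confidences)


def low_quality_frame(page: int, mean: float, backend: str) -> dict | None:
    """Return a warning frame when the page mean is below the threshold."""
    if mean >= LOW_QUALITY_THRESHOLD:
        return None
    return {
        "type": "warning",            # hypothetical field names
        "code": "low_quality",
        "page": page,
        "backend": backend,
        "confidence_mean": round(mean, 3),
        "suggestions": ["try_other_backend", "raise_dpi", "rescan"],
    }
```

A page at exactly 0.6 produces no warning here, matching the "below 0.6" wording.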

4.4 Integration with RAG

M05 §10 open question 4 is now answered:

RagService.handle_ingest receives a scanned PDF (mime_type=image/scanned-pdf or detected)
  → bus.call("ocr.pdf", (1,0), {input:{doc_cid:..., store_text:true}})
  → receive text_cid
  → ingest the text_cid blob (which is now extracted plaintext) as normal
  → emit rag.document.ingested event (with metadata noting ocr_backend used)

The OCR text is stored as a separate blob, content-addressed. Re-ingestion is idempotent.

4.5 Page-range and parallelism

page_range: [1, 50] lets callers process partial documents. Pages are OCR'd serially within one call. For very large PDFs, callers should split the document into ranges and issue the calls concurrently; the bus enforces per-capability concurrency.

OCR_MAX_PAGES_PER_REQUEST = 50 is the hard ceiling per call.
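
On the caller side, splitting a large document into request-sized ranges might look like this. The sketch assumes page_range is inclusive and 1-indexed, which the [1, 50] example with a 50-page ceiling suggests but does not state; split_page_range is a hypothetical helper.

```python
OCR_MAX_PAGES_PER_REQUEST = 50   # hard ceiling from this section


def split_page_range(
    first: int,
    last: int,
    chunk: int = OCR_MAX_PAGES_PER_REQUEST,
) -> list[tuple[int, int]]:
    """Split [first, last] (inclusive) into consecutive ranges of at most chunk pages."""
    if first < 1 or last < first:
        raise ValueError("page range must satisfy 1 <= first <= last")
    if not 1 <= chunk <= OCR_MAX_PAGES_PER_REQUEST:
        raise ValueError("chunk must be between 1 and OCR_MAX_PAGES_PER_REQUEST")
    return [
        (start, min(start + chunk - 1, last))
        for start in range(first, last + 1, chunk)
    ]


ranges = split_page_range(1, 120)
```

Each tuple can then be sent as its own ocr.pdf call, for example with asyncio.gather, leaving the bus to throttle them.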

4.6 PDF text-layer detection

Before running OCR, the service checks whether the PDF has an extractable text layer (via pypdf). If it does and a heuristic judges the layer's quality acceptable, the service returns the text-layer content directly, which is much cheaper than OCR. Callers can force OCR with force_ocr: true.
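
The spec does not define the heuristic, so the following is only one plausible shape for it: require a minimum amount of text per page and a high share of printable characters. Both thresholds and both function names are assumptions. The per-page strings would come from the text layer, for example via pypdf's page.extract_text().

```python
def text_layer_usable(
    page_texts: list[str],
    min_chars_per_page: int = 50,        # assumed threshold
    min_printable_ratio: float = 0.9,    # assumed threshold
) -> bool:
    """Heuristic: every page has enough text and it is mostly printable."""
    if not page_texts:
        return False
    for text in page_texts:
        stripped = text.strip()
        if len(stripped) < min_chars_per_page:
            return False                 # empty or near-empty layer, likely a scan
        printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
        if printable / len(stripped) < min_printable_ratio:
            return False                 # garbage glyphs from a broken font mapping
    return True


def choose_path(page_texts: list[str], force_ocr: bool = False) -> str:
    """Return 'text_layer' or 'ocr'; force_ocr always wins."""
    if force_ocr:
        return "ocr"
    return "text_layer" if text_layer_usable(page_texts) else "ocr"
```

A stricter variant could fall back to OCR per page instead of per document, at the cost of mixing two text sources in one result.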

4.7 Christof's multilingual harness integration

The MultilingualHarnessBackend wraps Christof's existing self-improving OCR pipeline:

  • Internal models: CHURRO (page-level VLM), olmOCR-2 (page-level VLM)
  • Retrieval-augmented correction over a script-specific corpus
  • Kurrent + Sütterlin support for German historical documents
  • Latin / Arabic / Cyrillic script recognition

Configuration:

config.ocr.multilingual_harness_dir = Path("/srv/ocr-harness")
config.ocr.multilingual_max_pages_concurrent = 2

The harness is GPU-intensive. On CPU-only nodes, it deregisters itself at startup.


5. Storage and lifecycle

  • Input image/PDF: fetched from blob store via CID
  • Output text: optionally stored as a new blob (store_text: true)
  • Side-effect: ocr.document.indexed event in the community log (carries text_cid for downstream replication)

OCR backends do NOT cache results inside themselves. Reuse comes from caching at the RAG/blob layer (same doc_cid → already-extracted-text blob).


6. Errors

| Condition | Wire code |
| --- | --- |
| Unknown backend | not_found |
| Languages not supported by any backend | bad_request |
| Image too large (> max_image_pixels) | bad_request |
| Page-range exceeds document | bad_request |
| Page count > OCR_MAX_PAGES_PER_REQUEST | bad_request |
| Backend crash | internal_error |
| GPU OOM on multilingual | capacity_exceeded (with retry_after) |
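
The request-validation rows of this table can be sketched as a single pre-flight check that returns a wire code. The function name, its parameters, and the order of the checks are assumptions; only the conditions and codes come from the table.

```python
from __future__ import annotations

OCR_MAX_PAGES_PER_REQUEST = 50


def validate_pdf_request(
    backend: str | None,
    known_backends: set[str],
    page_range: tuple[int, int],     # assumed inclusive, 1-indexed
    page_count: int,
) -> str | None:
    """Return the wire error code for an invalid request, or None if it is valid."""
    if backend is not None and backend not in known_backends:
        return "not_found"           # unknown backend
    first, last = page_range
    if first < 1 or last < first or last > page_count:
        return "bad_request"         # page range exceeds document
    if last - first + 1 > OCR_MAX_PAGES_PER_REQUEST:
        return "bad_request"         # too many pages in one call
    return None
```

Runtime failures such as a backend crash or GPU OOM cannot be caught this way and would be mapped to internal_error or capacity_exceeded where the backend call is awaited.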

7. Configuration

config.ocr.enabled                 = True
config.ocr.backends = [
    OcrBackendConfig(name="tesseract", languages=["deu","eng","fra","lat"]),
    OcrBackendConfig(name="trocr", model="microsoft/trocr-large-handwritten"),
    OcrBackendConfig(name="multilingual", harness_dir=Path("/srv/ocr-harness")),
]
config.ocr.default_dpi             = OCR_DEFAULT_DPI    # 300
config.ocr.max_pages_per_request   = OCR_MAX_PAGES_PER_REQUEST
config.ocr.text_layer_first        = True

Constants: OCR_DEFAULT_DPI, OCR_MAX_PAGES_PER_REQUEST.


8. Tests

Unit

  • test_descriptor_schema_validates_meta_schema
  • test_params_compatible_language_subset
  • test_text_layer_short_circuits_when_present
  • test_force_ocr_bypasses_text_layer
  • test_low_quality_emits_warning_frame

Integration

  • test_tesseract_german_print (with a known sample)
  • test_trocr_handwriting_sample
  • test_multilingual_kurrent_sample (if harness installed)
  • test_rag_ingest_scanned_pdf_end_to_end
  • test_ocr_pdf_progress_frames

9. Cross-references

| What | Where |
| --- | --- |
| ocr.* wire | CAP2 §4.8–4.9 |
| Blob store dependency | M07 §3 |
| RAG integration | M05 §10 q4 (now resolved) |
| Vision overlap (Florence-2 OCR mode) | M20 §4.3 |
| ocr.document.indexed event | CAP2 §7.1 |
| Christof's harness | external project; this module is the integration surface |

10. Open questions

  1. Multilingual harness auto-update. The harness self-improves; should the model versions be event-logged so we can replay deterministically? Yes: record the harness version hash in each ocr.document.indexed event.
  2. Manuscript-quality preprocessing. Some historic documents need bespoke preprocessing (e.g. ink-bleed removal). Phase 2.5 might add a preprocess_profile enum.
  3. Reading order from layout. Currently we trust the backend's reading order. For multi-column documents, an explicit layout model (LayoutLMv3) might help. Phase 3.
  4. Streaming OCR for very large images. Currently atomic. Could tile and stream. Defer.