HearthNet-Nemotron

HearthNet-Nemotron / docs /p2_p3 /M17-ocr.md

Chris4K

p2, p3

70650b7 18 days ago

M17 — OCR Service

Spec version: v1.0 (Phase 2)
Depends on: M03 (bus), M07 (blobs, for reading image/PDF inputs and storing extracted text), M11 (embedding, when integrating with M05 RAG), X04 (config), X03 (observability)
Depended on by: M05 RAG (ingest of scanned PDFs), M20 vision (img.describe can fall back to OCR for text-heavy images)

1. Responsibility

Provide ocr.image@1.0 and ocr.pdf@1.0. Wrap several OCR backends so the bus can route between them by document type and language. The service is specifically engineered to handle:

Modern German printed text (Tesseract)
Handwriting (TrOCR / Florence-2 in OCR mode)
Historical scripts — Sütterlin, Kurrent, Latin, Arabic, Cyrillic (Christof's multilingual harness)
Mixed-language documents (auto-detection)

2. File layout

hearthnet/services/ocr/
├── __init__.py
├── service.py            # OcrService
└── backends/
    ├── __init__.py
    ├── base.py           # OcrBackend Protocol
    ├── tesseract.py      # Tesseract via pytesseract
    ├── trocr.py          # Microsoft TrOCR via transformers
    ├── multilingual.py   # Christof's self-improving harness (CHURRO, olmOCR-2)
    └── florence_ocr.py   # Florence-2 OCR mode (overlap with M20)

3. Public API

3.1 `backends/base.py`

# hearthnet/services/ocr/backends/base.py
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class OcrBlock:
    text:        str
    bbox:        tuple[int, int, int, int]   # (x, y, w, h) in pixel coords
    confidence:  float                       # 0..1
    language:    str | None

@dataclass(frozen=True)
class OcrPageResult:
    page:             int                    # 1-indexed
    text:             str                    # concatenated, reading order
    blocks:           list[OcrBlock]
    languages:        list[str]              # detected, ordered by prevalence
    confidence_mean:  float
    ms:               int

class OcrBackend(Protocol):
    name:               str       # "tesseract" | "trocr" | "multilingual" | "florence_ocr"
    languages_supported: list[str] # ISO 639-2 codes: "deu","eng","lat","ara","rus", ...
    supports_handwriting: bool
    max_image_pixels:   int

    async def warm(self) -> None: ...
    async def close(self) -> None: ...

    async def ocr_image(
        self,
        image_bytes: bytes,
        *,
        languages: list[str] | None,        # None → auto-detect
        preprocess: dict | None = None,     # options listed in §4.2 (deskew, denoise, binarize, dpi, ...)
    ) -> OcrPageResult: ...

    async def ocr_pdf_page(
        self,
        pdf_bytes: bytes,
        *,
        page: int,
        languages: list[str] | None,
        preprocess: dict | None = None,
    ) -> OcrPageResult: ...

    def health(self) -> dict: ...

3.2 Concrete backends

| File | Class | Notes |
| --- | --- | --- |
| `backends/tesseract.py` | `TesseractBackend(min_confidence: float = 0.5)` | Languages: any installed traineddata. Subprocess via pytesseract. |
| `backends/trocr.py` | `TrocrBackend(model: str = "microsoft/trocr-large-handwritten", device: str = "auto")` | Handwriting; CUDA preferred. |
| `backends/multilingual.py` | `MultilingualHarnessBackend(harness_dir: Path, model: str = "self-improving-ocr-v1", device: str = "auto")` | Christof's harness (CHURRO, olmOCR-2, retrieval-augmented correction, Kurrent/Sütterlin/Latin/Arabic/Cyrillic). Configured via `harness_dir`, which has no default and therefore comes first. |
| `backends/florence_ocr.py` | `FlorenceOcrBackend(model: str = "microsoft/Florence-2-large")` | Reuses M20 vision backend in OCR mode. |

MultilingualHarnessBackend is the headline integration for Christof's existing work. It exposes the same OcrBackend interface and lets the harness's internal page-level VLMs do the heavy lifting.

3.3 `service.py`

# hearthnet/services/ocr/service.py
from collections.abc import AsyncIterator, Callable
class OcrService:
    name    = "ocr"
    version = "1.0"

    def __init__(self, config: OcrConfig, blob_store: BlobStore, event_log: EventLog):
        self._backends: dict[str, OcrBackend] = self._build_backends(config)

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One ocr.image entry per backend; one ocr.pdf entry per backend.
        params include backend name and supported languages."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    # --- handlers ---

    async def handle_image(self, req: RouteRequest) -> dict:
        """CAP2 §4.8.
        1. Resolve image_cid via blob_store
        2. Pick backend from params.backend
        3. Run; build response"""

    async def handle_pdf(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.9.
        1. Resolve doc_cid
        2. For each page in page_range:
           emit 'progress' frame
           emit 'page' frame
        3. If store_text:true, write concatenated text as new blob; emit done with text_cid"""

3.4 `params_compatible` predicate

def params_compatible(offered: dict, requested: dict) -> bool:
    # backend must match if specified
    if "backend" in requested and requested["backend"] != offered.get("backend"):
        return False
    # languages omitted, None, or ["auto"] means auto-detect (§4.1): nothing to check
    requested_langs = set(requested.get("languages") or []) - {"auto"}
    # all explicitly requested languages must be supported by this backend
    offered_langs   = set(offered.get("languages_supported", []))
    return requested_langs.issubset(offered_langs)

4. Behaviour

4.1 Auto-language detection

If languages is omitted or set to ["auto"]:

1. Sample 3 random pages
2. Run lightweight script detection (Tesseract OSD, orientation and script detection) on each sample
3. Choose the top 2 scripts
4. Run OCR with the language set that corresponds to those scripts

Backends that don't support OSD fall back to a fixed default language set (configured per backend).
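
The selection step can be illustrated with a small pure-Python sketch. It assumes that script detection has already produced one `(script, confidence)` pair per sampled page; the `SCRIPT_TO_LANGUAGES` table and the function name `choose_languages` are hypothetical and stand in for whatever mapping the service actually configures.

```python
from collections import defaultdict

# Hypothetical mapping from detected script names to language codes.
# The language lists are placeholders, not defaults defined by this spec.
SCRIPT_TO_LANGUAGES: dict[str, list[str]] = {
    "Latin": ["deu", "eng", "lat"],
    "Cyrillic": ["rus"],
    "Arabic": ["ara"],
}


def choose_languages(
    detections: list[tuple[str, float]],   # (script, confidence), one per sampled page
    top_n: int = 2,
) -> list[str]:
    """Sum confidence per script, keep the top_n scripts, map them to languages."""
    score: dict[str, float] = defaultdict(float)
    for script, confidence in detections:
        score[script] += confidence
    ranked = sorted(score, key=lambda s: score[s], reverse=True)[:top_n]
    languages: list[str] = []
    for script in ranked:
        for lang in SCRIPT_TO_LANGUAGES.get(script, []):
            if lang not in languages:      # keep order, drop duplicates
                languages.append(lang)
    return languages
```

An empty result (no samples, or only unmapped scripts) would be the point at which a backend falls back to its configured default.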

4.2 Preprocessing pipeline

preprocess dict supports:

deskew: bool — straighten image
denoise: bool — bilateral filter
binarize: bool — Otsu threshold
dpi: int — target resolution; upscale if lower
contrast_normalise: bool

Default: {"deskew": true, "denoise": false}. Heavy preprocessing noticeably slows ingest, so enable the extra steps per document rather than globally.

4.3 Quality estimation

Each page result reports confidence_mean. Below 0.6, the service emits a low_quality warning frame and recommends:

Trying a different backend (e.g. switch from Tesseract to multilingual harness for historic text)
Raising DPI
Re-scanning

4.4 Integration with RAG

M05 §10 open question 4 is now answered:

RagService.handle_ingest receives a scanned PDF (mime_type=image/scanned-pdf or detected)
  → bus.call("ocr.pdf", (1,0), {input:{doc_cid:..., store_text:true}})
  → receive text_cid
  → ingest the text_cid blob (which is now extracted plaintext) as normal
  → emit rag.document.ingested event (with metadata noting ocr_backend used)

The OCR text is stored as a separate, content-addressed blob. Identical extracted text always maps to the same CID, so re-ingestion is idempotent.
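
The ingest flow above might look roughly like this from the RAG side. The sketch assumes that `bus.call` returns an async iterator of frames and that the final frame has `type == "done"` and carries `text_cid` and `backend`; `FakeBus` and `ocr_then_ingest` are hypothetical stand-ins, not part of M03 or M05.

```python
import asyncio
from collections.abc import AsyncIterator


class FakeBus:
    """Stand-in for the M03 bus; only the call shape from the flow above is modelled."""

    def call(self, capability: str, version: tuple[int, int], payload: dict) -> AsyncIterator[dict]:
        async def frames() -> AsyncIterator[dict]:
            yield {"type": "page", "page": 1, "text": "Hallo Welt"}
            yield {"type": "done", "text_cid": "cid-of-extracted-text", "backend": "tesseract"}

        return frames()


async def ocr_then_ingest(bus, doc_cid: str) -> dict:
    """Run ocr.pdf with store_text and return what the ingest step would need."""
    request = {"input": {"doc_cid": doc_cid, "store_text": True}}
    text_cid = None
    backend = None
    async for frame in bus.call("ocr.pdf", (1, 0), request):
        if frame.get("type") == "done":
            text_cid = frame.get("text_cid")
            backend = frame.get("backend")
    if text_cid is None:
        raise RuntimeError("ocr.pdf finished without a text_cid")
    # The plaintext blob is ingested as normal; the OCR backend is noted as metadata.
    return {"ingest_cid": text_cid, "metadata": {"source_cid": doc_cid, "ocr_backend": backend}}
```

Calling it twice with the same `doc_cid` would produce the same `ingest_cid` as long as the extracted text is unchanged, which is what makes the re-ingestion path safe.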

4.5 Page-range and parallelism

page_range: [1, 50] lets callers process partial documents. Pages are OCR'd serially within one call. For very large PDFs, callers should split the document into ranges and issue the calls concurrently; the bus enforces per-capability concurrency limits.

OCR_MAX_PAGES_PER_REQUEST = 50 is the hard ceiling per call.
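
A caller-side sketch of the splitting strategy is shown below. It assumes `page_range` is inclusive and 1-indexed; `split_page_range`, `ocr_in_chunks` and the `call_ocr_pdf` callable are hypothetical helper names, with `call_ocr_pdf` standing in for one bus call per range.

```python
import asyncio
from collections.abc import Awaitable, Callable

OCR_MAX_PAGES_PER_REQUEST = 50  # hard ceiling per call, as stated above


def split_page_range(
    first: int, last: int, max_pages: int = OCR_MAX_PAGES_PER_REQUEST
) -> list[tuple[int, int]]:
    """Split an inclusive page range into consecutive ranges of at most max_pages."""
    if first < 1 or last < first:
        raise ValueError("page range must satisfy 1 <= first <= last")
    return [
        (start, min(start + max_pages - 1, last))
        for start in range(first, last + 1, max_pages)
    ]


async def ocr_in_chunks(
    first: int,
    last: int,
    call_ocr_pdf: Callable[[tuple[int, int]], Awaitable[str]],  # hypothetical: one call per range
) -> list[str]:
    """Issue one call per chunk concurrently; results come back in range order."""
    chunks = split_page_range(first, last)
    return list(await asyncio.gather(*(call_ocr_pdf(chunk) for chunk in chunks)))
```

A 120-page document splits into `(1, 50)`, `(51, 100)` and `(101, 120)`; how many of those calls actually run at once is still decided by the bus.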

4.6 PDF text-layer detection

Before running OCR, the service checks whether the PDF has an extractable text layer (via pypdf). If it does and a heuristic judges the extracted text usable, the service returns the text-layer content directly, which is much cheaper than OCR. Callers can force OCR with force_ocr: true.
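
The spec does not define the heuristic, so the following is only one plausible shape for it. It assumes the per-page strings were already obtained elsewhere (for instance from pypdf's `PdfReader` and `extract_text()`); the function name `text_layer_usable` and both thresholds are invented for this example.

```python
def text_layer_usable(
    page_texts: list[str],
    min_chars_per_page: int = 50,        # hypothetical threshold
    min_clean_ratio: float = 0.9,        # hypothetical threshold
) -> bool:
    """Accept the text layer only if every page has enough mostly-clean text."""
    if not page_texts:
        return False
    for text in page_texts:
        stripped = text.strip()
        if len(stripped) < min_chars_per_page:
            return False                 # empty or near-empty page: likely a scan
        clean = sum(
            1
            for ch in stripped
            # U+FFFD marks undecodable glyphs; control characters also count as noise.
            if ch != "\ufffd" and (ch.isprintable() or ch in "\n\t")
        )
        if clean / len(stripped) < min_clean_ratio:
            return False
    return True
```

With `force_ocr: true` a check like this would simply be skipped.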

4.7 Christof's multilingual harness integration

The MultilingualHarnessBackend wraps Christof's existing self-improving OCR pipeline:

Internal models: CHURRO (page-level VLM), olmOCR-2 (page-level VLM)
Retrieval-augmented correction over a script-specific corpus
Kurrent + Sütterlin support for German historical documents
Latin / Arabic / Cyrillic script recognition

Configuration:

config.ocr.multilingual_harness_dir = Path("/srv/ocr-harness")
config.ocr.multilingual_max_pages_concurrent = 2

The harness is GPU-intensive. On CPU-only nodes, the backend deregisters itself at startup.
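
A limit such as `multilingual_max_pages_concurrent = 2` could be enforced with a semaphore, as in this sketch. How the real backend applies the setting is not specified here; `run_pages_bounded` and the `ocr_page` callable are hypothetical names.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def run_pages_bounded(
    pages: list[int],
    ocr_page: Callable[[int], Awaitable[str]],   # hypothetical: one harness page job
    max_concurrent: int = 2,                     # mirrors multilingual_max_pages_concurrent
) -> list[str]:
    """Run page jobs with at most max_concurrent of them in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(page: int) -> str:
        async with semaphore:                    # waits while the limit is reached
            return await ocr_page(page)

    # gather preserves the input order of the results.
    return list(await asyncio.gather(*(bounded(p) for p in pages)))
```

Keeping the limit inside the backend protects GPU memory even when several bus calls reach the harness at the same time.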

5. Storage and lifecycle

Input image/PDF: fetched from blob store via CID
Output text: optionally stored as a new blob (store_text: true)
Side-effect: ocr.document.indexed event in the community log (carries text_cid for downstream replication)

OCR backends do NOT cache results inside themselves. Reuse comes from caching at the RAG/blob layer (same doc_cid → already-extracted-text blob).

6. Errors

| Condition | Wire code |
| --- | --- |
| Unknown backend | `not_found` |
| Languages not supported by any backend | `bad_request` |
| Image too large (> max_image_pixels) | `bad_request` |
| Page range exceeds document | `bad_request` |
| Page range longer than OCR_MAX_PAGES_PER_REQUEST | `bad_request` |
| Backend crash | `internal_error` |
| GPU OOM on multilingual | `capacity_exceeded` (with retry_after) |

7. Configuration

config.ocr.enabled                 = True
config.ocr.backends = [
    OcrBackendConfig(name="tesseract", languages=["deu","eng","fra","lat"]),
    OcrBackendConfig(name="trocr", model="microsoft/trocr-large-handwritten"),
    OcrBackendConfig(name="multilingual", harness_dir=Path("/srv/ocr-harness")),
]
config.ocr.default_dpi             = OCR_DEFAULT_DPI    # 300
config.ocr.max_pages_per_request   = OCR_MAX_PAGES_PER_REQUEST
config.ocr.text_layer_first        = True

Constants: OCR_DEFAULT_DPI (300), OCR_MAX_PAGES_PER_REQUEST (50).

8. Tests

Unit

test_descriptor_schema_validates_meta_schema
test_params_compatible_language_subset
test_text_layer_short_circuits_when_present
test_force_ocr_bypasses_text_layer
test_low_quality_emits_warning_frame

Integration

test_tesseract_german_print (with a known sample)
test_trocr_handwriting_sample
test_multilingual_kurrent_sample (if harness installed)
test_rag_ingest_scanned_pdf_end_to_end
test_ocr_pdf_progress_frames

9. Cross-references

| What | Where |
| --- | --- |
| `ocr.*` wire | CAP2 §4.8–4.9 |
| Blob store dependency | M07 §3 |
| RAG integration | M05 §10 q4 (now resolved) |
| Vision overlap (Florence-2 OCR mode) | M20 §4.3 |
| `ocr.document.indexed` event | CAP2 §7.1 |
| Christof's harness | external project; this module is the integration surface |

10. Open questions

Multilingual harness auto-update. The harness self-improves; should the model versions be event-logged so we can replay deterministically? Yes — record the harness version hash in each ocr.document.indexed event.
Manuscript-quality preprocessing. Some historic documents need bespoke preprocessing (e.g. ink-bleed removal). Phase 2.5 might add a preprocess_profile enum.
Reading order from layout. Currently we trust the backend's reading order. For multi-column documents, an explicit layout model (LayoutLMv3) might help. Phase 3.
Streaming OCR for very large images. Currently atomic. Could tile and stream. Defer.