
AGENTS.md - XmLLM Project Rules

What this project is

XmLLM is a canonical-first document structure engine that converts images (and later PDFs) into structured ALTO XML and PAGE XML, via an internal canonical representation (CanonicalDocument).

Architecture invariants

These rules are non-negotiable. Any code change that violates them must be rejected.

1. Canonical-first

  • The system is never provider-first, ALTO-first, PAGE-first, or viewer-first.
  • All processing flows through CanonicalDocument.
  • ALTO XML and PAGE XML are output serializations, not internal representations.

2. Three internal objects

Object | Role | Never used for
RawProviderPayload | Source truth | Export, rendering
CanonicalDocument | Business truth | Direct UI rendering
ViewerProjection | Rendering truth | Validation, export decisions

3. Domain independence

  • Ring A (Anneau A, domain) must not depend on FastAPI, frontend, or OpenSeadragon.
  • Ring B (Anneau B, execution) may depend on domain.
  • Ring C (Anneau C, API) may depend on domain and execution.
  • Ring D (Anneau D, presentation) may depend on the API layer only.
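
One way to enforce the domain rule mechanically is an import check that a test can run over every file in the domain package. The sketch below is only an illustration: the function name `forbidden_imports` and the forbidden set are assumptions, not part of the project. It inspects source text with the standard `ast` module and never imports the code under test.

```python
import ast


def forbidden_imports(source: str, forbidden: frozenset[str]) -> list[str]:
    """Return every imported module whose top-level package is forbidden."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; treat it as empty.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] in forbidden:
                found.append(name)
    return found


# Hypothetical rule set for the domain ring: FastAPI must never appear.
DOMAIN_FORBIDDEN = frozenset({"fastapi"})

sample = "import json\nimport fastapi\nfrom fastapi.responses import JSONResponse\n"
violations = forbidden_imports(sample, DOMAIN_FORBIDDEN)
```

A test could read each domain file and assert that the returned list is empty.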

4. Provenance on every node

Every node in CanonicalDocument must carry a provenance record with the following fields:

  • provider: source engine name
  • adapter: adapter version
  • source_ref: path in the raw output
  • evidence_type: provider_native | derived | repaired | manual
  • derived_from: list of canonical IDs (empty if native)
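
The field list above could be modelled roughly as follows. This is a sketch under stated assumptions: it uses a standard-library dataclass to stay dependency-free, whereas the project mandates Pydantic v2 for real models, and the validation in `__post_init__` is an illustrative reading of "empty if native", not the project's actual code.

```python
from dataclasses import dataclass
from typing import Literal, get_args

EvidenceType = Literal["provider_native", "derived", "repaired", "manual"]


@dataclass(frozen=True)
class Provenance:
    """Illustrative provenance record; field names follow the list above."""

    provider: str        # source engine name
    adapter: str         # adapter version
    source_ref: str      # path in the raw output
    evidence_type: EvidenceType
    derived_from: tuple[str, ...] = ()  # canonical IDs, empty if native

    def __post_init__(self) -> None:
        if self.evidence_type not in get_args(EvidenceType):
            raise ValueError(f"unknown evidence_type: {self.evidence_type!r}")
        if self.evidence_type == "provider_native" and self.derived_from:
            raise ValueError("provider_native nodes must have empty derived_from")


# "example_ocr" and the source_ref path are made-up values for illustration.
native = Provenance("example_ocr", "1.0.0", "$.blocks[0].words[3]", "provider_native")
```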

5. Geometry conventions

  • bbox format: [x, y, width, height], always. x = left edge, y = top edge.
  • coordinate origin: top_left, always in internal representation.
  • unit: px, always in the canonical model.
  • polygon: list[tuple[float, float]] | None β€” optional, preserved when available.
  • Providers returning [x1, y1, x2, y2] must be converted in their adapter.
  • No serializer may perform implicit heavy geometry conversion. All geometry must be normalized before it reaches a serializer.
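
As an illustration of the adapter-side conversion, the following sketch turns a corner-based box into the canonical [x, y, width, height] form. The function name `xyxy_to_xywh` is hypothetical. Rejecting inverted corners instead of silently swapping them is a design assumption, chosen so that an adapter never invents geometry.

```python
BBox = tuple[float, float, float, float]  # canonical: [x, y, width, height]


def xyxy_to_xywh(box: tuple[float, float, float, float]) -> BBox:
    """Convert [x1, y1, x2, y2] (top-left origin, px) to [x, y, width, height]."""
    x1, y1, x2, y2 = box
    if x2 < x1 or y2 < y1:
        # Assumption: a malformed provider box is an error, not something to fix up.
        raise ValueError(f"inverted corners in provider box: {box}")
    return (x1, y1, x2 - x1, y2 - y1)


canonical = xyxy_to_xywh((10, 20, 110, 70))
```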

6. Serializer rules

Serializers (ALTO, PAGE) must not:

  • Call any model or provider
  • Reconstruct segmentation
  • Correct text
  • Invent coordinates
  • Make export eligibility decisions

They receive a validated CanonicalDocument and produce deterministic XML.
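
A serializer that follows these rules is a pure function from canonical data to XML. The sketch below shows that shape only. The `Word` class is a hypothetical stand-in for a canonical node, the ALTO namespace and page structure are omitted, and it uses the standard-library `xml.etree.ElementTree` to stay self-contained, whereas the project itself specifies lxml.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical canonical word: already validated, geometry already normalized."""

    id: str
    text: str
    bbox: tuple[int, int, int, int]  # [x, y, width, height], px, top-left origin


def word_to_alto_string(word: Word) -> ET.Element:
    # Coordinates are copied as-is: no conversion, no repair, no guessing.
    x, y, width, height = word.bbox
    return ET.Element(
        "String",
        {
            "ID": word.id,
            "HPOS": str(x),
            "VPOS": str(y),
            "WIDTH": str(width),
            "HEIGHT": str(height),
            "CONTENT": word.text,
        },
    )


def serialize_line(words: list[Word]) -> str:
    """Serialize words into one ALTO-style TextLine, in input order."""
    line = ET.Element("TextLine")
    for word in words:
        line.append(word_to_alto_string(word))
    return ET.tostring(line, encoding="unicode")


words = [Word("w1", "Hello", (10, 20, 100, 50))]
xml_text = serialize_line(words)
```

Because the function reads nothing but its input and preserves element and attribute order, the same document always yields the same XML string.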

7. Export eligibility

  • Every export decision must pass through validators + document policy.
  • A serializer is never called when export eligibility is none.
  • ALTO and PAGE eligibility are computed independently.
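
The gating described above might look like the sketch below, assuming eligibility has already been computed per format by the validators and document policy. The function name `export_eligible` and the level `"full"` are hypothetical; the source names only `none`.

```python
from collections.abc import Callable, Mapping
from typing import TypeVar

Doc = TypeVar("Doc")


def export_eligible(
    doc: Doc,
    eligibility: Mapping[str, str],
    serializers: Mapping[str, Callable[[Doc], str]],
) -> dict[str, str]:
    """Run only the serializers whose format is eligible for export."""
    outputs: dict[str, str] = {}
    for fmt, serialize in serializers.items():
        # A missing entry is treated as "none": the serializer is never called.
        if eligibility.get(fmt, "none") == "none":
            continue
        outputs[fmt] = serialize(doc)
    return outputs
```

Each format is looked up separately, so a document can be exported as ALTO while PAGE is refused, or the reverse.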

8. Enricher rules

Every enricher must:

  • Set provenance.evidence_type to derived or inferred
  • Update geometry.status if geometry was modified
  • Add warnings to the document when appropriate
  • Respect the active DocumentPolicy

Enrichers must not:

  • Hallucinate text
  • Invent fine-grained coordinates without geometric basis
  • Claim provider_native evidence

9. Viewer rules

  • The viewer never parses ALTO or PAGE XML.
  • It works exclusively from ViewerProjection.
  • No business logic in the presentation layer.

10. Single codebase

The same code runs:

  • Locally (bare Python)
  • In Docker
  • On Hugging Face Spaces

The only differences are the STORAGE_ROOT setting and environment detection.

Bbox containment tolerance

Bbox containment checks (child within parent) use a configurable tolerance (default: 5px). This is set via BBOX_CONTAINMENT_TOLERANCE in the environment.

Coding conventions

  • Python 3.11+
  • Pydantic v2 for all models
  • lxml for XML serialization
  • Type hints everywhere
  • No Any types in domain models
  • Tests alongside implementation: no sprint ships without tests