ERP-DocIQ


kenmandal committed 7 days ago

Commit 082d661 (verified) · 1 Parent(s): ae053b7

Deploy latest: ERP DocIQ NLQ chatbot + reasoning models (MiniCPM3-4B/Command R7B) + ERP fine-tuning + extreme OCR docs

Files changed (46)

.gitattributes +3 -0
README.md +30 -16
backend/app/auth.py +4 -5
backend/app/config.py +16 -0
backend/app/erp/__init__.py +11 -0
backend/app/erp/chat.py +347 -0
backend/app/erp/data.py +275 -0
backend/app/erp/finetune.py +259 -0
backend/app/extraction_heuristics.py +6 -1
backend/app/main.py +92 -0
backend/app/models_registry.py +102 -0
backend/app/ocr/quality.py +207 -0
backend/app/pipeline/nodes.py +27 -3
backend/app/prompts/__init__.py +10 -2
backend/app/providers/blackforest.py +70 -0
backend/evals/datasets/extreme_contract_fax.gt.json +20 -0
backend/evals/datasets/extreme_contract_fax.png +3 -0
backend/evals/datasets/extreme_contract_fax.txt +27 -0
backend/evals/datasets/extreme_po_collage.gt.json +40 -0
backend/evals/datasets/extreme_po_collage.png +3 -0
backend/evals/datasets/extreme_po_collage.txt +20 -0
backend/evals/datasets/extreme_receipt_photo.gt.json +36 -0
backend/evals/datasets/extreme_receipt_photo.png +3 -0
backend/evals/datasets/extreme_receipt_photo.txt +17 -0
backend/evals/ocr_backend_report.json +135 -0
backend/evals/ocr_quality_report.json +313 -0
backend/evals/report.json +980 -0
backend/evals/run.py +147 -0
backend/evals/scorers.py +166 -0
backend/finetune/erp_finetune_report.json +106 -0
backend/finetune/erp_sft.jsonl +120 -0
backend/finetune/runs/hf_20260612T212346.json +120 -0
backend/finetune/runs/local_20260612T212257.json +108 -0
backend/finetune/runs/local_20260612T212332.json +106 -0
backend/finetune/runs/local_20260612T212357.json +108 -0
backend/finetune/runs/local_20260612T212413.json +106 -0
gradio_app.py +44 -0
results/erp_finetune_report.json +106 -0
results/erp_sft.jsonl +120 -0
results/ocr_quality_report.json +313 -0
scripts/finetune_erp.py +153 -0
scripts/generate_extreme_docs.py +421 -0
scripts/ocr_quality.py +67 -0
scripts/ocr_smoke.py +54 -0
scripts/run_dev.sh +35 -0
scripts/test_ocr.py +57 -0

.gitattributes CHANGED

@@ -34,3 +34,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 backend/evals/datasets/complex_invoice_messy.png filter=lfs diff=lfs merge=lfs -text

+backend/evals/datasets/extreme_contract_fax.png filter=lfs diff=lfs merge=lfs -text
+backend/evals/datasets/extreme_po_collage.png filter=lfs diff=lfs merge=lfs -text
+backend/evals/datasets/extreme_receipt_photo.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -8,28 +8,42 @@ sdk_version: 6.9.0
 app_file: gradio_app.py
 pinned: false
 license: mit
-short_description: Agentic OCR + IDP for retail docs (MiniCPM-V 8B, Tesseract)
 ---
-# ERP-DocIQ — Agentic OCR + Document Intelligence for Retail Back-Office
-Open-source **Intelligent Document Processing** for orders, receipts, invoices, contracts and
-subscription memos — a UiPath-style IDP rebuilt on **small models** and pluggable OCR.
-- **Pluggable OCR backends:** **MiniCPM-V-4.6** (≤32B small VLM, via the OpenBMB/ModelBest API),
-  **Tesseract** (real OCR, via `packages.txt`), and an offline sidecar fallback — auto-fallback chain.
-- **Hybrid pipeline:** OCR → classify → extract → normalize → enrich (RAG) → validate → post/HITL.
-- **Real vector RAG** (persistent), business KPIs, and a built-in **OCR self-test** that runs each
-  backend on real scanned images.
 ## Use it
-1. Open the **Process a document** tab.
-2. Pick an **OCR backend** (`auto`, `minicpm`, `tesseract`, …) and a sample, or upload your own PDF/PNG.
-3. See the extracted multi-layer fields + live KPIs. The **Search (RAG)** tab does semantic search.
 ## Configure (Space → Settings → Variables and secrets)
 - `MINICPM_BASE_URL=https://api.modelbest.cn/v1`, `MINICPM_API_KEY=…`, `MINICPM_MODEL=MiniCPM-V-4.6-Instruct`
-- Tesseract works out of the box (installed via `packages.txt`).
-Built for the **Build Small Hackathon** (small models, on Gradio). Uses MiniCPM-V (~8B) as the
-load-bearing OCR model.

 app_file: gradio_app.py
 pinned: false
 license: mit
+short_description: OCR/IDP + ERP NLQ chatbot on small models (MiniCPM)
 ---
+# ERP-DocIQ — Agentic Document Intelligence + ERP NLQ, on small models
+An open-source, UiPath-style back-office automation stack built entirely on **small models
+(≤32B)** — for the **Build Small Hackathon**. Three things, one app:
+1. **Read any document (OCR + IDP).** Hybrid pipeline (OCR → classify → extract → normalize →
+   enrich/RAG → validate → post/HITL) reads orders, receipts, invoices, contracts and complex
+   forms — even messy scans — with **OpenBMB MiniCPM-V-4.6** (≤32B VLM) and **Tesseract**.
+2. **Ask your ERP reports (ERP DocIQ).** A chatbot over a simulated retail ERP knowledgebase
+   (vendors · POs · invoices · GL · inventory · returns). Natural-language **NLQ → SQL**,
+   analytics, summaries and **"why"** reasoning — every figure comes from **real SQL over the
+   data**; **OpenBMB MiniCPM3-4B** only phrases the answer, it never invents numbers.
+3. **Adapt to your domain (fine-tuning).** A LoRA recipe fine-tunes **MiniCPM3-4B** on an ERP
+   instruction dataset; an offline CPU demo trains the NLQ-routing head on the same data with a
+   real before→after gain (**8.3% → 91.7%**). See `results/erp_finetune_report.json`.
+## Small models (≤32B) — by job
+| Lab | Models | Role |
+|---|---|---|
+| **OpenBMB** | MiniCPM-V-4.6 · MiniCPM-o-4.5 · **MiniCPM3-4B** | OCR/VLM · **reasoning · NLQ→SQL · fine-tune target** |
+| **Cohere** | Aya-Vision-8B/32B · **Command R7B** | OCR/VQA · **RAG · NLQ · reasoning** |
+| **Black Forest Labs** | FLUX.1 [dev]/[schnell] | image generation → synthetic test docs (not OCR) |
 ## Use it
+- **Process a document** — pick an OCR backend (`auto`, `minicpm`, `tesseract`) + a sample (or upload), see multi-layer extracted fields + KPIs.
+- **ERP DocIQ (chat)** — ask "Why did spend rise in Q2 2026?", "Top vendors by spend", "late-payment rate"; see the grounded answer, SQL, and the fine-tuning panel.
+- **Search (RAG)** — semantic vendor-master retrieval. **Web Automation** — multi-step browser flow.
+## Published results (`results/`)
+- `ocr_quality_report.json` — OCR CER/WER + field accuracy (MiniCPM-V **CER 2.6%** vs Tesseract 14.7%).
+- `erp_finetune_report.json` + `erp_sft.jsonl` — fine-tune metrics + the instruction dataset.
 ## Configure (Space → Settings → Variables and secrets)
 - `MINICPM_BASE_URL=https://api.modelbest.cn/v1`, `MINICPM_API_KEY=…`, `MINICPM_MODEL=MiniCPM-V-4.6-Instruct`
+- Tesseract is installed via `packages.txt`. **Without a key the app still runs** — ERP DocIQ uses
+  its deterministic SQL engine and OCR falls back to the sidecar, so every tab works offline.
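
Note on the CER/WER figures cited in the README: the scorer itself (`backend/evals/scorers.py`) is not shown in this excerpt, so the following is only a sketch of the conventional definitions, assuming error rate = edit distance divided by reference length. The function names `levenshtein`, `cer` and `wer` are illustrative and not taken from the repository.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        cur = [i]
        for j, item_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                        # delete item_a
                cur[j - 1] + 1,                     # insert item_b
                prev[j - 1] + (item_a != item_b),   # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits per reference character."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits per reference word (whitespace tokenised)."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return levenshtein(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    # Typical OCR confusion: letter O and digit 0 swapped.
    print(round(cer("INVOICE 1024", "INV0ICE 1O24"), 4))
```

A lower CER means fewer character-level edits are needed to turn the OCR output into the ground-truth text, which is how the MiniCPM-V versus Tesseract comparison above would be read under this definition.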

backend/app/auth.py CHANGED

@@ -36,10 +36,9 @@ def make_auth_middleware(user: str, pwd: str):
             return await call_next(request)
         if _check(request.headers.get("authorization"), user, pwd):
             return await call_next(request)
-        return JSONResponse(
-            {"detail": "Authentication required"},
-            status_code=401,
-            headers={"WWW-Authenticate": 'Basic realm="Aperture"'},
-        )
     return auth_middleware

             return await call_next(request)
         if _check(request.headers.get("authorization"), user, pwd):
             return await call_next(request)
+        # NOTE: deliberately NO `WWW-Authenticate: Basic` header — that triggers the
+        # browser's native credential popup. The React SPA handles 401 itself and
+        # shows its own login screen, so we return a plain JSON 401.
+        return JSONResponse({"detail": "Authentication required"}, status_code=401)
     return auth_middleware
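
Note on the credential check: the middleware calls `_check(header, user, pwd)`, whose body is outside this hunk. As a hedged sketch of what such a check usually involves (parse the `Basic` scheme, base64-decode, compare in constant time), something like the following would work. `check_basic_auth` is a hypothetical name, and this is not necessarily how the repository implements `_check`.

```python
import base64
import hmac
from typing import Optional


def check_basic_auth(header: Optional[str], user: str, pwd: str) -> bool:
    """Return True only for a well-formed `Basic` header with matching credentials."""
    if not header:
        return False
    scheme, _, token = header.partition(" ")
    if scheme.lower() != "basic" or not token.strip():
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet.
        decoded = base64.b64decode(token.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return False
    got_user, sep, got_pwd = decoded.partition(":")
    if not sep:
        return False
    # Constant-time comparison; evaluate both so timing does not reveal which failed.
    user_ok = hmac.compare_digest(got_user.encode(), user.encode())
    pwd_ok = hmac.compare_digest(got_pwd.encode(), pwd.encode())
    return user_ok and pwd_ok


if __name__ == "__main__":
    good = "Basic " + base64.b64encode(b"admin:s3cret").decode()
    print(check_basic_auth(good, "admin", "s3cret"))
```

Because the 401 response no longer carries `WWW-Authenticate`, the client (here the SPA) is responsible for building this header itself after showing its own login form.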

backend/app/config.py CHANGED

@@ -131,6 +131,17 @@ class Settings:
         self.llama_cloud_api_key = os.getenv("LLAMA_CLOUD_API_KEY") or None
         self.llamaparse_result_type = os.getenv("LLAMAPARSE_RESULT_TYPE", "markdown")
         # --- databases --------------------------------------------------------
         appdb = os.getenv("APP_DB_PATH")
         self.app_db_path = (
@@ -142,6 +153,11 @@ class Settings:
             (Path(ragdb) if Path(ragdb).is_absolute() else BACKEND_DIR / ragdb)
             if ragdb else self.writable_dir / "rag.db"
         )
         # --- browser ---
         self.playwright_headless = _bool("PLAYWRIGHT_HEADLESS", True)

         self.llama_cloud_api_key = os.getenv("LLAMA_CLOUD_API_KEY") or None
         self.llamaparse_result_type = os.getenv("LLAMAPARSE_RESULT_TYPE", "markdown")
+        # --- model labs (all ≤32B params — "small models") --------------------
+        # OpenBMB (MiniCPM family) — text/vision reasoning + OCR (via MINICPM_* above)
+        self.openbmb_model = os.getenv("OPENBMB_MODEL", self.minicpm_model)
+        # OpenBMB MiniCPM3-4B — text reasoning / NLQ→SQL / summarization (ERP DocIQ + fine-tune target)
+        self.openbmb_reasoner_model = os.getenv("OPENBMB_REASONER_MODEL", "MiniCPM3-4B")
+        # Black Forest Labs (FLUX) — image GENERATION for synthetic test documents
+        self.bfl_api_key = os.getenv("BFL_API_KEY") or None
+        self.bfl_model = os.getenv("BFL_MODEL", "flux-dev")  # api: flux-dev | flux-pro-1.1 | flux-schnell
+        # Cohere hosted API (in addition to the local HF Aya-Vision backend above)
+        self.cohere_api_key = os.getenv("COHERE_API_KEY") or None
         # --- databases --------------------------------------------------------
         appdb = os.getenv("APP_DB_PATH")
         self.app_db_path = (
             (Path(ragdb) if Path(ragdb).is_absolute() else BACKEND_DIR / ragdb)
             if ragdb else self.writable_dir / "rag.db"
         )
+        erpdb = os.getenv("ERP_DB_PATH")
+        self.erp_db_path = (
+            (Path(erpdb) if Path(erpdb).is_absolute() else BACKEND_DIR / erpdb)
+            if erpdb else self.writable_dir / "erp.db"
+        )
         # --- browser ---
         self.playwright_headless = _bool("PLAYWRIGHT_HEADLESS", True)
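
Note on the path handling: `ERP_DB_PATH` follows the same rule as the existing `rag.db` setting. An absolute value is used as-is, a relative value is resolved against `BACKEND_DIR`, and an unset or empty value falls back to a file under the writable directory. A minimal standalone sketch of that rule follows; `resolve_db_path` is a hypothetical helper that the diff does not define, and the directories in the demo are made up.

```python
from pathlib import Path
from typing import Optional


def resolve_db_path(raw: Optional[str], backend_dir: Path,
                    writable_dir: Path, default_name: str) -> Path:
    """Mirror the Settings logic: absolute > relative-to-backend > default file."""
    if not raw:  # unset or empty env var
        return writable_dir / default_name
    candidate = Path(raw)
    return candidate if candidate.is_absolute() else backend_dir / candidate


if __name__ == "__main__":
    backend = Path("backend")
    writable = Path("data")
    print(resolve_db_path(None, backend, writable, "erp.db"))
    print(resolve_db_path("var/erp.db", backend, writable, "erp.db"))
```

Factoring the rule into one helper like this would also remove the three near-identical conditional expressions for the app, RAG and ERP databases, though that refactor is a suggestion rather than part of the commit.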

backend/app/erp/__init__.py ADDED

	@@ -0,0 +1,11 @@

+"""Simulated ERP knowledgebase + the ERP DocIQ chatbot (NLQ, summary, reasons, analytics).
+A deterministic, offline-first stand-in for a real retail ERP (SAP/Oracle/NetSuite).
+`data.py` seeds a realistic SQLite warehouse; `chat.py` answers natural-language
+questions over it (text-to-SQL NLQ + analytics + summarization), routed to a small
+reasoning model (OpenBMB MiniCPM3-4B) with a deterministic offline fallback.
+"""
+from .data import ErpWarehouse, ERP_SCHEMA_DOC, get_warehouse
+from .chat import ErpChat, answer_question
+__all__ = ["ErpWarehouse", "ERP_SCHEMA_DOC", "get_warehouse", "ErpChat", "answer_question"]
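
Note on running model-generated SQL: since the chatbot executes text-to-SQL output against the warehouse, the query surface needs a read-only guard. The body of `ErpWarehouse.query` is not visible in this excerpt, so the sketch below is an assumption about one reasonable guard and not the repository's code. It combines an allow-list check on the statement with SQLite's `PRAGMA query_only`. The name `guarded_query` and the one-row `vendors` table are illustrative.

```python
import re
import sqlite3

# Statement must start with SELECT or WITH (common table expression).
_READ_ONLY = re.compile(r"^\s*(select|with)\b", re.IGNORECASE)


def guarded_query(conn: sqlite3.Connection, sql: str, max_rows: int = 500):
    """Run a single read-only statement and return (columns, rows)."""
    stmt = sql.strip().rstrip(";").strip()
    if not _READ_ONLY.match(stmt):
        raise ValueError("only SELECT/WITH statements are allowed")
    if ";" in stmt:
        # Conservative: also rejects a semicolon inside a string literal.
        raise ValueError("multiple statements are not allowed")
    cur = conn.execute(stmt)
    cols = [d[0] for d in cur.description]
    rows = [tuple(r) for r in cur.fetchmany(max_rows)]
    return cols, rows


# Demo warehouse: seed first, then switch the connection to query-only mode.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vendors(vendor_id TEXT, name TEXT)")
conn.execute("INSERT INTO vendors VALUES ('V-1000', 'Meridian Industrial')")
conn.commit()
# Second layer: SQLite itself refuses writes on this connection, which covers
# statements the regex lets through (for example WITH ... DELETE).
conn.execute("PRAGMA query_only = ON")

if __name__ == "__main__":
    print(guarded_query(conn, "SELECT name FROM vendors"))
```

The regex alone is not sufficient, because a `WITH` prefix can legally precede a `DELETE` or `UPDATE` in SQLite. That is why the sketch layers the pragma underneath it.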

backend/app/erp/chat.py ADDED

	@@ -0,0 +1,347 @@

+"""ERP DocIQ chatbot — ask questions over the simulated ERP knowledgebase.
+Four capabilities, all grounded in real data from the warehouse (no hallucinated
+figures):
+  • NLQ      — natural-language → SQL → rows  (text-to-SQL)
+  • analytics — aggregations / rankings / rates
+  • summary  — narrative roll-up of a report
+  • reasons  — "why" questions, explained from the underlying data
+Design for offline-first integrity: a deterministic intent+SQL library always
+produces the correct numbers (real SQL over the warehouse). When a small reasoning
+model (OpenBMB MiniCPM3-4B, routed via ModelRouter) is available, it is used to
+(a) generate SQL for questions outside the deterministic library, and (b) phrase
+the final answer / explanation — always over the real computed rows, so the model
+narrates facts rather than inventing them.
+"""
+from __future__ import annotations
+import re
+import time
+from typing import Callable, Optional
+from ..observability import log_event
+from .data import ERP_SCHEMA_DOC, EXAMPLE_QUESTIONS, get_warehouse
+# ── deterministic intent → (sql, intent, narrator) library ────────────────────
+# Each entry: keywords (any-match scoring), a SQL builder, an intent label, and a
+# narrator that turns the result rows into a baseline natural-language answer.
+def _fmt_usd(x) -> str:
+    try:
+        return f"${float(x):,.0f}"
+    except (TypeError, ValueError):
+        return str(x)
+def _q_spend_by_month(wh):
+    sql = ("SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries "
+           "GROUP BY period ORDER BY period")
+    cols, rows = wh.query(sql)
+    total = sum(r[1] for r in rows)
+    peak = max(rows, key=lambda r: r[1]) if rows else None
+    ans = (f"Total invoiced spend across {len(rows)} months is {_fmt_usd(total)}. "
+           f"The peak month was {peak[0]} at {_fmt_usd(peak[1])}." if peak else "No spend recorded.")
+    return sql, cols, rows, ans
+def _q_top_vendors(wh):
+    sql = ("SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices "
+           "FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id "
+           "GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5")
+    cols, rows = wh.query(sql)
+    lead = rows[0] if rows else None
+    ans = (f"Top vendor by spend is {lead[0]} at {_fmt_usd(lead[1])} across {lead[2]} invoices. "
+           f"The top 5 account for {_fmt_usd(sum(r[1] for r in rows))}." if lead else "No vendors.")
+    return sql, cols, rows, ans
+def _q_late_vendors(wh):
+    sql = ("SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) "
+           "THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices "
+           "FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id "
+           "WHERE i.status='paid' GROUP BY v.vendor_id "
+           "HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5")
+    cols, rows = wh.query(sql)
+    lead = rows[0] if rows else None
+    ans = (f"{lead[0]} had the most late payments ({lead[1]} of {lead[2]} paid invoices). "
+           "Late = paid after the vendor's net terms." if lead else "No late payments found.")
+    return sql, cols, rows, ans
+def _q_late_rate(wh):
+    sql = ("SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) "
+           "THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices "
+           "FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'")
+    cols, rows = wh.query(sql)
+    r = rows[0] if rows else None
+    ans = (f"The overall late-payment rate is {r[0]}% across {r[1]} paid invoices."
+           if r else "No paid invoices.")
+    return sql, cols, rows, ans
+def _q_spend_by_category(wh):
+    sql = ("SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l "
+           "JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id "
+           "WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC")
+    cols, rows = wh.query(sql)
+    lead = rows[0] if rows else None
+    ans = (f"{lead[0]} is the largest category at {_fmt_usd(lead[1])}, "
+           f"out of {_fmt_usd(sum(r[1] for r in rows))} total." if lead else "No spend.")
+    return sql, cols, rows, ans
+def _q_why_q2(wh):
+    sql = ("SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries "
+           "WHERE period >= '2026-04' AND period <= '2026-06' "
+           "GROUP BY period, account ORDER BY period, spend DESC")
+    cols, rows = wh.query(sql)
+    # Fixtures (Store-Fit-Out account) share of total Q2 spend
+    fx = sum(r[2] for r in rows if "Fit-Out" in (r[1] or ""))
+    tot = sum(r[2] for r in rows)
+    share = round(100 * fx / tot, 1) if tot else 0
+    ans = (f"Q2 2026 spend was {_fmt_usd(tot)}, of which the Store-Fit-Out (Fixtures) "
+           f"account was {_fmt_usd(fx)} — {share}% of the quarter. The rise is driven by a "
+           "store-remodel program: more Fixtures POs at higher quantities.")
+    return sql, cols, rows, ans
+def _q_below_reorder(wh):
+    sql = ("SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i "
+           "JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point "
+           "ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15")
+    cols, rows = wh.query(sql)
+    ans = (f"{len(rows)} SKU/region positions are below reorder point and need replenishment."
+           if rows else "All inventory is above reorder point.")
+    return sql, cols, rows, ans
+def _q_open_invoices(wh):
+    sql = ("SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'")
+    cols, rows = wh.query(sql)
+    r = rows[0] if rows else None
+    ans = (f"There is {_fmt_usd(r[0])} in open (unpaid) invoices across {r[1]} invoices."
+           if r and r[0] else "No open invoices.")
+    return sql, cols, rows, ans
+def _q_return_reasons(wh):
+    sql = ("SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds "
+           "FROM returns GROUP BY reason ORDER BY refunds DESC")
+    cols, rows = wh.query(sql)
+    lead = rows[0] if rows else None
+    ans = (f"'{lead[0]}' drives the most refunds at {_fmt_usd(lead[2])} ({lead[1]} returns)."
+           if lead else "No returns.")
+    return sql, cols, rows, ans
+def _q_ap_health(wh):
+    # composite — used by the "summarize AP health" ask
+    late = _q_late_rate(wh)[3]
+    openv = _q_open_invoices(wh)[3]
+    topv = _q_top_vendors(wh)[3]
+    sql = ("SELECT (SELECT COUNT(*) FROM invoices) AS invoices, "
+           "(SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, "
+           "(SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay")
+    cols, rows = wh.query(sql)
+    ans = f"AP health: {late} {openv} Avg days-to-pay is {rows[0][2]}. {topv}"
+    return sql, cols, rows, ans
+# (keywords, builder, intent)
+_LIBRARY: list[tuple[list[str], Callable, str]] = [
+    (["ap health", "accounts payable health", "summarize ap", "payables health", "ap summary"], _q_ap_health, "summary"),
+    (["why", "spike", "rise", "increase", "q2", "remodel", "driver"], _q_why_q2, "reasons"),
+    (["spend by month", "monthly spend", "spend per month", "spend by period", "total spend"], _q_spend_by_month, "analytics"),
+    (["top vendor", "biggest vendor", "vendor by spend", "largest vendor", "top 5 vendor"], _q_top_vendors, "analytics"),
+    (["late", "overdue", "paid late", "slow pay"], _q_late_vendors, "analytics"),
+    (["late rate", "late-payment rate", "on-time", "on time rate"], _q_late_rate, "analytics"),
+    (["spend by category", "category spend", "by category"], _q_spend_by_category, "analytics"),
+    (["reorder", "below reorder", "replenish", "stockout", "low stock"], _q_below_reorder, "analytics"),
+    (["open invoice", "unpaid", "outstanding", "open ap"], _q_open_invoices, "analytics"),
+    (["return", "refund", "rma"], _q_return_reasons, "analytics"),
+]
+def _match(question: str) -> Optional[tuple[Callable, str, int]]:
+    """Pick a deterministic template only when a real keyphrase is present.
+    Score = 3 × (words in matched keyphrases), plus 1 when at least two distinct
+    key words (longer than 3 characters) appear in the question. A longer, more
+    specific phrase ("late-payment rate") therefore beats a bare token ("late"),
+    and a question with *no* library keyphrase scores at most 1, so it falls
+    through to LLM text-to-SQL instead of being force-fit to the nearest template.
+    """
+    q = question.lower()
+    best, best_score, best_intent = None, 0, ""
+    for keys, fn, intent in _LIBRARY:
+        phrase_words = sum(len(k.split()) for k in keys if k in q)
+        distinct = {w for k in keys for w in k.split() if len(w) > 3}
+        overlap = sum(1 for w in distinct if w in q)
+        score = phrase_words * 3 + (1 if overlap >= 2 else 0)
+        if score > best_score:
+            best, best_score, best_intent = fn, score, intent
+    if best and best_score >= 3:  # >=3 ⇒ at least one full keyphrase matched
+        return best, best_intent, best_score
+    return None
+def _intent_of(question: str) -> str:
+    q = question.lower()
+    if any(w in q for w in ("why", "reason", "explain", "driver", "cause")):
+        return "reasons"
+    if any(w in q for w in ("summar", "overview", "health", "how are")):
+        return "summary"
+    if any(w in q for w in ("how many", "total", "average", "rate", "top", "rank", "by month",
+                            "by category", "count", "sum")):
+        return "analytics"
+    return "nlq"
+class ErpChat:
+    def __init__(self, settings, router=None, warehouse=None, metrics=None, db=None) -> None:
+        self.settings = settings
+        self.router = router
+        self.wh = warehouse or get_warehouse(settings)
+        self.metrics = metrics
+        self.db = db
+    # --- public ---------------------------------------------------------------
+    def answer(self, question: str, use_llm: bool = True, run_id: str = "erp-chat") -> dict:
+        t0 = time.perf_counter()
+        question = (question or "").strip()
+        if not question:
+            return {"answer": "Ask me about ERP spend, vendors, payments, inventory or returns.",
+                    "intent": "help", "examples": EXAMPLE_QUESTIONS}
+        engine = "deterministic"
+        model = None
+        cost = 0.0
+        sql = cols = rows = None
+        baseline = ""
+        intent = _intent_of(question)
+        m = _match(question)
+        if m:
+            fn, intent, _score = m
+            try:
+                sql, cols, rows, baseline = fn(self.wh)
+            except Exception as e:
+                log_event("error", "ERP deterministic query failed", error=str(e), q=question)
+                baseline = f"Query error: {e}"
+        elif use_llm and self._llm_available():
+            # text-to-SQL for questions outside the deterministic library
+            sql, cols, rows, baseline, model, cost = self._llm_nlq(question, run_id)
+            engine = "llm-sql"
+            intent = "nlq"
+        answer = baseline
+        # Grounded NL phrasing / explanation via the small reasoning model.
+        if use_llm and self._llm_available() and rows is not None and intent in ("summary", "reasons", "analytics"):
+            phrased, pmodel, pcost = self._llm_phrase(question, intent, sql, cols, rows, baseline, run_id)
+            if phrased:
+                answer, model, cost = phrased, pmodel or model, cost + pcost
+                engine = engine if engine == "llm-sql" else "deterministic+llm"
+        if not m and engine == "deterministic" and rows is None:
+            # nothing matched and no model — guide the user
+            answer = ("I can answer that best with one of these: " +
+                      "; ".join(EXAMPLE_QUESTIONS[:5]) + ".")
+            intent = "help"
+        latency_ms = round((time.perf_counter() - t0) * 1000, 1)
+        result = {
+            "question": question, "intent": intent, "engine": engine,
+            "model": model or "deterministic", "sql": sql, "columns": cols,
+            "rows": (rows or [])[:50], "row_count": len(rows or []),
+            "answer": answer, "latency_ms": latency_ms, "cost_usd": round(cost, 6),
+        }
+        self._record(result, run_id)
+        return result
+    # --- internals ------------------------------------------------------------
+    def _llm_available(self) -> bool:
+        if not self.router:
+            return False
+        reg = self.router.registry
+        return any(getattr(reg, n, None) and getattr(reg, n).available()
+                   for n in ("minicpm", "anthropic", "gemini", "local"))
+    def _llm_nlq(self, question: str, run_id: str):
+        from ..providers.base import CacheBlock, LLMRequest
+        sys_prompt = (
+            "You are a text-to-SQL assistant for a read-only SQLite ERP warehouse. "
+            "Given the schema and a question, return ONLY one SQLite SELECT query (no prose, "
+            "no markdown fences, no semicolon). Use only the tables/columns in the schema.\n\n"
+            + ERP_SCHEMA_DOC)
+        req = LLMRequest(
+            system_blocks=[CacheBlock(sys_prompt, cacheable=True)],
+            user_content=f"Question: {question}\nSQL:",
+            task="nlq", max_tokens=256, temperature=0.0)
+        resp = self.router.run(req, run_id)
+        sql = _clean_sql(resp.text)
+        cost = getattr(resp, "cost_usd", 0.0) or 0.0
+        try:
+            cols, rows = self.wh.query(sql)
+            baseline = f"Returned {len(rows)} row(s)." if rows else "No rows matched."
+        except Exception as e:
+            cols, rows, baseline = None, None, f"Generated SQL could not run safely: {e}"
+        return sql, cols, rows, baseline, resp.model, cost
+    def _llm_phrase(self, question, intent, sql, cols, rows, baseline, run_id):
+        from ..providers.base import CacheBlock, LLMRequest
+        verb = {"summary": "Write a concise executive summary",
+                "reasons": "Explain the most likely reason(s)",
+                "analytics": "Give a one-paragraph analytical readout"}.get(intent, "Answer")
+        sys_prompt = (
+            "You are an ERP financial analyst. Using ONLY the query result provided, answer the "
+            "user's question. Cite concrete figures from the rows; never invent numbers. Be brief "
+            "(2-4 sentences).")
+        table = _rows_to_text(cols, rows)
+        req = LLMRequest(
+            system_blocks=[CacheBlock(sys_prompt, cacheable=True)],
+            user_content=f"Question: {question}\n\nQuery result:\n{table}\n\nBaseline fact: {baseline}\n\n{verb}:",
+            task="summarize", max_tokens=300, temperature=0.2)
+        resp = self.router.run(req, run_id)
+        if resp.error or not resp.text.strip():
+            return None, None, 0.0
+        return resp.text.strip(), resp.model, (getattr(resp, "cost_usd", 0.0) or 0.0)
+    def _record(self, result: dict, run_id: str) -> None:
+        try:
+            log_event("info", "ERP chat", intent=result["intent"], engine=result["engine"],
+                      model=result["model"], rows=result["row_count"], q=result["question"][:120])
+        except Exception:
+            pass
+        if self.db is not None:
+            try:
+                self.db.audit("erp_chat", run_id=run_id,
+                              detail={"q": result["question"][:200], "intent": result["intent"],
+                                      "engine": result["engine"], "rows": result["row_count"]})
+            except Exception:
+                pass
+def _clean_sql(text: str) -> str:
+    t = (text or "").strip()
+    t = re.sub(r"^```(?:sql)?", "", t, flags=re.IGNORECASE).strip()
+    t = re.sub(r"```$", "", t).strip()
+    # take the first statement only
+    t = t.split(";")[0].strip()
+    m = re.search(r"\b(select|with)\b.+", t, re.IGNORECASE | re.DOTALL)
+    return m.group(0).strip() if m else t
+def _rows_to_text(cols, rows, limit: int = 25) -> str:
+    if not cols:
+        return "(no rows)"
+    lines = [" | ".join(map(str, cols))]
+    for r in (rows or [])[:limit]:
+        lines.append(" | ".join("" if v is None else str(v) for v in r))
+    return "\n".join(lines)
+def answer_question(question: str, settings, router=None, warehouse=None, metrics=None,
+                    db=None, use_llm: bool = True) -> dict:
+    return ErpChat(settings, router=router, warehouse=warehouse, metrics=metrics,
+                   db=db).answer(question, use_llm=use_llm)

backend/app/erp/data.py ADDED

	@@ -0,0 +1,275 @@

+"""Simulated retail ERP data warehouse (SQLite) — the knowledgebase the ERP DocIQ
+chatbot reasons over, and the source domain for the fine-tuning dataset.
+Deterministic: a fixed RNG seed makes the whole warehouse reproducible, so NLQ
+answers, analytics, evals and the fine-tune dataset are all stable across runs.
+Schema (a retail accounts-payable / procurement slice):
+  vendors(vendor_id, name, region, category, payment_terms, on_time_rate, risk_tier)
+  products(sku, name, category, unit_cost, unit_price)
+  purchase_orders(po_id, vendor_id, order_date, status, region, amount)
+  po_lines(po_id, sku, qty, unit_price, line_total)
+  invoices(invoice_id, po_id, vendor_id, invoice_date, due_date, amount, tax,
+           total, status, paid_date, days_to_pay)
+  gl_entries(entry_id, invoice_id, account, cost_center, period, amount)
+  inventory(sku, region, on_hand, reorder_point, monthly_demand)
+  returns(return_id, sku, region, return_date, qty, reason, refund_amount)
+This is intentionally a *small* but internally-consistent dataset: invoices roll up
+from PO lines, GL entries roll up from invoices, returns reference real SKUs, so
+analytics ("why did spend spike in Q2", "top vendors by late payments") are answerable
+from the data rather than canned.
+"""
+from __future__ import annotations
+import random
+import sqlite3
+import threading
+from datetime import date, timedelta
+from pathlib import Path
+SEED = 20260101
+REGIONS = ["Northeast", "Midwest", "South", "West"]
+CATEGORIES = ["Fixtures", "Electronics", "Apparel", "Grocery", "Packaging", "Logistics"]
+RISK = ["low", "low", "low", "medium", "medium", "high"]
+VENDOR_NAMES = [
+    "Meridian Industrial", "Nordic Fixture Works", "BrightLite Electronics", "Halcyon Build",
+    "Cascade Apparel Co", "Summit Packaging", "BlueRiver Logistics", "Orchard Grocery Supply",
+    "PrimeEdge Components", "Vertex Retail Systems", "Granite State Goods", "Copperline Textiles",
+    "Lakeside Distribution", "IronGate Hardware", "Pinnacle Foods", "Aurora Display Group",
+]
+PRODUCTS = [
+    ("SKU-1001", "Heavy-gauge shelf unit", "Fixtures", 142.0, 189.0),
+    ("SKU-1002", "LED retail strip 2m", "Electronics", 14.5, 22.4),
+    ("SKU-1003", "Endcap display birch", "Fixtures", 232.0, 310.0),
+    ("SKU-1004", "Thermal receipt rolls", "Packaging", 1.1, 2.4),
+    ("SKU-1005", "Barcode scanner USB", "Electronics", 38.0, 59.0),
+    ("SKU-1006", "Store associate polo", "Apparel", 9.2, 18.0),
+    ("SKU-1007", "Pallet wrap roll", "Packaging", 18.0, 27.5),
+    ("SKU-1008", "Organic coffee 1kg", "Grocery", 8.5, 14.0),
+    ("SKU-1009", "Freight pallet move", "Logistics", 22.0, 35.0),
+    ("SKU-1010", "Security tag pack", "Electronics", 4.0, 7.5),
+    ("SKU-1011", "Checkout counter mat", "Fixtures", 26.0, 41.0),
+    ("SKU-1012", "Reusable tote bag", "Apparel", 2.3, 5.0),
+]
+ACCOUNTS = {
+    "Fixtures": "5000-Store-Fit-Out", "Electronics": "5100-IT-Equipment",
+    "Apparel": "5200-Uniforms", "Grocery": "5300-COGS-Grocery",
+    "Packaging": "5400-Supplies", "Logistics": "5500-Freight",
+}
+RETURN_REASONS = ["damaged", "wrong item", "defective", "overstock", "late delivery"]
+class ErpWarehouse:
+    """Read-mostly SQLite warehouse with a guarded NLQ query surface."""
+    def __init__(self, db_path: str | Path) -> None:
+        self.db_path = Path(db_path)
+        self.db_path.parent.mkdir(parents=True, exist_ok=True)
+        self._lock = threading.Lock()
+        self._conn = sqlite3.connect(str(self.db_path), check_same_thread=False)
+        self._conn.row_factory = sqlite3.Row
+        if not self._has_data():
+            self._build()
+    def _has_data(self) -> bool:
+        try:
+            return self._conn.execute("SELECT 1 FROM invoices LIMIT 1").fetchone() is not None
+        except sqlite3.OperationalError:
+            return False
+    # --- schema + seed --------------------------------------------------------
+    def _build(self) -> None:
+        rng = random.Random(SEED)
+        with self._lock:
+            c = self._conn
+            c.executescript(
+                """
+                DROP TABLE IF EXISTS vendors; DROP TABLE IF EXISTS products;
+                DROP TABLE IF EXISTS purchase_orders; DROP TABLE IF EXISTS po_lines;
+                DROP TABLE IF EXISTS invoices; DROP TABLE IF EXISTS gl_entries;
+                DROP TABLE IF EXISTS inventory; DROP TABLE IF EXISTS returns;
+                CREATE TABLE vendors(vendor_id TEXT PRIMARY KEY, name TEXT, region TEXT,
+                    category TEXT, payment_terms TEXT, on_time_rate REAL, risk_tier TEXT);
+                CREATE TABLE products(sku TEXT PRIMARY KEY, name TEXT, category TEXT,
+                    unit_cost REAL, unit_price REAL);
+                CREATE TABLE purchase_orders(po_id TEXT PRIMARY KEY, vendor_id TEXT,
+                    order_date TEXT, status TEXT, region TEXT, amount REAL);
+                CREATE TABLE po_lines(po_id TEXT, sku TEXT, qty INTEGER, unit_price REAL,
+                    line_total REAL);
+                CREATE TABLE invoices(invoice_id TEXT PRIMARY KEY, po_id TEXT, vendor_id TEXT,
+                    invoice_date TEXT, due_date TEXT, amount REAL, tax REAL, total REAL,
+                    status TEXT, paid_date TEXT, days_to_pay INTEGER);
+                CREATE TABLE gl_entries(entry_id TEXT PRIMARY KEY, invoice_id TEXT, account TEXT,
+                    cost_center TEXT, period TEXT, amount REAL);
+                CREATE TABLE inventory(sku TEXT, region TEXT, on_hand INTEGER,
+                    reorder_point INTEGER, monthly_demand INTEGER);
+                CREATE TABLE returns(return_id TEXT, sku TEXT, region TEXT, return_date TEXT,
+                    qty INTEGER, reason TEXT, refund_amount REAL);
+                """
+            )
+            # vendors
+            vendors = []
+            for i, nm in enumerate(VENDOR_NAMES):
+                cat = CATEGORIES[i % len(CATEGORIES)]
+                vid = f"V-{1000+i}"
+                terms = rng.choice(["Net 30", "Net 30", "Net 45", "Net 60"])
+                on_time = round(rng.uniform(0.72, 0.99), 3)
+                vendors.append((vid, nm, rng.choice(REGIONS), cat, terms, on_time, RISK[i % len(RISK)]))
+            c.executemany("INSERT INTO vendors VALUES (?,?,?,?,?,?,?)", vendors)
+            c.executemany("INSERT INTO products VALUES (?,?,?,?,?)", PRODUCTS)
+            prod_by_cat: dict[str, list] = {}
+            for p in PRODUCTS:
+                prod_by_cat.setdefault(p[2], []).append(p)
+            # 12 months of POs → invoices → GL. A deliberate Q2 spend spike on Fixtures
+            # (store-remodel program) makes "why did spend rise" answerable from data.
+            po_n = inv_n = gl_n = 0
+            start = date(2025, 7, 1)
+            for month in range(12):
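+                # step by calendar month; a fixed 30-day stride repeats 2025-07 and never reaches 2026-06
+                y_off, m0 = divmod(start.month - 1 + month, 12)
+                m_date = date(start.year + y_off, m0 + 1, 1)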
+                period = m_date.strftime("%Y-%m")
+                # base order volume, with a Fixtures surge in 2026 Q2 (months 9-11)
+                n_orders = rng.randint(10, 16)
+                surge = month in (9, 10, 11)
+                for _ in range(n_orders):
+                    v = rng.choice(vendors)
+                    vid, vcat, vregion, terms, on_time = v[0], v[3], v[2], v[4], v[5]
+                    # bias product to vendor category; surge picks Fixtures
+                    cat = "Fixtures" if (surge and rng.random() < 0.45) else vcat
+                    pool = prod_by_cat.get(cat) or PRODUCTS
+                    po_n += 1
+                    po_id = f"PO-{2000+po_n}"
+                    od = m_date + timedelta(days=rng.randint(0, 27))
+                    n_lines = rng.randint(1, 4)
+                    amount = 0.0
+                    lines = []
+                    for _ in range(n_lines):
+                        p = rng.choice(pool)
+                        qty = rng.randint(2, 40) * (3 if (surge and cat == "Fixtures") else 1)
+                        unit = round(p[4] * rng.uniform(0.95, 1.05), 2)
+                        lt = round(qty * unit, 2)
+                        amount += lt
+                        lines.append((po_id, p[0], qty, unit, lt))
+                    status = rng.choice(["received", "received", "received", "open", "cancelled"])
+                    c.execute("INSERT INTO purchase_orders VALUES (?,?,?,?,?,?)",
+                              (po_id, vid, od.isoformat(), status, vregion, round(amount, 2)))
+                    c.executemany("INSERT INTO po_lines VALUES (?,?,?,?,?)", lines)
+                    if status == "cancelled":
+                        continue
+                    # invoice
+                    inv_n += 1
+                    inv_id = f"INV-{5000+inv_n}"
+                    idate = od + timedelta(days=rng.randint(1, 10))
+                    term_days = int(terms.split()[1])
+                    due = idate + timedelta(days=term_days)
+                    tax = round(amount * 0.0825, 2)
+                    total = round(amount + tax, 2)
+                    paid = rng.random() < 0.82
+                    if paid:
+                        # late if vendor has low on-time rate
+                        late = rng.random() > on_time
+                        dd = rng.randint(term_days + 3, term_days + 25) if late else rng.randint(8, term_days)
+                        paid_date = (idate + timedelta(days=dd)).isoformat()
+                        istatus = "paid"
+                        days_to_pay = dd
+                    else:
+                        paid_date, istatus, days_to_pay = None, "open", None
+                    c.execute("INSERT INTO invoices VALUES (?,?,?,?,?,?,?,?,?,?,?)",
+                              (inv_id, po_id, vid, idate.isoformat(), due.isoformat(),
+                               round(amount, 2), tax, total, istatus, paid_date, days_to_pay))
+                    gl_n += 1
+                    c.execute("INSERT INTO gl_entries VALUES (?,?,?,?,?,?)",
+                              (f"GL-{9000+gl_n}", inv_id, ACCOUNTS.get(cat, "5900-Other"),
+                               f"CC-{vregion[:3].upper()}", period, total))
+            # inventory + returns
+            for p in PRODUCTS:
+                for r in REGIONS:
+                    dem = rng.randint(20, 200)
+                    c.execute("INSERT INTO inventory VALUES (?,?,?,?,?)",
+                              (p[0], r, rng.randint(0, 400), int(dem * 0.5), dem))
+            ret_n = 0
+            for _ in range(60):
+                p = rng.choice(PRODUCTS)
+                ret_n += 1
+                rdate = (start + timedelta(days=rng.randint(0, 360)))
+                qty = rng.randint(1, 12)
+                c.execute("INSERT INTO returns VALUES (?,?,?,?,?,?,?)",
+                          (f"R-{7000+ret_n}", p[0], rng.choice(REGIONS), rdate.isoformat(),
+                           qty, rng.choice(RETURN_REASONS), round(qty * p[4], 2)))
+            c.commit()
+    # --- guarded query surface (for NLQ) --------------------------------------
+    def query(self, sql: str, limit: int = 200) -> tuple[list[str], list[list]]:
+        """Execute a single read-only SELECT. Raises ValueError on anything unsafe."""
+        safe = sql.strip().rstrip(";").strip()
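+        # collapse newlines/tabs so the space-delimited token checks below cannot be bypassed
+        low = " ".join(safe.lower().split())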
+        if not low.startswith(("select", "with")):
+            raise ValueError("only SELECT/WITH queries are allowed")
+        forbidden = (" insert ", " update ", " delete ", " drop ", " alter ", " create ",
+                     " attach ", " pragma ", " replace ", "--", ";")
+        padded = f" {low} "
+        for f in forbidden:
+            if f in padded:
+                raise ValueError(f"forbidden token in query: {f.strip()!r}")
+        if " limit " not in low:
+            safe = f"{safe} LIMIT {limit}"
+        with self._lock:
+            cur = self._conn.execute(safe)
+            rows = cur.fetchall()
+            cols = [d[0] for d in cur.description]
+        return cols, [list(r) for r in rows]
+    def scalar(self, sql: str):
+        cols, rows = self.query(sql, limit=1)
+        return rows[0][0] if rows else None
+    def table_counts(self) -> dict:
+        out = {}
+        for t in ("vendors", "products", "purchase_orders", "po_lines", "invoices",
+                  "gl_entries", "inventory", "returns"):
+            out[t] = self.scalar(f"SELECT COUNT(*) FROM {t}")
+        return out
+# Compact schema description handed to the NLQ model (kept byte-stable for caching).
+ERP_SCHEMA_DOC = """ERP warehouse schema (SQLite, retail procurement / AP):
+- vendors(vendor_id, name, region, category, payment_terms, on_time_rate, risk_tier)
+- products(sku, name, category, unit_cost, unit_price)
+- purchase_orders(po_id, vendor_id, order_date, status, region, amount)
+- po_lines(po_id, sku, qty, unit_price, line_total)
+- invoices(invoice_id, po_id, vendor_id, invoice_date, due_date, amount, tax, total, status, paid_date, days_to_pay)
+- gl_entries(entry_id, invoice_id, account, cost_center, period, amount)   -- period is 'YYYY-MM'
+- inventory(sku, region, on_hand, reorder_point, monthly_demand)
+- returns(return_id, sku, region, return_date, qty, reason, refund_amount)
+Notes: invoices.status in ('paid','open'); a payment is LATE when days_to_pay > payment_terms days.
+Spend = invoices.total. Dates are ISO 'YYYY-MM-DD'. gl_entries.period groups spend by month."""
+EXAMPLE_QUESTIONS = [
+    "What was total invoiced spend by month?",
+    "Who are the top 5 vendors by spend?",
+    "Which vendors paid late most often?",
+    "Why did spend rise in Q2 2026?",
+    "What is the late-payment rate overall?",
+    "Show spend by category.",
+    "Summarize accounts payable health.",
+    "Which SKUs are below reorder point?",
+    "What is the total value of open (unpaid) invoices?",
+    "Top return reasons by refund amount?",
+]
+_WAREHOUSE: ErpWarehouse | None = None
+def get_warehouse(settings) -> ErpWarehouse:
+    """Process-wide singleton, seeded under the writable dir."""
+    global _WAREHOUSE
+    if _WAREHOUSE is None:
+        path = getattr(settings, "erp_db_path", None) or (settings.writable_dir / "erp.db")
+        _WAREHOUSE = ErpWarehouse(path)
+    return _WAREHOUSE
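
Reviewer note on the guarded query surface: the keyword guard in `ErpWarehouse.query` is a string check, so it is worth pairing with a database-level safeguard. The sketch below is an assumption, not code from this commit: `guarded_select` is a hypothetical helper that mirrors the guard on a toy in-memory table, and `PRAGMA query_only` is a standard SQLite pragma added here as a second layer.

```python
import sqlite3


def guarded_select(conn, sql, limit=200):
    """Hypothetical helper mirroring the keyword guard in ErpWarehouse.query."""
    safe = sql.strip().rstrip(";").strip()
    # Collapse all whitespace so "DELETE\nFROM" is seen as " delete from".
    low = " ".join(safe.lower().split())
    if not low.startswith(("select", "with")):
        raise ValueError("only SELECT/WITH queries are allowed")
    padded = f" {low} "
    for token in (" insert ", " update ", " delete ", " drop ", ";", "--"):
        if token in padded:
            raise ValueError(f"forbidden token: {token.strip()!r}")
    if " limit " not in padded:
        safe = f"{safe} LIMIT {limit}"
    return conn.execute(safe).fetchall()


# Toy stand-in for the warehouse (assumption: two invoices, two columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices(invoice_id TEXT, total REAL)")
conn.executemany("INSERT INTO invoices VALUES (?, ?)",
                 [("INV-1", 10.0), ("INV-2", 20.0)])
conn.commit()
# Second layer (not in the commit): ask SQLite itself to refuse writes.
conn.execute("PRAGMA query_only = ON")

rows = guarded_select(conn, "SELECT invoice_id FROM invoices ORDER BY invoice_id", limit=1)

try:
    guarded_select(conn, "WITH x AS (SELECT 1)\nDELETE FROM invoices")
    rejected = False
except ValueError:
    rejected = True
```

Here `rows` holds a single row because the helper appends `LIMIT 1`, and the multi-line `WITH ... DELETE` statement is rejected before it reaches SQLite.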

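Reviewer note on periods: the 'YYYY-MM' value written to `gl_entries.period` depends on how the loop's month index is turned into a date. A minimal sketch, using only the start date from the seeding code, compares a fixed 30-day stride with whole calendar-month steps. The function names are illustrative, not part of the commit.

```python
from datetime import date, timedelta

start = date(2025, 7, 1)


def stride_period(month):
    """Period label when each 'month' is a fixed 30-day step from start."""
    return (start + timedelta(days=30 * month)).strftime("%Y-%m")


def calendar_period(month):
    """Period label when stepping whole calendar months from start."""
    years, month0 = divmod(start.month - 1 + month, 12)
    return date(start.year + years, month0 + 1, 1).strftime("%Y-%m")


strided = [stride_period(m) for m in range(12)]
calendar = [calendar_period(m) for m in range(12)]

print(strided[9:12])   # labels for month indices 9-11 under a 30-day stride
print(calendar[9:12])  # labels for the same indices under calendar stepping
```

With a 30-day stride, indices 9 to 11 land on 2026-03 through 2026-05, July 2025 appears twice, and June 2026 never appears. Calendar stepping gives 2026-04 through 2026-06, which is the range the `why_q2` template filters on.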
backend/app/erp/finetune.py ADDED

	@@ -0,0 +1,259 @@

+"""ERP-domain fine-tuning: dataset + an offline domain-adaptation trainer.
+Two honest paths share ONE dataset, built from the simulated ERP knowledgebase:
+  • PRODUCTION (GPU):  `scripts/finetune_erp.py --backend hf` LoRA-fine-tunes the
+    OpenBMB **MiniCPM3-4B** text model (PEFT + TRL SFTTrainer) on the JSONL below.
+    That is the "fine-tune a small model from the list" deliverable.
+  • OFFLINE DEMO (CPU, runs anywhere — no torch/GPU): `--backend local` trains a
+    compact ERP **NLQ-routing head** (multinomial softmax over hashed n-gram
+    features, pure numpy) on the SAME examples, with a real train/test split, a
+    real training-loss curve, and a real BEFORE→AFTER accuracy gain. This is the
+    small model's domain-adaptation layer — it demonstrates the training loop +
+    eval methodology end-to-end so the story is complete without a GPU.
+Both report into `erp_finetune_report.json` (served at /api/erp/finetune-report).
+"""
+from __future__ import annotations
+import hashlib
+import json
+import math
+import random
+import time
+from pathlib import Path
+import numpy as np
+from .data import get_warehouse
+# ── canonical ERP NLQ templates (label space) + rich paraphrases ──────────────
+# Each template is one SQL "skill" the model must learn to route to from varied
+# natural phrasings. Held-out paraphrases test generalization, not memorization.
+TEMPLATES = [
+    {"id": "spend_by_month", "intent": "analytics",
+     "sql": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period",
+     "paraphrases": [
+         "What was total invoiced spend by month?", "Show monthly spend.",
+         "Break spend down per month.", "How much did we invoice each month?",
+         "Monthly invoiced spend trend?", "Spend by period please.",
+         "Give me the month-by-month spend.", "Total spend grouped by month.",
+         "What's our spend over the months?", "Plot spend per month.",
+         "Monthly AP spend totals?", "How has spend trended month to month?"]},
+    {"id": "top_vendors", "intent": "analytics",
+     "sql": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5",
+     "paraphrases": [
+         "Who are the top 5 vendors by spend?", "Which vendors do we spend the most with?",
+         "List our biggest suppliers.", "Top vendors by total spend?",
+         "Rank vendors by spend.", "Which suppliers cost us the most?",
+         "Show the five largest vendors.", "Biggest vendors by invoice value?",
+         "Our highest-spend vendors?", "Top suppliers ranked by spend.",
+         "Which vendors get the most of our money?", "Largest vendors please."]},
+    {"id": "late_vendors", "intent": "analytics",
+     "sql": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5",
+     "paraphrases": [
+         "Which vendors paid late most often?", "Who are our worst late-paying vendors?",
+         "Vendors with the most overdue payments?", "Which suppliers do we pay late?",
+         "Show vendors with frequent late payments.", "Worst offenders for late payment?",
+         "Which vendors are habitually overdue?", "List vendors by late-payment count.",
+         "Who keeps getting paid past terms?", "Late payers among our vendors?",
+         "Vendors most often paid after due date?", "Which suppliers have payment delays?"]},
+    {"id": "late_rate", "intent": "analytics",
+     "sql": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'",
+     "paraphrases": [
+         "What is the late-payment rate overall?", "What percent of invoices are paid late?",
+         "Our overall late payment percentage?", "How often do we pay late, as a rate?",
+         "Share of late payments?", "What fraction of payments miss terms?",
+         "Late-payment ratio across all invoices?", "Overall on-time vs late rate?",
+         "What's our late payment rate?", "Percentage of overdue payments overall?",
+         "How bad is our late-payment rate?", "Give the global late payment percentage."]},
+    {"id": "spend_by_category", "intent": "analytics",
+     "sql": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC",
+     "paraphrases": [
+         "Show spend by category.", "How much do we spend per product category?",
+         "Category-level spend breakdown?", "Spend grouped by category.",
+         "Which categories cost the most?", "Break down spend across categories.",
+         "Spend per category please.", "What's our category spend mix?",
+         "Total spend for each category?", "Categories ranked by spend.",
+         "Where does spend go by category?", "Category spend totals?"]},
+    {"id": "why_q2", "intent": "reasons",
+     "sql": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC",
+     "paraphrases": [
+         "Why did spend rise in Q2 2026?", "What drove the Q2 spend increase?",
+         "Explain the spend spike in Q2.", "Reason for higher spending in Q2 2026?",
+         "Why was Q2 so expensive?", "What caused the second-quarter cost jump?",
+         "Account for the Q2 2026 spend surge.", "Why is Q2 spend up?",
+         "What's behind the Q2 increase?", "Drivers of the Q2 spend rise?",
+         "Why did costs climb in Q2 2026?", "Explain why Q2 spend went up."]},
+    {"id": "below_reorder", "intent": "analytics",
+     "sql": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15",
+     "paraphrases": [
+         "Which SKUs are below reorder point?", "What needs replenishing?",
+         "Show items under their reorder level.", "Which products are low on stock?",
+         "List SKUs below reorder threshold.", "What should we reorder?",
+         "Inventory below reorder point?", "Which items risk stockout?",
+         "Stock positions under reorder point?", "What's running low in inventory?",
+         "SKUs needing replenishment?", "Which products fell below reorder?"]},
+    {"id": "open_invoices", "intent": "analytics",
+     "sql": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'",
+     "paraphrases": [
+         "What is the total value of open invoices?", "How much do we owe in unpaid invoices?",
+         "Outstanding invoice value?", "Total open AP balance?",
+         "Value of unpaid invoices?", "How much is still open in payables?",
+         "Sum of open invoices?", "What's our outstanding payables total?",
+         "Unpaid invoice amount overall?", "Open invoice liability?",
+         "How much AP is still open?", "Total of invoices not yet paid?"]},
+    {"id": "return_reasons", "intent": "analytics",
+     "sql": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC",
+     "paraphrases": [
+         "Top return reasons by refund amount?", "Why are products being returned?",
+         "Biggest return reasons by refund value?", "Break down returns by reason.",
+         "Which return reasons cost the most?", "Return reasons ranked by refunds?",
+         "What drives our refunds?", "Show returns grouped by reason.",
+         "Most costly return reasons?", "Refund totals per return reason?",
+         "What are the leading causes of returns?", "Return reason breakdown by money?"]},
+    {"id": "ap_health", "intent": "summary",
+     "sql": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay",
+     "paraphrases": [
+         "Summarize accounts payable health.", "Give me an AP health overview.",
+         "How healthy are our payables?", "Overall accounts payable summary?",
+         "Summarize our AP position.", "What's the state of accounts payable?",
+         "AP health check please.", "Overview of payables health?",
+         "How are we doing on accounts payable?", "Summarize payables status.",
+         "Give an executive AP summary.", "State of our AP overall?"]},
+]
+LABELS = [t["id"] for t in TEMPLATES]
+SQL_BY_ID = {t["id"]: t["sql"] for t in TEMPLATES}
+INTENT_BY_ID = {t["id"]: t["intent"] for t in TEMPLATES}
+def build_dataset(seed: int = 7) -> list[dict]:
+    """Flatten templates+paraphrases into instruction-tuning examples (the JSONL)."""
+    rng = random.Random(seed)
+    rows = []
+    for t in TEMPLATES:
+        for q in t["paraphrases"]:
+            rows.append({
+                "task": "nlq", "intent": t["intent"], "template": t["id"],
+                "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.",
+                "input": q, "output": t["sql"],
+            })
+    rng.shuffle(rows)
+    return rows
+# ── offline domain-adaptation trainer (pure numpy) ────────────────────────────
+def _hash_ngrams(text: str, dim: int = 4096) -> np.ndarray:
+    """Hashing-trick feature vector: word unigrams/bigrams + char 3-grams."""
+    text = (text or "").lower()
+    toks = []
+    words = [w for w in "".join(c if c.isalnum() else " " for c in text).split() if w]
+    toks += words
+    toks += [f"{words[i]}_{words[i+1]}" for i in range(len(words) - 1)]
+    s = f" {text} "
+    toks += [s[i:i+3] for i in range(len(s) - 2)]
+    v = np.zeros(dim, dtype=np.float32)
+    for tk in toks:
+        h = int(hashlib.md5(tk.encode()).hexdigest(), 16)
+        v[h % dim] += 1.0
+        if h % 2:  # odd hashes also down-weight a second bucket (a variant of signed hashing, not the standard +/-1 sign trick)
+            v[(h >> 1) % dim] -= 0.5
+    n = np.linalg.norm(v)
+    return v / n if n else v
+class ErpNlqRouter:
+    """Multinomial softmax classifier (numpy) — the ERP NLQ routing head."""
+    def __init__(self, dim: int = 4096, n_classes: int = len(LABELS), seed: int = 0) -> None:
+        self.dim, self.K = dim, n_classes
+        # small random init ⇒ an untrained head predicts ~uniformly (chance baseline),
+        # an honest "before" reference rather than a degenerate always-class-0.
+        rng = np.random.default_rng(seed)
+        self.W = (rng.normal(0, 0.01, (dim, n_classes))).astype(np.float32)
+        self.b = np.zeros(n_classes, dtype=np.float32)
+    @property
+    def n_params(self) -> int:
+        return int(self.W.size + self.b.size)
+    def _logits(self, X):
+        return X @ self.W + self.b
+    @staticmethod
+    def _softmax(z):
+        z = z - z.max(axis=1, keepdims=True)
+        e = np.exp(z)
+        return e / e.sum(axis=1, keepdims=True)
+    def fit(self, X, y, epochs=250, lr=0.5, l2=1e-4, seed=0):
+        rng = np.random.default_rng(seed)
+        n = X.shape[0]
+        Y = np.eye(self.K, dtype=np.float32)[y]
+        losses = []
+        for ep in range(epochs):
+            idx = rng.permutation(n)
+            P = self._softmax(self._logits(X[idx]))
+            loss = -np.mean(np.sum(Y[idx] * np.log(P + 1e-9), axis=1)) + l2 * np.sum(self.W ** 2)
+            g = (P - Y[idx]) / n
+            self.W -= lr * (X[idx].T @ g + 2 * l2 * self.W)
+            self.b -= lr * g.sum(axis=0)
+            losses.append(round(float(loss), 4))
+        return losses
+    def predict(self, X):
+        return self._logits(X).argmax(axis=1)
+def _featurize(texts, dim=4096):
+    return np.vstack([_hash_ngrams(t, dim) for t in texts])
+def run_offline_finetune(settings, seed: int = 7, epochs: int = 400) -> dict:
+    """Train the ERP NLQ router on the dataset; report BEFORE→AFTER + loss curve."""
+    data = build_dataset(seed)
+    X = _featurize([d["input"] for d in data])
+    y = np.array([LABELS.index(d["template"]) for d in data])
+    # split paraphrases so the test set is unseen phrasings of seen skills
+    rng = np.random.default_rng(seed)
+    perm = rng.permutation(len(data))
+    n_test = max(len(data) // 5, len(LABELS))
+    te, tr = perm[:n_test], perm[n_test:]
+    model = ErpNlqRouter(dim=X.shape[1])
+    # BEFORE: untrained head (random/zero init) — chance-level baseline
+    before_acc = float((model.predict(X[te]) == y[te]).mean())
+    losses = model.fit(X[tr], y[tr], epochs=epochs, seed=seed)
+    after_pred = model.predict(X[te])
+    after_acc = float((after_pred == y[te]).mean())
+    # end-to-end: does the SQL the router selected actually run against the warehouse?
+    wh = get_warehouse(settings)
+    exec_ok = 0
+    for pred in after_pred:
+        try:
+            wh.query(SQL_BY_ID[LABELS[pred]])
+            exec_ok += 1
+        except Exception:
+            pass
+    exec_rate = round(exec_ok / len(te), 3)
+    return {
+        "kind": "offline-domain-adaptation",
+        "model": "ERP-NLQ-router (softmax over hashed n-grams, numpy)",
+        "note": "Offline CPU demo of the training loop + eval on the SAME dataset the "
+                "MiniCPM3-4B LoRA recipe consumes. Trains the NLQ routing head that sits "
+                "in front of the small model; production fine-tune = OpenBMB MiniCPM3-4B LoRA.",
+        "dataset_size": len(data), "train": int(len(tr)), "test": int(len(te)),
+        "n_classes": len(LABELS), "trainable_params": model.n_params,
+        "epochs": epochs,
+        "before_test_accuracy": round(before_acc, 3),
+        "after_test_accuracy": round(after_acc, 3),
+        "accuracy_gain": round(after_acc - before_acc, 3),
+        "routed_sql_exec_rate": exec_rate,
+        "loss_curve": losses[:: max(1, epochs // 40)],
+        "final_loss": losses[-1] if losses else None,
+        "labels": LABELS,
+    }
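
Reviewer note on `ErpNlqRouter.fit`: the update it performs is full-batch gradient descent on L2-regularized softmax cross-entropy. Written out, with $\lambda$ standing for `l2`, $n$ for the number of training rows and $K$ for the number of classes (the code additionally adds `1e-9` inside the logarithm for numerical safety):

```latex
\begin{aligned}
P_{ik} &= \frac{\exp(z_{ik})}{\sum_{j=1}^{K}\exp(z_{ij})}, \qquad z = XW + b,\\
\mathcal{L}(W,b) &= -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} Y_{ik}\log P_{ik}
                    + \lambda \lVert W \rVert_F^{2},\\
\nabla_W \mathcal{L} &= \frac{1}{n}\,X^{\top}(P - Y) + 2\lambda W,\\
\nabla_b \mathcal{L} &= \frac{1}{n}\sum_{i=1}^{n}\,(P_i - Y_i).
\end{aligned}
```

These match the code term by term: `g = (P - Y[idx]) / n`, the weight step `X[idx].T @ g + 2 * l2 * self.W`, and the bias step `g.sum(axis=0)`. Because every epoch uses all rows, the per-epoch permutation does not change the gradient.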

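Reviewer note on the featurizer: the hashing trick used by `_hash_ngrams` can be illustrated without numpy. The sketch below is a simplified assumption, not the commit's exact function: it keeps the word unigram, word bigram and character 3-gram tokens and the MD5 bucket choice, but it only counts tokens and omits the extra negative bucket. `hashed_ngrams` and `cosine` are hypothetical names.

```python
import hashlib
import math


def hashed_ngrams(text, dim=4096):
    """Simplified hashing-trick featurizer: token counts, L2-normalized."""
    text = text.lower()
    words = "".join(c if c.isalnum() else " " for c in text).split()
    tokens = list(words)                                        # word unigrams
    tokens += [f"{a}_{b}" for a, b in zip(words, words[1:])]    # word bigrams
    padded = f" {text} "
    tokens += [padded[i:i + 3] for i in range(len(padded) - 2)]  # char 3-grams
    vec = [0.0] * dim
    for tok in tokens:
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(u, v):
    """Dot product; equals cosine similarity because inputs are unit-norm."""
    return sum(a * b for a, b in zip(u, v))


a = hashed_ngrams("Show monthly spend.")
b = hashed_ngrams("Monthly spend totals?")
c = hashed_ngrams("Which SKUs are below reorder point?")
print(cosine(a, b) > cosine(a, c))
```

Paraphrases of the same template share words and character 3-grams, so they land in many of the same buckets. That overlap is what a linear softmax head can pick up on unseen phrasings.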
backend/app/extraction_heuristics.py CHANGED

@@ -124,12 +124,17 @@ def classify(text: str) -> tuple[str, float]:
         "invoice": sum(k in t for k in ["invoice", "bill to", "amount due", "tax", "subtotal"]),
         "purchase_order": sum(k in t for k in ["purchase order", "p.o.", "po number", "ship to", "buyer"]),
         "contract": sum(k in t for k in ["agreement", "party", "governing law", "term", "whereas", "hereby"]),
-        "receipt": sum(k in t for k in ["receipt", "merchant", "change", "cash", "card ending"]),
+        "receipt": sum(k in t for k in ["receipt", "merchant", "change", "cash", "card ending",
+                                        "register", "payment:", "thank you"]),
         "subscription_memo": sum(k in t for k in ["subscription", "renewal", "billing cycle", "auto-renew", "plan"]),
     }
     # Strong signals override counts.
     if "purchase order" in t or re.search(r"\bp\.?o\.?\s*(number|#|no)", t):
         return "purchase_order", 0.95
+    # A document that says "receipt" but never "invoice" is a receipt — totals/tax
+    # lines alone must not tip it to invoice (real invoices say "Invoice").
+    if "receipt" in t and "invoice" not in t:
+        return "receipt", 0.9
     if "invoice" in t:
         scores["invoice"] += 2
     best = max(scores, key=scores.get)

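Reviewer note on the new receipt override: it matters because a receipt's totals block (subtotal, tax, amount due) scores as invoice keywords. The sketch below is a cut-down assumption of `classify`: `classify_sketch` is a hypothetical name, it uses only three of the keyword lists visible in this hunk, and it returns `None` as the confidence on the keyword-count path because that part of the function is not shown in the diff.

```python
import re

KEYWORDS = {
    "invoice": ["invoice", "bill to", "amount due", "tax", "subtotal"],
    "purchase_order": ["purchase order", "p.o.", "po number", "ship to", "buyer"],
    "receipt": ["receipt", "merchant", "change", "cash", "card ending",
                "register", "payment:", "thank you"],
}


def classify_sketch(text):
    """Return (label, confidence); confidence is None on the count path."""
    t = text.lower()
    scores = {label: sum(k in t for k in kws) for label, kws in KEYWORDS.items()}
    # Strong signals are checked first and short-circuit the counts.
    if "purchase order" in t or re.search(r"\bp\.?o\.?\s*(number|#|no)", t):
        return "purchase_order", 0.95
    if "receipt" in t and "invoice" not in t:
        return "receipt", 0.9
    if "invoice" in t:
        scores["invoice"] += 2
    return max(scores, key=scores.get), None


receipt_text = "STORE RECEIPT\nSubtotal 10.00\nTax 0.83\nAmount due 10.83\nThank you"
invoice_text = "Invoice 12. Bill to: ACME. Payment confirms receipt of goods."
receipt_result = classify_sketch(receipt_text)
invoice_result = classify_sketch(invoice_text)
print(receipt_result, invoice_result)
```

On counts alone the receipt text scores 3 for invoice against 2 for receipt, so the override is what keeps it a receipt. The invoice text mentions "receipt" too, but because it also says "invoice" it falls through to the boosted count.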
backend/app/main.py CHANGED

@@ -94,6 +94,11 @@ class PromptUpdate(BaseModel):
     content: str
+class ErpChatRequest(BaseModel):
+    question: str
+    use_llm: bool = True
 # --- helpers ------------------------------------------------------------------
 def require_admin(request: Request) -> bool:
     if not _check(request.headers.get("authorization"), settings.admin_user, settings.admin_pass):
@@ -162,6 +167,16 @@ def capabilities():
     caps["ocr"] = {**caps.get("ocr", {}), "registry": ocr_registry.info()}
     caps["categories"] = list_categories()
     caps["mode"] = settings.mode
+    from .models_registry import model_catalog
+    mc = model_catalog(settings)
+    caps["models"] = {"max_params_b": mc["max_params_b"], "count": mc["count"],
+                      "available": mc["available"], "labs": [l["lab"] for l in mc["labs"]],
+                      "reasoning_capable": mc.get("reasoning_capable", [])}
+    try:
+        from .erp import get_warehouse
+        caps["erp"] = {"enabled": True, "tables": get_warehouse(settings).table_counts()}
+    except Exception as e:  # never let ERP wiring break capabilities
+        caps["erp"] = {"enabled": False, "error": str(e)}
     return caps
@@ -190,6 +205,83 @@ def ocr_test_report(refresh: bool = False):
     return _ocr_report_cache["report"]
+@app.get("/api/models")
+def models():
+    """Enabled small models (≤32B) from OpenBMB, Cohere, Black Forest Labs."""
+    from .models_registry import model_catalog
+    return model_catalog(settings)
+@app.get("/api/ocr/quality-report")
+def ocr_quality_report(refresh: bool = False, request: Request = None):
+    """OCR output-quality (CER/WER) + document-analysis (field accuracy) per backend.
+    Serves the committed/published report; ?refresh=1 re-runs it (admin only)."""
+    import json as _json
+    if refresh:
+        require_admin(request)
+        from .ocr.quality import run_ocr_quality
+        rep = run_ocr_quality(settings, ocr_registry, router, metrics, db=db, rag_store=rag_store)
+        (settings.writable_dir / "ocr_quality_report.json").write_text(_json.dumps(rep))
+        db.audit("ocr_quality_published", actor=_actor(request),
+                 detail={"best_ocr": rep["best_ocr_quality"]})
+        return rep
+    for p in (settings.writable_dir / "ocr_quality_report.json",
+              settings.eval_report_committed.parent / "ocr_quality_report.json"):
+        if p.exists():
+            return _json.loads(p.read_text())
+    return JSONResponse({"available": False,
+                         "message": "run `python scripts/ocr_quality.py`"}, status_code=200)
+# --- ERP DocIQ (NLQ / analytics / summary / reasons over the ERP knowledgebase) ---
+@app.get("/api/erp/schema")
+def erp_schema():
+    from .erp import get_warehouse
+    from .erp.data import ERP_SCHEMA_DOC, EXAMPLE_QUESTIONS
+    wh = get_warehouse(settings)
+    return {"schema_doc": ERP_SCHEMA_DOC, "tables": wh.table_counts(),
+            "examples": EXAMPLE_QUESTIONS}
+@app.get("/api/erp/reports")
+def erp_reports():
+    """A few canned ERP reports (real data) the chatbot can summarize/explain."""
+    from .erp.chat import (_q_spend_by_month, _q_spend_by_category, _q_top_vendors,
+                           _q_late_vendors, _q_return_reasons)
+    from .erp import get_warehouse
+    wh = get_warehouse(settings)
+    out = {}
+    for name, fn in [("spend_by_month", _q_spend_by_month), ("spend_by_category", _q_spend_by_category),
+                     ("top_vendors", _q_top_vendors), ("late_vendors", _q_late_vendors),
+                     ("return_reasons", _q_return_reasons)]:
+        sql, cols, rows, ans = fn(wh)
+        out[name] = {"columns": cols, "rows": rows, "headline": ans, "sql": sql}
+    return out
+@app.post("/api/erp/chat")
+def erp_chat(req: ErpChatRequest, request: Request = None):
+    """Ask the ERP DocIQ chatbot: NLQ→SQL, analytics, summary, or 'why' reasoning."""
+    from .erp.chat import ErpChat
+    from .erp import get_warehouse
+    chat = ErpChat(settings, router=router, warehouse=get_warehouse(settings),
+                   metrics=metrics, db=db)
+    return chat.answer(req.question, use_llm=req.use_llm, run_id=f"erp-{_actor(request)}")
+@app.get("/api/erp/finetune-report")
+def erp_finetune_report():
+    """Latest fine-tune run (offline domain-adaptation demo + MiniCPM LoRA recipe)."""
+    import json as _json
+    from .config import BACKEND_DIR as _BD
+    for p in (settings.writable_dir / "erp_finetune_report.json",
+              _BD / "finetune" / "erp_finetune_report.json"):
+        if p.exists():
+            return _json.loads(p.read_text())
+    return JSONResponse({"available": False,
+                         "message": "run `python scripts/finetune_erp.py`"}, status_code=200)
 @app.get("/api/samples")
 def samples():
     return {"samples": _list_samples()}

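Reviewer note on the report endpoints: `/api/ocr/quality-report` and `/api/erp/finetune-report` share one lookup pattern. A freshly written report in the writable directory wins, the committed copy is the fallback, and an "unavailable" payload is returned when neither exists. A minimal sketch of that pattern follows, with a hypothetical `load_first_report` helper and temporary paths standing in for the real settings.

```python
import json
import tempfile
from pathlib import Path


def load_first_report(candidates, hint):
    """Return the first existing JSON report, else an 'unavailable' payload."""
    for path in candidates:
        if path.exists():
            return json.loads(path.read_text())
    return {"available": False, "message": hint}


with tempfile.TemporaryDirectory() as tmp:
    # Assumed layout: a runtime-writable copy and a committed copy.
    writable = Path(tmp) / "writable" / "erp_finetune_report.json"
    committed = Path(tmp) / "committed" / "erp_finetune_report.json"

    # Neither file exists yet.
    missing = load_first_report([writable, committed], "run the fine-tune script")

    # Only the committed copy exists.
    committed.parent.mkdir(parents=True)
    committed.write_text(json.dumps({"source": "committed"}))
    fallback = load_first_report([writable, committed], "run the fine-tune script")

    # A fresh run has written the writable copy, which takes precedence.
    writable.parent.mkdir(parents=True)
    writable.write_text(json.dumps({"source": "fresh"}))
    preferred = load_first_report([writable, committed], "run the fine-tune script")
```

Keeping the lookup in one helper would let both endpoints share the precedence order instead of repeating the loop.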
backend/app/models_registry.py ADDED

	@@ -0,0 +1,102 @@

+"""Enabled model catalogue — small models (≤32B params) from three labs.
+The "Build Small" constraint is ≤32B parameters; every model below also fits in ≤32 GB
+of memory at a sensible quantization. Availability is computed from config/deps so the
+UI can show which are actually live.
+  • OpenBMB (MiniCPM)        — vision-language OCR + text reasoning  (the OCR/IDP engine)
+  • Cohere (Aya-Vision)      — vision-language OCR / VQA
+  • Black Forest Labs (FLUX) — image GENERATION → synthetic test documents (not an OCR
+                               model; used to stress-test OCR quality)
+"""
+from __future__ import annotations
+import importlib.util
+MAX_PARAMS_B = 32  # hackathon "small models" guardrail
+def _has(mod: str) -> bool:
+    return importlib.util.find_spec(mod) is not None
+def model_catalog(settings) -> dict:
+    minicpm_api = bool(settings.minicpm_base_url)
+    transformers = _has("transformers")
+    cohere_enabled = transformers and __import__("os").getenv("COHERE_OCR_ENABLE", "").lower() in {"1", "true", "yes"}
+    cohere_api = bool(getattr(settings, "cohere_api_key", None))
+    diffusers = _has("diffusers")
+    bfl_api = bool(settings.bfl_api_key)
+    labs = [
+        {
+            "lab": "OpenBMB", "homepage": "https://github.com/OpenBMB/MiniCPM",
+            "models": [
+                {"name": "MiniCPM-V-4.6", "id": settings.minicpm_model, "params_b": 8.0,
+                 "size_gb_int4": 5.5, "modality": "vision-language (OCR + reasoning)",
+                 "role": "OCR backend + LLM extractor", "license": "Apache-2.0 (weights: MiniCPM Model License)",
+                 "available": minicpm_api or transformers,
+                 "enable": "MINICPM_BASE_URL (+ MINICPM_API_KEY) OR pip install transformers"},
+                {"name": "MiniCPM-o-4.5", "id": "MiniCPM-o-4.5", "params_b": 8.0,
+                 "size_gb_int4": 5.5, "modality": "omni (vision/audio) VLM",
+                 "role": "alt OCR/VLM", "license": "MiniCPM Model License",
+                 "available": minicpm_api, "enable": "same OpenAI-compatible endpoint"},
+                {"name": "MiniCPM3-4B", "id": settings.openbmb_reasoner_model, "params_b": 4.0,
+                 "size_gb_int4": 2.8, "modality": "text LLM (reasoning + function-calling, 32k ctx)",
+                 "role": "ERP reasoning · NLQ→SQL · report summarization (fine-tune target)",
+                 "license": "Apache-2.0 (weights: MiniCPM Model License)",
+                 "available": minicpm_api or transformers,
+                 "enable": "OpenAI-compatible endpoint (OPENBMB_REASONER_MODEL) OR pip install transformers"},
+            ],
+        },
+        {
+            "lab": "Cohere", "homepage": "https://huggingface.co/CohereLabs",
+            "models": [
+                {"name": "Aya-Vision-8B", "id": settings.cohere_ocr_model, "params_b": 8.0,
+                 "size_gb_int4": 6.0, "modality": "vision-language (OCR/VQA, 23 langs)",
+                 "role": "OCR backend", "license": "CC-BY-NC 4.0",
+                 "available": cohere_enabled,
+                 "enable": "pip install transformers torch + COHERE_OCR_ENABLE=true"},
+                {"name": "Aya-Vision-32B", "id": "CohereLabs/aya-vision-32b", "params_b": 32.0,
+                 "size_gb_int4": 18.0, "modality": "vision-language (OCR/VQA)",
+                 "role": "alt OCR backend (max-quality small)", "license": "CC-BY-NC 4.0",
+                 "available": cohere_enabled,
+                 "enable": "COHERE_OCR_MODEL=CohereLabs/aya-vision-32b + COHERE_OCR_ENABLE=true"},
+                {"name": "Command R7B", "id": "command-r7b-12-2024", "params_b": 7.0,
+                 "size_gb_int4": 5.0, "modality": "text LLM (RAG + tool-use + reasoning, 128k ctx)",
+                 "role": "ERP RAG · NLQ · grounded reasoning", "license": "CC-BY-NC 4.0",
+                 "available": cohere_api,
+                 "enable": "COHERE_API_KEY (api.cohere.com) OR weights CohereLabs/c4ai-command-r7b-12-2024"},
+            ],
+        },
+        {
+            "lab": "Black Forest Labs", "homepage": "https://github.com/black-forest-labs/flux",
+            "models": [
+                {"name": "FLUX.1 [dev]", "id": settings.bfl_model, "params_b": 12.0,
+                 "size_gb_int4": 12.0, "modality": "text-to-image GENERATION",
+                 "role": "synthetic test-document generator (not OCR)", "license": "FLUX.1-dev Non-Commercial",
+                 "available": bfl_api or diffusers,
+                 "enable": "BFL_API_KEY (api.bfl.ml) OR pip install diffusers torch"},
+                {"name": "FLUX.1 [schnell]", "id": "flux-schnell", "params_b": 12.0,
+                 "size_gb_int4": 12.0, "modality": "text-to-image GENERATION (fast)",
+                 "role": "synthetic test-document generator", "license": "Apache-2.0",
+                 "available": bfl_api or diffusers, "enable": "BFL_API_KEY OR pip install diffusers"},
+            ],
+        },
+    ]
+    # guardrail: nothing exceeds the small-model size limit
+    for lab in labs:
+        for m in lab["models"]:
+            assert m["params_b"] <= MAX_PARAMS_B, f"{m['name']} exceeds {MAX_PARAMS_B}B"
+    flat = [{"lab": lab["lab"], **m} for lab in labs for m in lab["models"]]
+    return {
+        "max_params_b": MAX_PARAMS_B,
+        "labs": labs,
+        "available": [m["name"] for m in flat if m["available"]],
+        "ocr_capable": [m["name"] for m in flat if "OCR" in m["modality"] or "vision" in m["modality"]],
+        "reasoning_capable": [m["name"] for m in flat
+                              if "reasoning" in m["modality"] or "text LLM" in m["modality"]],
+        "count": len(flat),
+    }
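The catalogue above is flattened into one list and then filtered by substring matches on `modality`. The sketch below reproduces that filtering on three entries whose `name`, `params_b` and `modality` values are copied from the registry; the `available` flags are hypothetical. One deliberate difference, flagged as an assumption: the size guardrail raises `ValueError` instead of using `assert`, because assertions are skipped when Python runs with `-O`.

```python
MAX_PARAMS_B = 32  # same guardrail value as the registry

# Illustrative subset of the catalogue; the `available` flags are made up.
LABS = [
    {"lab": "OpenBMB", "models": [
        {"name": "MiniCPM3-4B", "params_b": 4.0, "available": True,
         "modality": "text LLM (reasoning + function-calling, 32k ctx)"}]},
    {"lab": "Cohere", "models": [
        {"name": "Aya-Vision-8B", "params_b": 8.0, "available": False,
         "modality": "vision-language (OCR/VQA, 23 langs)"}]},
    {"lab": "Black Forest Labs", "models": [
        {"name": "FLUX.1 [schnell]", "params_b": 12.0, "available": False,
         "modality": "text-to-image GENERATION (fast)"}]},
]


def summarize(labs: list) -> dict:
    """Flatten lab -> models and derive the capability lists."""
    flat = [{"lab": lab["lab"], **m} for lab in labs for m in lab["models"]]
    oversized = [m["name"] for m in flat if m["params_b"] > MAX_PARAMS_B]
    if oversized:
        raise ValueError(f"models exceed {MAX_PARAMS_B}B: {oversized}")
    return {
        "available": [m["name"] for m in flat if m["available"]],
        "ocr_capable": [m["name"] for m in flat
                        if "OCR" in m["modality"] or "vision" in m["modality"]],
        "reasoning_capable": [m["name"] for m in flat
                              if "reasoning" in m["modality"] or "text LLM" in m["modality"]],
        "count": len(flat),
    }


summary = summarize(LABS)
```

The image generator lands in neither capability list, which matches the docstring's point that FLUX is not an OCR model.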

backend/app/ocr/quality.py ADDED

	@@ -0,0 +1,207 @@

+"""OCR output-quality + document-analysis benchmark.
+For each available OCR backend, over a set of scanned samples with ground truth, we
+measure two quality dimensions and capture logs + metrics:
+  1. OCR text quality  — Character Error Rate (CER) and Word Error Rate (WER) of the
+     transcribed text vs a reference (the `.txt` sidecar that ships with each scan).
+  2. Document-analysis quality — field exact-match and F1 of the FULL pipeline
+     (OCR → classify → extract → validate) vs the document's ground-truth JSON.
+Plus latency, cost, model name/size. Results are published (file + endpoint + optional
+HF upload). Pure-Python edit distance, no extra deps.
+"""
+from __future__ import annotations
+import json
+import re
+import time
+from pathlib import Path
+from ..observability import log_event
+# scanned samples that have BOTH a .txt sidecar (reference text) and a gt.json, AND
+# that genuinely exercise each OCR engine independently (different CER per backend).
+DEFAULT_SAMPLES = [
+    "invoice_scanned_basic", "receipt_scanned", "po_scanned",
+    "contract_scanned", "subscription_memo_scanned",
+]
+# field-accuracy-only (no sidecar reference text). The extreme tier
+# (scripts/generate_extreme_docs.py — perspective photo, image collage, degraded fax) is a
+# VISION-extraction stress set: on those images the OCR engines fall back to a shared text
+# source (identical CER across backends), so they are excluded from the per-backend CER
+# benchmark and scored on field accuracy only. complex_invoice_messy requires a real VLM.
+FIELD_ONLY_SAMPLES = ["complex_invoice_messy"]
+def _norm(s: str) -> str:
+    return re.sub(r"\s+", " ", (s or "").strip().lower())
+def _lev(a, b) -> int:
+    """Levenshtein distance over any sequences (str or list)."""
+    if a == b:
+        return 0
+    la, lb = len(a), len(b)
+    if la == 0:
+        return lb
+    if lb == 0:
+        return la
+    prev = list(range(lb + 1))
+    for i in range(1, la + 1):
+        cur = [i]
+        ca = a[i - 1]
+        for j in range(1, lb + 1):
+            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != b[j - 1])))
+        prev = cur
+    return prev[lb]
+def cer(hyp: str, ref: str):
+    ref, hyp = _norm(ref), _norm(hyp)
+    if not ref:
+        return None
+    return round(min(1.0, _lev(hyp, ref) / len(ref)), 4)
+def wer(hyp: str, ref: str):
+    rw, hw = _norm(ref).split(), _norm(hyp).split()
+    if not rw:
+        return None
+    return round(min(1.0, _lev(hw, rw) / len(rw)), 4)
+def _model_size(settings, backend_name: str):
+    """Map an OCR backend to its model name + size from the model registry."""
+    from ..models_registry import model_catalog
+    cat = model_catalog(settings)
+    flat = [{**m, "lab": lab["lab"]} for lab in cat["labs"] for m in lab["models"]]
+    if backend_name == "minicpm":
+        m = next((x for x in flat if x["name"].startswith("MiniCPM-V")), None)
+    elif backend_name == "cohere":
+        m = next((x for x in flat if x["name"].startswith("Aya-Vision-8")), None)
+    else:
+        m = None
+    if m:
+        return {"model": m["name"], "params_b": m["params_b"], "size_gb": m["size_gb_int4"], "lab": m["lab"]}
+    return {"model": backend_name, "params_b": None, "size_gb": None, "lab": "classic"}
+def run_ocr_quality(settings, ocr_registry, router, metrics, db=None, rag_store=None,
+                    samples=None) -> dict:
+    from ..pipeline import process_document
+    from evals import scorers
+    samples = samples or DEFAULT_SAMPLES
+    ds = settings.evals_dataset_dir
+    available = [n for n in ocr_registry.available_names() if n != "sidecar"] + ["sidecar"]
+    log_event("info", "OCR quality benchmark started",
+              backends=available, samples=samples + FIELD_ONLY_SAMPLES)
+    backend_rows = []
+    for bname in available:
+        backend = ocr_registry.get(bname)
+        if not backend or not backend.available():
+            continue
+        cers, wers, exacts, f1s, lats, costs = [], [], [], [], [], []
+        per_sample = []
+        for sid in samples + FIELD_ONLY_SAMPLES:
+            doc = _find(ds, sid)
+            if not doc:
+                continue
+            gt = _load_gt(ds, sid)
+            t0 = time.perf_counter()
+            run = process_document(doc, router=router, settings=settings, metrics=metrics,
+                                   ocr_registry=ocr_registry, ocr_backend=bname,
+                                   db=db, rag_store=rag_store, doc_id=f"q-{bname}-{sid}",
+                                   mode="quality")
+            st = run["_state"]
+            ocr_text = (st.get("ocr") or {}).get("text", "")
+            ref = _ref_text(ds, sid)
+            c = cer(ocr_text, ref) if ref else None
+            w = wer(ocr_text, ref) if ref else None
+            score = scorers.score_document(st.get("extracted") or {},
+                                           {k: v for k, v in gt.items() if not k.startswith("_")}) if gt else {}
+            case = {"sample": sid, "cer": c, "wer": w,
+                    "field_exact": score.get("exact_match"), "field_f1": score.get("field_f1"),
+                    "latency_ms": round((time.perf_counter() - t0) * 1000, 1),
+                    "cost_usd": run.get("total_cost_usd", 0.0),
+                    "confidence": st.get("confidence")}
+            per_sample.append(case)
+            if c is not None:
+                cers.append(c); wers.append(w)
+            if score.get("exact_match") is not None:
+                exacts.append(score["exact_match"]); f1s.append(score["field_f1"] or 0)
+            lats.append(case["latency_ms"]); costs.append(case["cost_usd"])
+            log_event("info", f"OCR quality: {bname} on {sid}",
+                      cer=c, wer=w, field_exact=score.get("exact_match"))
+        size = _model_size(settings, bname)
+        row = {
+            "backend": bname, **size,
+            "is_reference": bname == "sidecar",
+            "cer": _avg(cers), "wer": _avg(wers),
+            "field_exact_match": _avg(exacts), "field_f1": _avg(f1s),
+            "avg_latency_ms": _avg(lats), "avg_cost_usd": round(_avg(costs) or 0, 6),
+            "samples_scored": len(per_sample), "per_sample": per_sample,
+        }
+        backend_rows.append(row)
+    # rank: best OCR text quality (lowest CER among real engines), best analysis (highest exact)
+    real = [r for r in backend_rows if not r["is_reference"] and r["cer"] is not None]
+    best_ocr = min(real, key=lambda r: r["cer"])["backend"] if real else None
+    # rank only real engines for "best analysis" — sidecar is the reference text, not a
+    # competing OCR backend, so it must never be crowned best.
+    scored = [r for r in backend_rows
+              if not r["is_reference"] and r["field_exact_match"] is not None]
+    best_analysis = max(scored, key=lambda r: r["field_exact_match"])["backend"] if scored else None
+    report = {
+        "generated_at": time.time(),
+        "note": "CER/WER vs .txt sidecar reference; field accuracy vs gt.json. "
+                "sidecar = reference text source (CER≈0 by construction).",
+        "models": _model_size_table(settings),
+        "backends": backend_rows,
+        "best_ocr_quality": best_ocr,
+        "best_document_analysis": best_analysis,
+    }
+    if db is not None:
+        try:
+            db.audit("ocr_quality_benchmark",
+                     detail={"best_ocr": best_ocr, "best_analysis": best_analysis,
+                             "backends": [r["backend"] for r in backend_rows]})
+        except Exception:
+            pass
+    log_event("info", "OCR quality benchmark complete",
+              best_ocr=best_ocr, best_analysis=best_analysis)
+    return report
+def _model_size_table(settings):
+    from ..models_registry import model_catalog
+    return [{"lab": lab["lab"], **{k: m[k] for k in ("name", "params_b", "size_gb_int4", "modality", "role", "available")}}
+            for lab in model_catalog(settings)["labs"] for m in lab["models"]]
+def _avg(xs):
+    xs = [x for x in xs if x is not None]
+    return round(sum(xs) / len(xs), 4) if xs else None
+def _find(ds: Path, sid: str):
+    for ext in (".png", ".jpg", ".jpeg", ".pdf"):
+        p = ds / f"{sid}{ext}"
+        if p.exists():
+            return p
+    return None
+def _ref_text(ds: Path, sid: str):
+    p = ds / f"{sid}.txt"
+    return p.read_text(encoding="utf-8", errors="ignore") if p.exists() else None
+def _load_gt(ds: Path, sid: str):
+    p = ds / f"{sid}.gt.json"
+    return json.loads(p.read_text()) if p.exists() else None
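As a worked example of the two text metrics defined in this module, the sketch below scores a single line in which an OCR engine reads the letter O as a zero (`P0-100483` for `PO-100483`). It is a standalone re-implementation for illustration, not the module's code: the function names are hypothetical and the rounding step is omitted.

```python
import re


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace, as the benchmark does before scoring."""
    return re.sub(r"\s+", " ", (s or "").strip().lower())


def levenshtein(a, b) -> int:
    """Edit distance over any two sequences (characters or word lists)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        cur = [i]
        for j, item_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (item_a != item_b)))  # substitution
        prev = cur
    return prev[len(b)]


def char_error_rate(hyp: str, ref: str) -> float:
    ref_n, hyp_n = normalize(ref), normalize(hyp)
    return min(1.0, levenshtein(hyp_n, ref_n) / len(ref_n))


def word_error_rate(hyp: str, ref: str) -> float:
    ref_w, hyp_w = normalize(ref).split(), normalize(hyp).split()
    return min(1.0, levenshtein(hyp_w, ref_w) / len(ref_w))


reference = "Purchase Order Number: PO-100483"
hypothesis = "Purchase Order Number: P0-100483"  # letter O misread as zero

cer_value = char_error_rate(hypothesis, reference)  # 1 edit over 32 characters
wer_value = word_error_rate(hypothesis, reference)  # 1 wrong word out of 4
```

One substituted character costs about 3% CER but 25% WER, which is why the report carries both: WER is much more sensitive to a single corrupted identifier.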

backend/app/pipeline/nodes.py CHANGED

@@ -139,14 +139,30 @@ def classify_node(state: dict, ctx: PipelineContext) -> dict:
     )
     resp = ctx.router.run(req, ctx.run_id)
     parsed = _parse_json(resp.text)
-    doc_type = state.get("forced_doc_type") or parsed.get("doc_type", "invoice")
     if doc_type not in SCHEMA_BY_TYPE:
         doc_type = "invoice"
-    conf = float(parsed.get("confidence", 0.6) or 0.6)
     return {
         "doc_type": doc_type,
         "classify_confidence": conf,
-        "_summary": f"Classified as '{doc_type}' (conf={conf:.2f}) via {resp.model}",
     }
@@ -260,6 +276,14 @@ def normalize_node(state: dict, ctx: PipelineContext) -> dict:
     resp = ctx.router.run(req, ctx.run_id)
     normalized = _parse_json(resp.text) or extracted
     normalized = _coerce_to_schema(normalized, state["doc_type"])
     return {
         "normalized": normalized,
         "_summary": f"Normalized dates/currency/amounts via {resp.model}",

     )
     resp = ctx.router.run(req, ctx.run_id)
     parsed = _parse_json(resp.text)
+    llm_type = parsed.get("doc_type", "invoice")
+    conf = float(parsed.get("confidence", 0.6) or 0.6)
+    # Cross-check the LLM against the deterministic keyword heuristic. Small VLMs
+    # (8B) reliably confuse receipts/POs with invoices because totals+tax lines
+    # look alike; the heuristic keys off unambiguous markers ("receipt", "purchase
+    # order"). When the heuristic is highly confident and disagrees, it wins.
+    note = ""
+    try:
+        from ..extraction_heuristics import classify as _heur
+        h_type, h_conf = _heur(text)
+        if h_type in SCHEMA_BY_TYPE and h_type != llm_type and h_conf >= 0.85:
+            note = f" (heuristic override: {llm_type}→{h_type})"
+            llm_type, conf = h_type, max(conf, h_conf)
+    except Exception:
+        pass
+    doc_type = state.get("forced_doc_type") or llm_type
     if doc_type not in SCHEMA_BY_TYPE:
         doc_type = "invoice"
     return {
         "doc_type": doc_type,
         "classify_confidence": conf,
+        "_summary": f"Classified as '{doc_type}' (conf={conf:.2f}) via {resp.model}{note}",
     }
     resp = ctx.router.run(req, ctx.run_id)
     normalized = _parse_json(resp.text) or extracted
     normalized = _coerce_to_schema(normalized, state["doc_type"])
+    # Deterministic date hygiene: small LLMs sometimes leave a stray time component
+    # ("2026-06-02 14:37") or a non-ISO format on date fields — coerce to YYYY-MM-DD.
+    from ..extraction_heuristics import normalize_date as _nd
+    for k, v in list(normalized.items()):
+        if "date" in k and isinstance(v, str) and v:
+            iso = _nd(v)
+            if iso:
+                normalized[k] = iso
     return {
         "normalized": normalized,
         "_summary": f"Normalized dates/currency/amounts via {resp.model}",
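The two additions to `nodes.py` can be sketched in isolation: the heuristic overrides the LLM label only when it names a known type, disagrees, and reports confidence of at least 0.85, and date fields are coerced to `YYYY-MM-DD`. In the sketch below, `KNOWN_TYPES` stands in for the keys of `SCHEMA_BY_TYPE`, and `to_iso_date` with its format list is a hypothetical substitute for `normalize_date`, whose real implementation is not shown in this diff.

```python
from datetime import datetime

# Assumed stand-in for the keys of SCHEMA_BY_TYPE.
KNOWN_TYPES = {"invoice", "purchase_order", "contract", "receipt", "subscription_memo"}
OVERRIDE_THRESHOLD = 0.85  # same threshold as the classify node

# Hypothetical input formats; the real normalize_date may accept others.
DATE_FORMATS = ("%Y-%m-%d %H:%M", "%Y-%m-%d", "%m/%d/%Y")


def reconcile(llm_type: str, llm_conf: float, heur_type: str, heur_conf: float):
    """Return (doc_type, confidence, overridden) after the cross-check."""
    if (heur_type in KNOWN_TYPES and heur_type != llm_type
            and heur_conf >= OVERRIDE_THRESHOLD):
        return heur_type, max(llm_conf, heur_conf), True
    return llm_type, llm_conf, False


def to_iso_date(value: str):
    """Coerce a date string to YYYY-MM-DD, or return None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_dates(record: dict) -> dict:
    """Rewrite string fields whose key contains 'date'; leave the rest untouched."""
    out = dict(record)
    for key, value in record.items():
        if "date" in key and isinstance(value, str) and value:
            iso = to_iso_date(value)
            if iso:
                out[key] = iso
    return out
```

Values that cannot be parsed are left as they were, so the cleanup never discards what the model extracted.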

backend/app/prompts/__init__.py CHANGED

@@ -10,13 +10,21 @@ into these prompts.
 """
 from __future__ import annotations
-PROMPT_VERSION = "v1"
 CLASSIFY_SYSTEM = """You are a document classification assistant for an enterprise \
 accounts-payable pipeline.
 Given the first portion of a document, classify it as exactly one of:
-  invoice | purchase_order | contract | receipt | other
 Return ONLY a JSON object, no prose:
   {"doc_type": "<one of the above>", "confidence": <0.0-1.0>, "language": "<iso-639-1>"}

 """
 from __future__ import annotations
+PROMPT_VERSION = "v2"  # v2: receipt/invoice distinctions + subscription_memo in classify
 CLASSIFY_SYSTEM = """You are a document classification assistant for an enterprise \
 accounts-payable pipeline.
 Given the first portion of a document, classify it as exactly one of:
+  invoice | purchase_order | contract | receipt | subscription_memo | other
+Distinctions that matter:
+- receipt = proof of a COMPLETED payment: merchant/store header, register or
+  receipt number, payment method (VISA/cash/card), often "thank you". No due date.
+- invoice = a REQUEST for future payment: invoice number, due date, bill-to,
+  remit-to. The word "Invoice" itself usually appears.
+- purchase_order = a buyer ordering goods: PO number, ship-to, delivery date.
+- subscription_memo = recurring service notice: plan, billing cycle, renewal date.
 Return ONLY a JSON object, no prose:
   {"doc_type": "<one of the above>", "confidence": <0.0-1.0>, "language": "<iso-639-1>"}
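The prompt asks for a bare JSON object, but small models sometimes wrap it in prose or return a label outside the allowed set. A minimal sketch of a defensive parser for that reply follows. `parse_classification` is a hypothetical helper, not the pipeline's `_parse_json`, and the fallbacks it applies (`"other"`, confidence clamped to 0..1, language `"und"`) are assumptions made for this example; the classify node in this commit falls back to `"invoice"` instead.

```python
import json

# Labels listed in the v2 classify prompt.
ALLOWED = ("invoice", "purchase_order", "contract", "receipt",
           "subscription_memo", "other")


def parse_classification(raw: str) -> dict:
    """Extract the outermost {...} span from a model reply and validate it."""
    start, end = raw.find("{"), raw.rfind("}")
    data = {}
    if start != -1 and end > start:
        try:
            data = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            data = {}
    if not isinstance(data, dict):
        data = {}

    doc_type = data.get("doc_type")
    if doc_type not in ALLOWED:
        doc_type = "other"  # assumed fallback for this sketch

    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = min(1.0, max(0.0, confidence))

    return {"doc_type": doc_type, "confidence": confidence,
            "language": data.get("language") or "und"}
```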

backend/app/providers/blackforest.py ADDED

	@@ -0,0 +1,70 @@

+"""Black Forest Labs (FLUX) — image GENERATION provider.
+FLUX is a ≤12B text-to-image model — NOT an OCR/VLM. In this pipeline its role is to
+generate **synthetic test documents** (e.g. a noisy receipt photo) that we then run the
+OCR backends against, to grow the quality benchmark beyond hand-built samples.
+Backends (gated, graceful):
+  • BFL hosted API (api.bfl.ml)   — set BFL_API_KEY
+  • local diffusers               — pip install diffusers torch (heavy)
+"""
+from __future__ import annotations
+import importlib.util
+import json
+import time
+import urllib.request
+class BlackForestProvider:
+    name = "blackforest"
+    def __init__(self, api_key: str | None, model: str = "flux-dev") -> None:
+        self.api_key = api_key
+        self.model = model
+    def available(self) -> bool:
+        return bool(self.api_key) or importlib.util.find_spec("diffusers") is not None
+    def generate_document(self, prompt: str, width: int = 1024, height: int = 1408,
+                          timeout: int = 60) -> bytes:
+        """Return PNG bytes of a generated synthetic document image."""
+        if self.api_key:
+            return self._via_api(prompt, width, height, timeout)
+        if importlib.util.find_spec("diffusers"):
+            return self._via_diffusers(prompt, width, height)
+        raise RuntimeError("Black Forest Labs not enabled (set BFL_API_KEY or install diffusers)")
+    def _via_api(self, prompt, width, height, timeout) -> bytes:
+        from .minicpm_llm import _parse_json  # reuse robust JSON parse
+        from ..ocr.backends.minicpm import _ssl_context
+        body = json.dumps({"prompt": prompt, "width": width, "height": height}).encode()
+        req = urllib.request.Request(
+            f"https://api.bfl.ml/v1/{self.model}", data=body,
+            headers={"Content-Type": "application/json", "x-key": self.api_key})
+        with urllib.request.urlopen(req, timeout=timeout, context=_ssl_context()) as r:
+            poll_id = json.loads(r.read().decode())["id"]
+        # poll for the result
+        deadline = time.time() + timeout
+        while time.time() < deadline:
+            pr = urllib.request.Request(f"https://api.bfl.ml/v1/get_result?id={poll_id}",
+                                        headers={"x-key": self.api_key})
+            with urllib.request.urlopen(pr, timeout=timeout, context=_ssl_context()) as r:
+                res = json.loads(r.read().decode())
+            if res.get("status") == "Ready":
+                url = res["result"]["sample"]
+                with urllib.request.urlopen(url, timeout=timeout, context=_ssl_context()) as img:
+                    return img.read()
+            time.sleep(1.5)
+        raise TimeoutError("FLUX generation timed out")
+    def _via_diffusers(self, prompt, width, height) -> bytes:
+        import io
+        from diffusers import FluxPipeline
+        import torch
+        pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell",
+                                            torch_dtype=torch.bfloat16)
+        img = pipe(prompt, width=width, height=height, num_inference_steps=4).images[0]
+        buf = io.BytesIO()
+        img.save(buf, format="PNG")
+        return buf.getvalue()
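The submit-then-poll loop in `_via_api` can be isolated from the HTTP calls so it is testable without network access. The sketch below is one way to do that, under stated assumptions: `poll_until_ready` and its parameters are hypothetical, the `fetch` callable stands in for the `get_result` request, and the set of failure statuses is left to the caller because this diff only shows the `"Ready"` status. It also uses `time.monotonic` rather than `time.time`, so a wall-clock adjustment cannot stretch or shorten the deadline.

```python
import time
from typing import Callable, Iterable


def poll_until_ready(fetch: Callable[[], dict], timeout: float,
                     interval: float = 1.5,
                     failed_statuses: Iterable[str] = (),
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch() until it reports status 'Ready' or the deadline passes."""
    failed = set(failed_statuses)
    deadline = clock() + timeout
    while clock() < deadline:
        result = fetch()
        status = result.get("status")
        if status == "Ready":
            return result
        if status in failed:
            # Stop early instead of polling a job that will never finish.
            raise RuntimeError(f"generation failed with status {status!r}")
        sleep(interval)
    raise TimeoutError("generation timed out")


# Usage with a fake job that becomes ready on the third poll.
_replies = iter([{"status": "Pending"}, {"status": "Pending"},
                 {"status": "Ready", "result": {"sample": "https://example.invalid/x.png"}}])
ready = poll_until_ready(lambda: next(_replies), timeout=10, sleep=lambda s: None)
```

Injecting `clock` and `sleep` lets a test simulate minutes of waiting instantly.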

backend/evals/datasets/extreme_contract_fax.gt.json ADDED

	@@ -0,0 +1,20 @@

+{
+  "doc_type": "contract",
+  "contract_number": "MSA-2026-0481",
+  "title": "Master Services Agreement - Store Fit-Out Program",
+  "party_a": "Aperture Retail Group",
+  "party_b": "Halcyon Build Partners LLC",
+  "effective_date": "2026-03-01",
+  "expiration_date": "2029-02-28",
+  "contract_value": 1250000.0,
+  "currency": "USD",
+  "governing_law": "State of Ohio",
+  "auto_renew": false,
+  "termination_notice_days": 60,
+  "_meta": {
+    "doc_type": "contract",
+    "channel": "fax",
+    "difficulty": "extreme",
+    "skip_eval": true
+  }
+}

backend/evals/datasets/extreme_contract_fax.png ADDED

Git LFS Details

SHA256: 312bded6c42d167ad483382c9e5ba5363e65758fc4d25c03aa8478bc9963943e
Pointer size: 132 Bytes
Size of remote file: 1.8 MB

backend/evals/datasets/extreme_contract_fax.txt ADDED

	@@ -0,0 +1,27 @@

+MASTER SERVICES AGREEMENT - STORE FIT-OUT PROGRAM
+Contract No: MSA-2026-0481
+Party A: Aperture Retail Group   Party B: Halcyon Build Partners LLC
+Effective Date: 2026-03-01   Expiration Date: 2029-02-28
+Total Contract Value: USD 1,250,000.00   Governing Law: State of Ohio
+Auto-Renewal: NO   Termination Notice: 60 days written notice
+1. SCOPE. Contractor shall furnish all labor, materials, supervision and
+equipment required for the fit-out of retail premises identified in each
+Statement of Work executed under this Agreement.
+2. TERM. This Agreement commences on the Effective Date and continues
+until the Expiration Date unless terminated earlier per Section 9.
+3. COMPENSATION. Client shall pay Contractor fees not to exceed the
+Total Contract Value, payable per approved milestone invoices Net 30.
+4. CHANGE ORDERS. No variation is binding unless documented in a
+written change order signed by both parties' authorized representatives.
+5. WARRANTIES. Contractor warrants workmanship free of defects for
+twenty-four (24) months following practical completion of each site.
+6. INSURANCE. Contractor shall maintain commercial general liability
+coverage of not less than USD 5,000,000 per occurrence.
+7. CONFIDENTIALITY. Each party shall protect Confidential Information
+with no less than reasonable care and use it solely for this Agreement.
+8. LIABILITY. Neither party is liable for indirect or consequential
+damages; aggregate liability is capped at the Total Contract Value.
+9. TERMINATION. Either party may terminate for convenience upon sixty
+(60) days written notice, or immediately for uncured material breach.
+10. GOVERNING LAW. This Agreement is governed by the laws of the
+State of Ohio, excluding its conflict of law provisions.

backend/evals/datasets/extreme_po_collage.gt.json ADDED

	@@ -0,0 +1,40 @@

+{
+  "doc_type": "purchase_order",
+  "order_number": "PO-77RX-3309",
+  "order_date": "2026-05-21",
+  "delivery_date": "2026-06-15",
+  "vendor_name": "Nordic Fixture Works AB",
+  "buyer_name": "Aperture Retail Group",
+  "ship_to": "DC-7, 4420 Logistics Pkwy, Columbus OH",
+  "currency": "USD",
+  "payment_terms": "Net 45",
+  "subtotal": 9600.0,
+  "tax_amount": 792.0,
+  "total": 10392.0,
+  "line_items": [
+    {
+      "description": "SHELF UNIT S-200 heavy gauge",
+      "quantity": 24,
+      "unit_price": 189.0,
+      "line_total": 4536.0
+    },
+    {
+      "description": "LED STRIP 2m retail white",
+      "quantity": 60,
+      "unit_price": 22.4,
+      "line_total": 1344.0
+    },
+    {
+      "description": "ENDCAP DISPLAY birch finish",
+      "quantity": 12,
+      "unit_price": 310.0,
+      "line_total": 3720.0
+    }
+  ],
+  "_meta": {
+    "doc_type": "purchase_order",
+    "channel": "scanned",
+    "difficulty": "extreme",
+    "skip_eval": true
+  }
+}

backend/evals/datasets/extreme_po_collage.png ADDED

Git LFS Details

SHA256: 78dd797bfa834ef942c6a2d25bec3aa0ced5d21e5dfe45252f55be95a1ef989a
Pointer size: 132 Bytes
Size of remote file: 3.67 MB

backend/evals/datasets/extreme_po_collage.txt ADDED

	@@ -0,0 +1,20 @@

+PURCHASE ORDER
+Nordic Fixture Works AB
+Industrigatan 14, Malmo SE
+PO Number: PO-77RX-3309
+Order Date: 2026-05-21
+Delivery Date: 2026-06-15
+Payment Terms: Net 45
+Currency: USD
+Buyer: Aperture Retail Group
+Ship To: DC-7, 4420 Logistics Pkwy, Columbus OH
+IMG DESCRIPTION QTY UNIT USD AMOUNT
+SHELF UNIT S-200 heavy gauge 24 189.00 4,536.00
+LED STRIP 2m retail white 60 22.40 1,344.00
+ENDCAP DISPLAY birch finish 12 310.00 3,720.00
+Subtotal: 9,600.00
+Tax 8.25%: 792.00
+TOTAL: 10,392.00 USD
+*PO77RX3309*
+APPROVED · OPS DESK
+Authorized — K. Lindqvist, Procurement

backend/evals/datasets/extreme_receipt_photo.gt.json ADDED

	@@ -0,0 +1,36 @@

+{
+  "doc_type": "receipt",
+  "merchant": "BREW & BEAN COFFEE Co.",
+  "date": "2026-06-02",
+  "currency": "USD",
+  "subtotal": 30.75,
+  "tax_amount": 2.71,
+  "total": 33.46,
+  "payment_method": "VISA ****4421",
+  "line_items": [
+    {
+      "description": "Flat White",
+      "quantity": 2,
+      "unit_price": 4.75,
+      "line_total": 9.5
+    },
+    {
+      "description": "Butter Croissant",
+      "quantity": 3,
+      "unit_price": 3.25,
+      "line_total": 9.75
+    },
+    {
+      "description": "Cold Brew Growler",
+      "quantity": 1,
+      "unit_price": 14.0,
+      "line_total": 14.0
+    }
+  ],
+  "_meta": {
+    "doc_type": "receipt",
+    "channel": "photo",
+    "difficulty": "extreme",
+    "skip_eval": true
+  }
+}

backend/evals/datasets/extreme_receipt_photo.png ADDED

Git LFS Details

SHA256: cbc2ac82892034a3ec5396ca7bab5ee9531bea1c47a1a1f237e7543f6d4a2633
Pointer size: 132 Bytes
Size of remote file: 2.29 MB

backend/evals/datasets/extreme_receipt_photo.txt ADDED

	@@ -0,0 +1,17 @@

+BREW & BEAN COFFEE Co.
+412 Harbor Lane, Portland OR
+Receipt #R-88341  Reg 02
+Date: 2026-06-02  14:37
+Currency: USD
+--------------------------------
+Flat White         2 x 4.75   9.50
+Butter Croissant   3 x 3.25   9.75
+Cold Brew Growler  1 x 14.00 14.00
+Loyalty discount             -2.50
+--------------------------------
+Subtotal                     30.75
+Tax 8.8%                      2.71
+TOTAL                        33.46
+Payment: VISA ****4421
+--------------------------------
+Thank you! brewandbean.example
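The amounts in the two extreme-tier sidecars with line items are internally consistent, which can be checked with exact decimal arithmetic. The sketch below uses only figures from the `.txt` files above; `check_totals` is a hypothetical helper, and rounding tax half-up to the cent is an assumption. Note that the receipt's `line_items` in the ground-truth JSON sum to 33.25, and the 30.75 subtotal is reached only after the 2.50 loyalty discount that appears in the sidecar text.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def check_totals(lines, discount, tax_rate, subtotal, tax, total) -> dict:
    """Recompute subtotal, tax and total from line items and compare to the stated values."""
    line_sum = sum((Decimal(qty) * Decimal(price) for qty, price in lines), Decimal("0"))
    expected_subtotal = line_sum - Decimal(discount)
    expected_tax = (expected_subtotal * Decimal(tax_rate)).quantize(CENT, rounding=ROUND_HALF_UP)
    expected_total = expected_subtotal + expected_tax
    return {
        "subtotal_ok": expected_subtotal == Decimal(subtotal),
        "tax_ok": expected_tax == Decimal(tax),
        "total_ok": expected_total == Decimal(total),
    }


# extreme_receipt_photo: three items, 2.50 discount, 8.8% tax
receipt = check_totals(
    lines=[(2, "4.75"), (3, "3.25"), (1, "14.00")],
    discount="2.50", tax_rate="0.088",
    subtotal="30.75", tax="2.71", total="33.46")

# extreme_po_collage: three items, no discount, 8.25% tax
purchase_order = check_totals(
    lines=[(24, "189.00"), (60, "22.40"), (12, "310.00")],
    discount="0", tax_rate="0.0825",
    subtotal="9600.00", tax="792.00", total="10392.00")
```

Passing the amounts as strings keeps `Decimal` exact; building them from floats would reintroduce binary rounding error.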

backend/evals/ocr_backend_report.json ADDED

	@@ -0,0 +1,135 @@

+{
+  "generated_at": 1781325013.2385988,
+  "mode": "prototype",
+  "samples": [
+    "invoice_scanned_basic",
+    "po_scanned"
+  ],
+  "available_backends": [
+    "tesseract",
+    "sidecar"
+  ],
+  "functional_backends": [
+    "tesseract",
+    "sidecar"
+  ],
+  "backends": [
+    {
+      "name": "minicpm",
+      "tier": "vlm",
+      "requires": "MINICPM_BASE_URL (+ MINICPM_API_KEY) \u2014 vLLM/llama.cpp serving MiniCPM-V-4.6",
+      "available": false,
+      "tested": false,
+      "ok": false,
+      "cases": []
+    },
+    {
+      "name": "cohere",
+      "tier": "local",
+      "requires": "transformers + COHERE_OCR_ENABLE=true (downloads COHERE_OCR_MODEL)",
+      "available": false,
+      "tested": false,
+      "ok": false,
+      "cases": []
+    },
+    {
+      "name": "llamaparse",
+      "tier": "api",
+      "requires": "llama-cloud-services + LLAMA_CLOUD_API_KEY",
+      "available": false,
+      "tested": false,
+      "ok": false,
+      "cases": []
+    },
+    {
+      "name": "tesseract",
+      "tier": "local",
+      "requires": "pytesseract + tesseract binary",
+      "available": true,
+      "tested": true,
+      "ok": true,
+      "cases": [
+        {
+          "sample": "invoice_scanned_basic",
+          "ok": true,
+          "engine": "tesseract",
+          "chars": 257,
+          "simulated": false,
+          "expected_found": [
+            "INVOICE",
+            "NORTHWIND",
+            "INV-7741",
+            "CONTOSO"
+          ],
+          "latency_ms": 412.7,
+          "excerpt": "INVOICE\n\nInvoice Number: INV-7741\nInvoice Date: 2026-03-22\nDue Date: 2026-04-21\nFrom: Northwind Traders\nBill To: Contoso Ltd\n\nCurrency: USD\nDescription aty unit",
+          "error": null
+        },
+        {
+          "sample": "po_scanned",
+          "ok": true,
+          "engine": "tesseract",
+          "chars": 329,
+          "simulated": false,
+          "expected_found": [
+            "PURCHASE",
+            "INITECH"
+          ],
+          "latency_ms": 440.5,
+          "excerpt": "PURCHASE ORDER\n\nPurchase Order Number: P0-100483\norder Date: 2026-04-11\n\nDelivery Date: 2026-05-01\n\nVendor: Initech Supplies\n\nBuyer: Contoso Ops\n\nShip To: 9 Mar",
+          "error": null
+        }
+      ]
+    },
+    {
+      "name": "easyocr",
+      "tier": "local",
+      "requires": "easyocr",
+      "available": false,
+      "tested": false,
+      "ok": false,
+      "cases": []
+    },
+    {
+      "name": "sidecar",
+      "tier": "offline",
+      "requires": "nothing (reads <stem>.txt sidecar)",
+      "available": true,
+      "tested": true,
+      "ok": true,
+      "cases": [
+        {
+          "sample": "invoice_scanned_basic",
+          "ok": true,
+          "engine": "sidecar-fallback",
+          "chars": 323,
+          "simulated": true,
+          "expected_found": [
+            "INVOICE",
+            "NORTHWIND",
+            "INV-7741",
+            "CONTOSO"
+          ],
+          "latency_ms": 0.3,
+          "excerpt": "INVOICE\n\nInvoice Number: INV-7741\nInvoice Date: 2026-03-22\nDue Date: 2026-04-21\nFrom: Northwind Traders\nBill To: Contoso Ltd\nCurrency: USD\n\nDescription         ",
+          "error": null
+        },
+        {
+          "sample": "po_scanned",
+          "ok": true,
+          "engine": "sidecar-fallback",
+          "chars": 389,
+          "simulated": true,
+          "expected_found": [
+            "PURCHASE",
+            "INITECH",
+            "PO-100483"
+          ],
+          "latency_ms": 0.2,
+          "excerpt": "PURCHASE ORDER\n\nPurchase Order Number: PO-100483\nOrder Date: 2026-04-11\nDelivery Date: 2026-05-01\nVendor: Initech Supplies\nBuyer: Contoso Ops\nShip To: 9 Market ",
+          "error": null
+        }
+      ]
+    }
+  ]
+}
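
The `expected_found` lists above differ between backends for `po_scanned`: tesseract read the order number as `P0-100483` (zero instead of the letter O), so `PO-100483` is missing from its list, while the sidecar text contains all three tokens. A minimal sketch of how such a token check could work is shown below. It assumes a case-insensitive substring match, which is consistent with `NORTHWIND` matching "Northwind Traders" in the excerpts; the function name `find_expected` is hypothetical and is not taken from the repository.

```python
from typing import List


def find_expected(ocr_text: str, expected_tokens: List[str]) -> List[str]:
    """Return the expected tokens that occur in the OCR text.

    Assumption: matching is a case-insensitive substring test. The real
    evaluation code may normalise text differently.
    """
    haystack = ocr_text.upper()
    return [tok for tok in expected_tokens if tok.upper() in haystack]


# Text fragments copied from the report excerpts above.
tesseract_text = "PURCHASE ORDER\n\nPurchase Order Number: P0-100483\nVendor: Initech Supplies"
sidecar_text = "PURCHASE ORDER\n\nPurchase Order Number: PO-100483\nVendor: Initech Supplies"
tokens = ["PURCHASE", "INITECH", "PO-100483"]

tesseract_hits = find_expected(tesseract_text, tokens)  # PO-100483 is missed
sidecar_hits = find_expected(sidecar_text, tokens)      # all three are found
```

A strict substring test like this penalises single-character OCR confusions (O versus 0) in full, which is why a character-level metric is reported separately in the quality report.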

backend/evals/ocr_quality_report.json ADDED

	@@ -0,0 +1,313 @@

+{
+  "generated_at": 1781326743.1110458,
+  "note": "CER/WER vs .txt sidecar reference; field accuracy vs gt.json. sidecar = reference text source (CER\u22480 by construction).",
+  "models": [
+    {
+      "lab": "OpenBMB",
+      "name": "MiniCPM-V-4.6",
+      "params_b": 8.0,
+      "size_gb_int4": 5.5,
+      "modality": "vision-language (OCR + reasoning)",
+      "role": "OCR backend + LLM extractor",
+      "available": true
+    },
+    {
+      "lab": "OpenBMB",
+      "name": "MiniCPM-o-4.5",
+      "params_b": 8.0,
+      "size_gb_int4": 5.5,
+      "modality": "omni (vision/audio) VLM",
+      "role": "alt OCR/VLM",
+      "available": true
+    },
+    {
+      "lab": "OpenBMB",
+      "name": "MiniCPM3-4B",
+      "params_b": 4.0,
+      "size_gb_int4": 2.8,
+      "modality": "text LLM (reasoning + function-calling, 32k ctx)",
+      "role": "ERP reasoning \u00b7 NLQ\u2192SQL \u00b7 report summarization (fine-tune target)",
+      "available": true
+    },
+    {
+      "lab": "Cohere",
+      "name": "Aya-Vision-8B",
+      "params_b": 8.0,
+      "size_gb_int4": 6.0,
+      "modality": "vision-language (OCR/VQA, 23 langs)",
+      "role": "OCR backend",
+      "available": false
+    },
+    {
+      "lab": "Cohere",
+      "name": "Aya-Vision-32B",
+      "params_b": 32.0,
+      "size_gb_int4": 18.0,
+      "modality": "vision-language (OCR/VQA)",
+      "role": "alt OCR backend (max-quality small)",
+      "available": false
+    },
+    {
+      "lab": "Cohere",
+      "name": "Command R7B",
+      "params_b": 7.0,
+      "size_gb_int4": 5.0,
+      "modality": "text LLM (RAG + tool-use + reasoning, 128k ctx)",
+      "role": "ERP RAG \u00b7 NLQ \u00b7 grounded reasoning",
+      "available": false
+    },
+    {
+      "lab": "Black Forest Labs",
+      "name": "FLUX.1 [dev]",
+      "params_b": 12.0,
+      "size_gb_int4": 12.0,
+      "modality": "text-to-image GENERATION",
+      "role": "synthetic test-document generator (not OCR)",
+      "available": false
+    },
+    {
+      "lab": "Black Forest Labs",
+      "name": "FLUX.1 [schnell]",
+      "params_b": 12.0,
+      "size_gb_int4": 12.0,
+      "modality": "text-to-image GENERATION (fast)",
+      "role": "synthetic test-document generator",
+      "available": false
+    }
+  ],
+  "backends": [
+    {
+      "backend": "minicpm",
+      "model": "MiniCPM-V-4.6",
+      "params_b": 8.0,
+      "size_gb": 5.5,
+      "lab": "OpenBMB",
+      "is_reference": false,
+      "cer": 0.0262,
+      "wer": 0.0876,
+      "field_exact_match": 0.907,
+      "field_f1": 0.9397,
+      "avg_latency_ms": 6524.8167,
+      "avg_cost_usd": 0.0002,
+      "samples_scored": 6,
+      "per_sample": [
+        {
+          "sample": "invoice_scanned_basic",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.889,
+          "field_f1": 0.9,
+          "latency_ms": 5560.8,
+          "cost_usd": 0.0001952,
+          "confidence": 0.7
+        },
+        {
+          "sample": "receipt_scanned",
+          "cer": 0.0942,
+          "wer": 0.3103,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 4218.7,
+          "cost_usd": 0.0001883,
+          "confidence": 0.98
+        },
+        {
+          "sample": "po_scanned",
+          "cer": 0.0368,
+          "wer": 0.1277,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 4404.9,
+          "cost_usd": 0.0001835,
+          "confidence": 0.98
+        },
+        {
+          "sample": "contract_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.636,
+          "field_f1": 0.8,
+          "latency_ms": 6532.2,
+          "cost_usd": 0.000166,
+          "confidence": 0.98
+        },
+        {
+          "sample": "subscription_memo_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.917,
+          "field_f1": 0.938,
+          "latency_ms": 5010.4,
+          "cost_usd": 0.0001924,
+          "confidence": 0.98
+        },
+        {
+          "sample": "complex_invoice_messy",
+          "cer": null,
+          "wer": null,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 13421.9,
+          "cost_usd": 0.0004414,
+          "confidence": 0.98
+        }
+      ]
+    },
+    {
+      "backend": "tesseract",
+      "model": "tesseract",
+      "params_b": null,
+      "size_gb": null,
+      "lab": "classic",
+      "is_reference": false,
+      "cer": 0.1468,
+      "wer": 0.1848,
+      "field_exact_match": 0.907,
+      "field_f1": 0.9397,
+      "avg_latency_ms": 3436.8667,
+      "avg_cost_usd": 0.0001,
+      "samples_scored": 6,
+      "per_sample": [
+        {
+          "sample": "invoice_scanned_basic",
+          "cer": 0.1225,
+          "wer": 0.1389,
+          "field_exact": 0.889,
+          "field_f1": 0.9,
+          "latency_ms": 3698.6,
+          "cost_usd": 0.0001242,
+          "confidence": 0.68
+        },
+        {
+          "sample": "receipt_scanned",
+          "cer": 0.4555,
+          "wer": 0.5172,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 2861.1,
+          "cost_usd": 0.0001207,
+          "confidence": 0.96
+        },
+        {
+          "sample": "po_scanned",
+          "cer": 0.0951,
+          "wer": 0.1489,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 3390.6,
+          "cost_usd": 0.000118,
+          "confidence": 0.96
+        },
+        {
+          "sample": "contract_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.636,
+          "field_f1": 0.8,
+          "latency_ms": 2336.4,
+          "cost_usd": 7.97e-05,
+          "confidence": 0.96
+        },
+        {
+          "sample": "subscription_memo_scanned",
+          "cer": 0.061,
+          "wer": 0.119,
+          "field_exact": 0.917,
+          "field_f1": 0.938,
+          "latency_ms": 2243.8,
+          "cost_usd": 0.0001211,
+          "confidence": 0.96
+        },
+        {
+          "sample": "complex_invoice_messy",
+          "cer": null,
+          "wer": null,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 6090.7,
+          "cost_usd": 0.0002828,
+          "confidence": 0.96
+        }
+      ]
+    },
+    {
+      "backend": "sidecar",
+      "model": "sidecar",
+      "params_b": null,
+      "size_gb": null,
+      "lab": "classic",
+      "is_reference": true,
+      "cer": 0.0,
+      "wer": 0.0,
+      "field_exact_match": 0.907,
+      "field_f1": 0.9397,
+      "avg_latency_ms": 3235.3167,
+      "avg_cost_usd": 0.0001,
+      "samples_scored": 6,
+      "per_sample": [
+        {
+          "sample": "invoice_scanned_basic",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.889,
+          "field_f1": 0.9,
+          "latency_ms": 1697.6,
+          "cost_usd": 9.45e-05,
+          "confidence": 0.66
+        },
+        {
+          "sample": "receipt_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 2126.2,
+          "cost_usd": 0.0001212,
+          "confidence": 0.94
+        },
+        {
+          "sample": "po_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 2194.4,
+          "cost_usd": 0.0001184,
+          "confidence": 0.94
+        },
+        {
+          "sample": "contract_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.636,
+          "field_f1": 0.8,
+          "latency_ms": 1522.1,
+          "cost_usd": 7.97e-05,
+          "confidence": 0.94
+        },
+        {
+          "sample": "subscription_memo_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.917,
+          "field_f1": 0.938,
+          "latency_ms": 1632.7,
+          "cost_usd": 9.16e-05,
+          "confidence": 0.94
+        },
+        {
+          "sample": "complex_invoice_messy",
+          "cer": null,
+          "wer": null,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 10238.9,
+          "cost_usd": 0.0002297,
+          "confidence": 0.98
+        }
+      ]
+    }
+  ],
+  "best_ocr_quality": "minicpm",
+  "best_document_analysis": "minicpm"
+}
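
The report's `note` field says CER and WER are measured against the `.txt` sidecar reference. For readers unfamiliar with these metrics, here is a minimal sketch of the standard definitions: Levenshtein edit distance divided by the reference length, over characters for CER and over whitespace-separated words for WER. The function names are illustrative and the repository's `scorers.py` may normalise text differently. Returning `None` when there is no reference text is an assumption; it would be one way to produce the `null` values seen for `complex_invoice_messy`.

```python
from typing import Optional, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> Optional[float]:
    """Character error rate; None if there is no reference text (assumption)."""
    if not reference:
        return None
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> Optional[float]:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        return None
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# The O/0 confusion from the po_scanned sample: one wrong character in nine.
example_cer = cer("PO-100483", "P0-100483")
example_wer = wer("Purchase Order Number: PO-100483",
                  "Purchase Order Number: P0-100483")
```

The example shows why WER is usually higher than CER in the tables above: a single misread character costs one of nine characters but one of four words.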

backend/evals/report.json ADDED

	@@ -0,0 +1,980 @@

+{
+  "aggregate": {
+    "overall": {
+      "documents": 13,
+      "exact_match": 0.932,
+      "field_f1": 0.938,
+      "line_item_f1": 0.667,
+      "financial_consistency_rate": 1.0,
+      "doc_type_accuracy": 1.0
+    },
+    "by_type": {
+      "contract": {
+        "documents": 2,
+        "exact_match": 0.818,
+        "field_f1": 0.856,
+        "financial_consistency_rate": 1.0
+      },
+      "invoice": {
+        "documents": 5,
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "financial_consistency_rate": 1.0
+      },
+      "purchase_order": {
+        "documents": 2,
+        "exact_match": 0.863,
+        "field_f1": 0.925,
+        "financial_consistency_rate": 1.0
+      },
+      "receipt": {
+        "documents": 2,
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "financial_consistency_rate": 1.0
+      },
+      "subscription_memo": {
+        "documents": 2,
+        "exact_match": 0.875,
+        "field_f1": 0.817,
+        "financial_consistency_rate": 1.0
+      }
+    },
+    "by_difficulty": {
+      "standard": {
+        "documents": 10,
+        "exact_match": 0.911,
+        "field_f1": 0.92,
+        "financial_consistency_rate": 1.0
+      },
+      "dense_table": {
+        "documents": 1,
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "financial_consistency_rate": 1.0
+      },
+      "multicurrency": {
+        "documents": 1,
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "financial_consistency_rate": 1.0
+      },
+      "missing_fields": {
+        "documents": 1,
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "financial_consistency_rate": 1.0
+      }
+    }
+  },
+  "documents": [
+    {
+      "doc_id": "contract_msa_digital",
+      "predicted_type": "contract",
+      "difficulty": "standard",
+      "channel": "digital",
+      "confidence": 1.0,
+      "requires_review": false,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "contract",
+        "exact_match": 0.909,
+        "field_f1": 0.897,
+        "precision": 1.0,
+        "recall": 0.812,
+        "fields_scored": 11,
+        "line_item_f1": null,
+        "line_items_applicable": false,
+        "financial_consistent": true,
+        "per_field": {
+          "contract_number": {
+            "exact": true,
+            "pred": "MSA-2026-014",
+            "gt": "MSA-2026-014"
+          },
+          "title": {
+            "exact": false,
+            "pred": null,
+            "gt": "Master Services Agreement"
+          },
+          "party_a": {
+            "exact": true,
+            "pred": "Acme Industrial Supplies",
+            "gt": "Acme Industrial Supplies"
+          },
+          "party_b": {
+            "exact": true,
+            "pred": "Globex Corporation",
+            "gt": "Globex Corporation"
+          },
+          "effective_date": {
+            "exact": true,
+            "pred": "2026-01-01",
+            "gt": "2026-01-01"
+          },
+          "expiration_date": {
+            "exact": true,
+            "pred": "2027-12-31",
+            "gt": "2027-12-31"
+          },
+          "contract_value": {
+            "exact": true,
+            "pred": 250000.0,
+            "gt": 250000.0
+          },
+          "currency": {
+            "exact": true,
+            "pred": "USD",
+            "gt": "USD"
+          },
+          "governing_law": {
+            "exact": true,
+            "pred": "Delaware",
+            "gt": "Delaware"
+          },
+          "auto_renew": {
+            "exact": true,
+            "pred": true,
+            "gt": true
+          },
+          "termination_notice_days": {
+            "exact": true,
+            "pred": 60,
+            "gt": 60
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "contract_scanned",
+      "predicted_type": "contract",
+      "difficulty": "standard",
+      "channel": "scanned",
+      "confidence": 0.8,
+      "requires_review": true,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "contract",
+        "exact_match": 0.727,
+        "field_f1": 0.815,
+        "precision": 1.0,
+        "recall": 0.688,
+        "fields_scored": 11,
+        "line_item_f1": null,
+        "line_items_applicable": false,
+        "financial_consistent": true,
+        "per_field": {
+          "contract_number": {
+            "exact": true,
+            "pred": "NDA-7781",
+            "gt": "NDA-7781"
+          },
+          "title": {
+            "exact": false,
+            "pred": null,
+            "gt": "Mutual Non-Disclosure Agreement"
+          },
+          "party_a": {
+            "exact": true,
+            "pred": "Stark Components",
+            "gt": "Stark Components"
+          },
+          "party_b": {
+            "exact": true,
+            "pred": "Wayne Enterprises",
+            "gt": "Wayne Enterprises"
+          },
+          "effective_date": {
+            "exact": true,
+            "pred": "2026-03-15",
+            "gt": "2026-03-15"
+          },
+          "expiration_date": {
+            "exact": true,
+            "pred": "2029-03-14",
+            "gt": "2029-03-14"
+          },
+          "contract_value": {
+            "exact": false,
+            "pred": null,
+            "gt": 0.0
+          },
+          "currency": {
+            "exact": false,
+            "pred": null,
+            "gt": "USD"
+          },
+          "governing_law": {
+            "exact": true,
+            "pred": "New York",
+            "gt": "New York"
+          },
+          "auto_renew": {
+            "exact": true,
+            "pred": false,
+            "gt": false
+          },
+          "termination_notice_days": {
+            "exact": true,
+            "pred": 30,
+            "gt": 30
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "invoice_acme_digital",
+      "predicted_type": "invoice",
+      "difficulty": "standard",
+      "channel": "digital",
+      "confidence": 1.0,
+      "requires_review": false,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "invoice",
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "fields_scored": 9,
+        "line_item_f1": 1.0,
+        "line_items_applicable": true,
+        "financial_consistent": true,
+        "per_field": {
+          "invoice_number": {
+            "exact": true,
+            "pred": "INV-1001",
+            "gt": "INV-1001"
+          },
+          "issue_date": {
+            "exact": true,
+            "pred": "2026-07-15",
+            "gt": "2026-07-15"
+          },
+          "due_date": {
+            "exact": true,
+            "pred": "2026-08-14",
+            "gt": "2026-08-14"
+          },
+          "vendor_name": {
+            "exact": true,
+            "pred": "Acme Industrial Supplies",
+            "gt": "Acme Industrial Supplies"
+          },
+          "bill_to_name": {
+            "exact": true,
+            "pred": "Globex Corporation",
+            "gt": "Globex Corporation"
+          },
+          "currency": {
+            "exact": true,
+            "pred": "USD",
+            "gt": "USD"
+          },
+          "subtotal": {
+            "exact": true,
+            "pred": 300.0,
+            "gt": 300.0
+          },
+          "tax_amount": {
+            "exact": true,
+            "pred": 30.0,
+            "gt": 30.0
+          },
+          "total": {
+            "exact": true,
+            "pred": 330.0,
+            "gt": 330.0
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "invoice_dense_table",
+      "predicted_type": "invoice",
+      "difficulty": "dense_table",
+      "channel": "digital",
+      "confidence": 1.0,
+      "requires_review": false,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "invoice",
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "fields_scored": 9,
+        "line_item_f1": 1.0,
+        "line_items_applicable": true,
+        "financial_consistent": true,
+        "per_field": {
+          "invoice_number": {
+            "exact": true,
+            "pred": "INV-9120",
+            "gt": "INV-9120"
+          },
+          "issue_date": {
+            "exact": true,
+            "pred": "2026-06-01",
+            "gt": "2026-06-01"
+          },
+          "due_date": {
+            "exact": true,
+            "pred": "2026-07-01",
+            "gt": "2026-07-01"
+          },
+          "vendor_name": {
+            "exact": true,
+            "pred": "Wayne Enterprises",
+            "gt": "Wayne Enterprises"
+          },
+          "bill_to_name": {
+            "exact": true,
+            "pred": "Oscorp",
+            "gt": "Oscorp"
+          },
+          "currency": {
+            "exact": true,
+            "pred": "USD",
+            "gt": "USD"
+          },
+          "subtotal": {
+            "exact": true,
+            "pred": 2140.0,
+            "gt": 2140.0
+          },
+          "tax_amount": {
+            "exact": true,
+            "pred": 214.0,
+            "gt": 214.0
+          },
+          "total": {
+            "exact": true,
+            "pred": 2354.0,
+            "gt": 2354.0
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "invoice_globalparts_eur",
+      "predicted_type": "invoice",
+      "difficulty": "multicurrency",
+      "channel": "digital",
+      "confidence": 1.0,
+      "requires_review": false,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "invoice",
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "fields_scored": 9,
+        "line_item_f1": 1.0,
+        "line_items_applicable": true,
+        "financial_consistent": true,
+        "per_field": {
+          "invoice_number": {
+            "exact": true,
+            "pred": "GP-2026-558",
+            "gt": "GP-2026-558"
+          },
+          "issue_date": {
+            "exact": true,
+            "pred": "2026-05-03",
+            "gt": "2026-05-03"
+          },
+          "due_date": {
+            "exact": true,
+            "pred": "2026-06-02",
+            "gt": "2026-06-02"
+          },
+          "vendor_name": {
+            "exact": true,
+            "pred": "GlobalParts GmbH",
+            "gt": "GlobalParts GmbH"
+          },
+          "bill_to_name": {
+            "exact": true,
+            "pred": "Initech LLC",
+            "gt": "Initech LLC"
+          },
+          "currency": {
+            "exact": true,
+            "pred": "EUR",
+            "gt": "EUR"
+          },
+          "subtotal": {
+            "exact": true,
+            "pred": 1840.0,
+            "gt": 1840.0
+          },
+          "tax_amount": {
+            "exact": true,
+            "pred": 349.6,
+            "gt": 349.6
+          },
+          "total": {
+            "exact": true,
+            "pred": 2189.6,
+            "gt": 2189.6
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "invoice_missing_total",
+      "predicted_type": "invoice",
+      "difficulty": "missing_fields",
+      "channel": "digital",
+      "confidence": 0.72,
+      "requires_review": true,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "invoice",
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "fields_scored": 6,
+        "line_item_f1": 1.0,
+        "line_items_applicable": true,
+        "financial_consistent": true,
+        "per_field": {
+          "invoice_number": {
+            "exact": true,
+            "pred": "INV-3300",
+            "gt": "INV-3300"
+          },
+          "issue_date": {
+            "exact": true,
+            "pred": "2026-02-10",
+            "gt": "2026-02-10"
+          },
+          "vendor_name": {
+            "exact": true,
+            "pred": "Stark Components",
+            "gt": "Stark Components"
+          },
+          "currency": {
+            "exact": true,
+            "pred": "USD",
+            "gt": "USD"
+          },
+          "subtotal": {
+            "exact": true,
+            "pred": 1200.0,
+            "gt": 1200.0
+          },
+          "tax_amount": {
+            "exact": true,
+            "pred": 96.0,
+            "gt": 96.0
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "invoice_scanned_basic",
+      "predicted_type": "invoice",
+      "difficulty": "standard",
+      "channel": "scanned",
+      "confidence": 0.8,
+      "requires_review": true,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "invoice",
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "fields_scored": 9,
+        "line_item_f1": 0.0,
+        "line_items_applicable": true,
+        "financial_consistent": true,
+        "per_field": {
+          "invoice_number": {
+            "exact": true,
+            "pred": "INV-7741",
+            "gt": "INV-7741"
+          },
+          "issue_date": {
+            "exact": true,
+            "pred": "2026-03-22",
+            "gt": "2026-03-22"
+          },
+          "due_date": {
+            "exact": true,
+            "pred": "2026-04-21",
+            "gt": "2026-04-21"
+          },
+          "vendor_name": {
+            "exact": true,
+            "pred": "Northwind Traders",
+            "gt": "Northwind Traders"
+          },
+          "bill_to_name": {
+            "exact": true,
+            "pred": "Contoso Ltd",
+            "gt": "Contoso Ltd"
+          },
+          "currency": {
+            "exact": true,
+            "pred": "USD",
+            "gt": "USD"
+          },
+          "subtotal": {
+            "exact": true,
+            "pred": 540.0,
+            "gt": 540.0
+          },
+          "tax_amount": {
+            "exact": true,
+            "pred": 43.2,
+            "gt": 43.2
+          },
+          "total": {
+            "exact": true,
+            "pred": 583.2,
+            "gt": 583.2
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "po_acme_digital",
+      "predicted_type": "purchase_order",
+      "difficulty": "standard",
+      "channel": "digital",
+      "confidence": 1.0,
+      "requires_review": false,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "purchase_order",
+        "exact_match": 0.909,
+        "field_f1": 0.941,
+        "precision": 0.941,
+        "recall": 0.941,
+        "fields_scored": 11,
+        "line_item_f1": 1.0,
+        "line_items_applicable": true,
+        "financial_consistent": true,
+        "per_field": {
+          "order_number": {
+            "exact": false,
+            "pred": "Purchase",
+            "gt": "PO-100481"
+          },
+          "order_date": {
+            "exact": true,
+            "pred": "2026-07-02",
+            "gt": "2026-07-02"
+          },
+          "delivery_date": {
+            "exact": true,
+            "pred": "2026-07-20",
+            "gt": "2026-07-20"
+          },
+          "vendor_name": {
+            "exact": true,
+            "pred": "Acme Industrial",
+            "gt": "Acme Industrial"
+          },
+          "buyer_name": {
+            "exact": true,
+            "pred": "Globex Procurement",
+            "gt": "Globex Procurement"
+          },
+          "ship_to": {
+            "exact": true,
+            "pred": "12 Industrial Way, Springfield",
+            "gt": "12 Industrial Way, Springfield"
+          },
+          "currency": {
+            "exact": true,
+            "pred": "USD",
+            "gt": "USD"
+          },
+          "subtotal": {
+            "exact": true,
+            "pred": 12000.0,
+            "gt": 12000.0
+          },
+          "tax_amount": {
+            "exact": true,
+            "pred": 450.0,
+            "gt": 450.0
+          },
+          "total": {
+            "exact": true,
+            "pred": 12450.0,
+            "gt": 12450.0
+          },
+          "payment_terms": {
+            "exact": true,
+            "pred": "Net 30",
+            "gt": "Net 30"
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "po_scanned",
+      "predicted_type": "purchase_order",
+      "difficulty": "standard",
+      "channel": "scanned",
+      "confidence": 0.52,
+      "requires_review": true,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "purchase_order",
+        "exact_match": 0.818,
+        "field_f1": 0.909,
+        "precision": 0.938,
+        "recall": 0.882,
+        "fields_scored": 11,
+        "line_item_f1": 0.0,
+        "line_items_applicable": true,
+        "financial_consistent": true,
+        "per_field": {
+          "order_number": {
+            "exact": false,
+            "pred": "Purchase",
+            "gt": "PO-100483"
+          },
+          "order_date": {
+            "exact": true,
+            "pred": "2026-04-11",
+            "gt": "2026-04-11"
+          },
+          "delivery_date": {
+            "exact": true,
+            "pred": "2026-05-01",
+            "gt": "2026-05-01"
+          },
+          "vendor_name": {
+            "exact": true,
+            "pred": "Initech Supplies",
+            "gt": "Initech Supplies"
+          },
+          "buyer_name": {
+            "exact": true,
+            "pred": "Contoso Ops",
+            "gt": "Contoso Ops"
+          },
+          "ship_to": {
+            "exact": true,
+            "pred": "9 Market St, Metropolis",
+            "gt": "9 Market St, Metropolis"
+          },
+          "currency": {
+            "exact": true,
+            "pred": "USD",
+            "gt": "USD"
+          },
+          "subtotal": {
+            "exact": true,
+            "pred": 900.0,
+            "gt": 900.0
+          },
+          "tax_amount": {
+            "exact": true,
+            "pred": 80.0,
+            "gt": 80.0
+          },
+          "total": {
+            "exact": false,
+            "pred": null,
+            "gt": 980.0
+          },
+          "payment_terms": {
+            "exact": true,
+            "pred": "Net 15",
+            "gt": "Net 15"
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "receipt_digital",
+      "predicted_type": "receipt",
+      "difficulty": "standard",
+      "channel": "digital",
+      "confidence": 1.0,
+      "requires_review": false,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "receipt",
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "fields_scored": 7,
+        "line_item_f1": 1.0,
+        "line_items_applicable": true,
+        "financial_consistent": true,
+        "per_field": {
+          "merchant": {
+            "exact": true,
+            "pred": "City Hardware",
+            "gt": "City Hardware"
+          },
+          "date": {
+            "exact": true,
+            "pred": "2026-06-18",
+            "gt": "2026-06-18"
+          },
+          "currency": {
+            "exact": true,
+            "pred": "USD",
+            "gt": "USD"
+          },
+          "subtotal": {
+            "exact": true,
+            "pred": 47.0,
+            "gt": 47.0
+          },
+          "tax_amount": {
+            "exact": true,
+            "pred": 3.76,
+            "gt": 3.76
+          },
+          "total": {
+            "exact": true,
+            "pred": 50.76,
+            "gt": 50.76
+          },
+          "payment_method": {
+            "exact": true,
+            "pred": "Visa card ending 4242",
+            "gt": "Visa card ending 4242"
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "receipt_scanned",
+      "predicted_type": "receipt",
+      "difficulty": "standard",
+      "channel": "scanned",
+      "confidence": 0.8,
+      "requires_review": true,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "receipt",
+        "exact_match": 1.0,
+        "field_f1": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "fields_scored": 7,
+        "line_item_f1": 0.0,
+        "line_items_applicable": true,
+        "financial_consistent": true,
+        "per_field": {
+          "merchant": {
+            "exact": true,
+            "pred": "QuickMart",
+            "gt": "QuickMart"
+          },
+          "date": {
+            "exact": true,
+            "pred": "2026-05-30",
+            "gt": "2026-05-30"
+          },
+          "currency": {
+            "exact": true,
+            "pred": "USD",
+            "gt": "USD"
+          },
+          "subtotal": {
+            "exact": true,
+            "pred": 23.5,
+            "gt": 23.5
+          },
+          "tax_amount": {
+            "exact": true,
+            "pred": 1.88,
+            "gt": 1.88
+          },
+          "total": {
+            "exact": true,
+            "pred": 25.38,
+            "gt": 25.38
+          },
+          "payment_method": {
+            "exact": true,
+            "pred": "Cash",
+            "gt": "Cash"
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "subscription_memo_pos",
+      "predicted_type": "subscription_memo",
+      "difficulty": "standard",
+      "channel": "digital",
+      "confidence": 1.0,
+      "requires_review": false,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "subscription_memo",
+        "exact_match": 0.917,
+        "field_f1": 0.875,
+        "precision": 0.933,
+        "recall": 0.824,
+        "fields_scored": 12,
+        "line_item_f1": null,
+        "line_items_applicable": false,
+        "financial_consistent": true,
+        "per_field": {
+          "memo_number": {
+            "exact": true,
+            "pred": "SUB-2026-0091",
+            "gt": "SUB-2026-0091"
+          },
+          "subscription_name": {
+            "exact": false,
+            "pred": "MEMO",
+            "gt": "POS Cloud Platform"
+          },
+          "vendor_name": {
+            "exact": true,
+            "pred": "Initech Supplies",
+            "gt": "Initech Supplies"
+          },
+          "account_id": {
+            "exact": true,
+            "pred": "ACC-55821",
+            "gt": "ACC-55821"
+          },
+          "plan": {
+            "exact": true,
+            "pred": "Enterprise (500 lanes)",
+            "gt": "Enterprise (500 lanes)"
+          },
+          "billing_cycle": {
+            "exact": true,
+            "pred": "annual",
+            "gt": "annual"
+          },
+          "start_date": {
+            "exact": true,
+            "pred": "2025-08-01",
+            "gt": "2025-08-01"
+          },
+          "renewal_date": {
+            "exact": true,
+            "pred": "2026-08-01",
+            "gt": "2026-08-01"
+          },
+          "amount": {
+            "exact": true,
+            "pred": 84000.0,
+            "gt": 84000.0
+          },
+          "currency": {
+            "exact": true,
+            "pred": "USD",
+            "gt": "USD"
+          },
+          "auto_renew": {
+            "exact": true,
+            "pred": true,
+            "gt": true
+          },
+          "status": {
+            "exact": true,
+            "pred": "active",
+            "gt": "active"
+          }
+        }
+      }
+    },
+    {
+      "doc_id": "subscription_memo_scanned",
+      "predicted_type": "subscription_memo",
+      "difficulty": "standard",
+      "channel": "scanned",
+      "confidence": 0.8,
+      "requires_review": true,
+      "cost_usd": 0.0,
+      "score": {
+        "doc_type": "subscription_memo",
+        "exact_match": 0.833,
+        "field_f1": 0.759,
+        "precision": 0.846,
+        "recall": 0.688,
+        "fields_scored": 12,
+        "line_item_f1": null,
+        "line_items_applicable": false,
+        "financial_consistent": true,
+        "per_field": {
+          "memo_number": {
+            "exact": true,
+            "pred": "SUB-2026-0145",
+            "gt": "SUB-2026-0145"
+          },
+          "subscription_name": {
+            "exact": false,
+            "pred": "MEMO",
+            "gt": "Store Wi-Fi & Analytics"
+          },
+          "vendor_name": {
+            "exact": true,
+            "pred": "GlobalParts GmbH",
+            "gt": "GlobalParts GmbH"
+          },
+          "account_id": {
+            "exact": true,
+            "pred": "ACC-77310",
+            "gt": "ACC-77310"
+          },
+          "plan": {
+            "exact": true,
+            "pred": "Standard",
+            "gt": "Standard"
+          },
+          "billing_cycle": {
+            "exact": true,
+            "pred": "monthly",
+            "gt": "monthly"
+          },
+          "start_date": {
+            "exact": true,
+            "pred": "2026-01-15",
+            "gt": "2026-01-15"
+          },
+          "renewal_date": {
+            "exact": true,
+            "pred": "2026-07-15",
+            "gt": "2026-07-15"
+          },
+          "amount": {
+            "exact": true,
+            "pred": 2500.0,
+            "gt": 2500.0
+          },
+          "currency": {
+            "exact": true,
+            "pred": "EUR",
+            "gt": "EUR"
+          },
+          "auto_renew": {
+            "exact": false,
+            "pred": true,
+            "gt": false
+          },
+          "status": {
+            "exact": true,
+            "pred": "pending",
+            "gt": "pending"
+          }
+        }
+      }
+    }
+  ],
+  "routing_policy": "offline",
+  "active_tier": "offline"
+}
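
The summary metrics for `subscription_memo_pos` can be reproduced by hand from its `per_field` entries. The sketch below assumes the token-set precision/recall used in `backend/evals/scorers.py` (normalise each value, split on whitespace, compare as sets); `norm` and `token_prf` are illustrative names, not functions from the repository.

```python
import re

def norm(v):
    """Lowercase, render numbers with 2 decimals, collapse whitespace and commas."""
    if v is None:
        return ""
    if isinstance(v, bool):
        return str(v).lower()
    if isinstance(v, (int, float)):
        return f"{float(v):.2f}"
    s = re.sub(r"[\s,]+", " ", str(v).strip().lower())
    return s.replace("$", "").replace("€", "").replace("£", "").strip()

def token_prf(pairs):
    """Pool token-level TP/FP/FN over all (pred, gt) pairs, then compute P/R/F1."""
    tp = fp = fn = 0
    for pred, gt in pairs:
        p, g = set(norm(pred).split()), set(norm(gt).split())
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return round(prec, 3), round(rec, 3), round(f1, 3)

# (pred, gt) pairs copied from the subscription_memo_pos entry above.
PAIRS = [
    ("SUB-2026-0091", "SUB-2026-0091"),
    ("MEMO", "POS Cloud Platform"),          # the only miss: 1 FP, 3 FN
    ("Initech Supplies", "Initech Supplies"),
    ("ACC-55821", "ACC-55821"),
    ("Enterprise (500 lanes)", "Enterprise (500 lanes)"),
    ("annual", "annual"),
    ("2025-08-01", "2025-08-01"),
    ("2026-08-01", "2026-08-01"),
    (84000.0, 84000.0),
    ("USD", "USD"),
    (True, True),
    ("active", "active"),
]

exact_match = round(sum(norm(p) == norm(g) for p, g in PAIRS) / len(PAIRS), 3)
precision, recall, f1 = token_prf(PAIRS)
```

With 14 true-positive tokens, 1 false positive and 3 false negatives, this yields precision 14/15, recall 14/17 and F1 28/32, matching the reported 0.933, 0.824 and 0.875. A single wrong multi-word field therefore costs more recall than precision, because every missing ground-truth token counts as a false negative.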

backend/evals/run.py ADDED

	@@ -0,0 +1,147 @@

+"""Eval runner.
+Discovers <id>.gt.json files, runs the IDP pipeline on each paired document,
+scores the prediction, and prints a per-type/per-difficulty report. Also writes
+backend/evals/report.json and records rows in the metrics DB (mode='eval') so the
+dashboard's Evals tab renders the same numbers.
+Usage:
+  python -m evals.run                    # full suite (configured router)
+  python -m evals.run --type invoice     # filter by doc type
+  python -m evals.run --policy offline   # force a routing policy
+"""
+from __future__ import annotations
+import argparse
+import json
+import sys
+from pathlib import Path
+# allow `python -m evals.run` from backend/ and `python evals/run.py`
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from app.config import get_settings  # noqa: E402
+from app.metrics import MetricsStore  # noqa: E402
+from app.pipeline import process_document  # noqa: E402
+from app.providers import build_registry  # noqa: E402
+from app.router import ModelRouter  # noqa: E402
+from evals import scorers  # noqa: E402
+DOC_EXTS = (".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff")
+def discover(dataset_dir: Path, type_filter: str | None) -> list[tuple[Path, dict]]:
+    out = []
+    for gt_path in sorted(dataset_dir.glob("*.gt.json")):
+        gt = json.loads(gt_path.read_text())
+        if gt.get("_meta", {}).get("skip_eval"):
+            continue  # showcase-only docs (e.g. the complex form) aren't scored here
+        if type_filter and gt.get("doc_type") != type_filter:
+            continue
+        stem = gt_path.name[: -len(".gt.json")]
+        doc = None
+        for ext in DOC_EXTS:
+            cand = dataset_dir / f"{stem}{ext}"
+            if cand.exists():
+                doc = cand
+                break
+        if doc is None:
+            print(f"  ! no document file for {stem}, skipping")
+            continue
+        out.append((doc, gt))
+    return out
+def run_suite(type_filter: str | None = None, policy: str | None = None) -> dict:
+    settings = get_settings()
+    if policy:
+        settings.routing_policy = policy
+    registry = build_registry(settings)
+    metrics = MetricsStore(settings.metrics_db_path)
+    router = ModelRouter(registry, settings, metrics)
+    cases = discover(settings.evals_dataset_dir, type_filter)
+    if not cases:
+        print("No eval cases found. Run: python scripts/generate_samples.py")
+        return {}
+    results = []
+    for doc_path, gt in cases:
+        meta = gt.get("_meta", {})
+        clean_gt = {k: v for k, v in gt.items() if not k.startswith("_")}
+        run = process_document(
+            doc_path, router=router, settings=settings, metrics=metrics,
+            doc_id=doc_path.stem, channel=meta.get("channel"),
+            difficulty=meta.get("difficulty"), mode="eval",
+            # let the classifier do its job; do NOT force the type (we score it)
+        )
+        pred = run["_state"]["extracted"] or {}
+        score = scorers.score_document(pred, clean_gt)
+        results.append({
+            "doc_id": doc_path.stem,
+            "predicted_type": run["_state"]["doc_type"],
+            "difficulty": meta.get("difficulty", "n/a"),
+            "channel": meta.get("channel", "n/a"),
+            "confidence": run["_state"]["confidence"],
+            "requires_review": run["_state"]["requires_review"],
+            "cost_usd": run["total_cost_usd"],
+            "score": score,
+        })
+    agg = scorers.aggregate(results)
+    report = {"aggregate": agg, "documents": results,
+              "routing_policy": settings.routing_policy,
+              "active_tier": registry.capabilities()["active_tier"]}
+    # Write to the first location that accepts the file: the committed copy
+    # locally, otherwise the writable fallback (e.g. /tmp on serverless).
+    for out_path in (settings.eval_report_committed, settings.eval_report_writable):
+        try:
+            out_path.write_text(json.dumps(report, indent=2))
+            break
+        except OSError:
+            continue
+    return report
+def _print(report: dict) -> None:
+    if not report:
+        return
+    agg = report["aggregate"]
+    o = agg["overall"]
+    print("\n" + "=" * 64)
+    print(f" IDP EVAL REPORT  (tier={report['active_tier']}, policy={report['routing_policy']})")
+    print("=" * 64)
+    print(f" documents:            {o['documents']}")
+    print(f" doc-type accuracy:    {_pct(o['doc_type_accuracy'])}")
+    print(f" field exact-match:    {_pct(o['exact_match'])}")
+    print(f" field F1:             {_pct(o['field_f1'])}")
+    print(f" line-item F1:         {_pct(o['line_item_f1'])}")
+    print(f" financial consistency:{_pct(o['financial_consistency_rate'])}")
+    print("-" * 64)
+    print(f" {'by type':<18}{'docs':>5}{'exact':>9}{'F1':>9}{'fin-ok':>9}")
+    for t, g in agg["by_type"].items():
+        print(f" {t:<18}{g['documents']:>5}{_pct(g['exact_match']):>9}"
+              f"{_pct(g['field_f1']):>9}{_pct(g['financial_consistency_rate']):>9}")
+    print("-" * 64)
+    print(f" {'by difficulty':<18}{'docs':>5}{'exact':>9}{'F1':>9}{'fin-ok':>9}")
+    for d, g in agg["by_difficulty"].items():
+        print(f" {d:<18}{g['documents']:>5}{_pct(g['exact_match']):>9}"
+              f"{_pct(g['field_f1']):>9}{_pct(g['financial_consistency_rate']):>9}")
+    print("=" * 64)
+    print(" report → backend/evals/report.json\n")
+def _pct(v) -> str:
+    return "n/a" if v is None else f"{v*100:.1f}%"
+def main() -> None:
+    ap = argparse.ArgumentParser(description="Run the IDP eval suite.")
+    ap.add_argument("--type", dest="type_filter", default=None)
+    ap.add_argument("--policy", dest="policy", default=None,
+                    choices=["auto", "cheap", "smart", "offline"])
+    args = ap.parse_args()
+    report = run_suite(args.type_filter, args.policy)
+    _print(report)
+if __name__ == "__main__":
+    main()
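
The pairing rule in `discover` (strip the `.gt.json` suffix, then take the first existing file in `DOC_EXTS` order) can be tried in isolation. This is a minimal sketch on a temporary directory; `pair_documents` and `demo` are hypothetical helpers written for illustration, and the `skip_eval` and `--type` filters are left out.

```python
import json
import tempfile
from pathlib import Path

DOC_EXTS = (".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff")

def pair_documents(dataset_dir: Path) -> list[tuple[str, str]]:
    """Return (stem, document file name) for every ground-truth file that has a document."""
    pairs = []
    for gt_path in sorted(dataset_dir.glob("*.gt.json")):
        # Path.stem would give "receipt_a.gt", so slice the full suffix off instead.
        stem = gt_path.name[: -len(".gt.json")]
        for ext in DOC_EXTS:
            cand = dataset_dir / f"{stem}{ext}"
            if cand.exists():
                pairs.append((stem, cand.name))
                break  # first extension in DOC_EXTS order wins
    return pairs

def demo() -> list[tuple[str, str]]:
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        (d / "receipt_a.gt.json").write_text(json.dumps({"doc_type": "receipt"}))
        (d / "receipt_a.png").write_bytes(b"")
        (d / "receipt_a.pdf").write_bytes(b"")
        (d / "orphan.gt.json").write_text("{}")  # no matching document: skipped
        return pair_documents(d)

result = demo()
```

When both a `.pdf` and a `.png` exist for the same id, the `.pdf` is chosen because it comes first in `DOC_EXTS`; a ground-truth file without any document is dropped rather than raising.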

backend/evals/scorers.py ADDED

	@@ -0,0 +1,166 @@

+"""Scorers: turn (prediction, ground_truth) into the metrics in docs/EVALS.md."""
+from __future__ import annotations
+import re
+from typing import Any
+# Fields that are scalars we compare for exact match (per doc type).
+SCALAR_FIELDS = {
+    "invoice": ["invoice_number", "issue_date", "due_date", "vendor_name",
+                "bill_to_name", "currency", "subtotal", "tax_amount", "total"],
+    "purchase_order": ["order_number", "order_date", "delivery_date", "vendor_name",
+                       "buyer_name", "ship_to", "currency", "subtotal", "tax_amount",
+                       "total", "payment_terms"],
+    "contract": ["contract_number", "title", "party_a", "party_b", "effective_date",
+                 "expiration_date", "contract_value", "currency", "governing_law",
+                 "auto_renew", "termination_notice_days"],
+    "receipt": ["merchant", "date", "currency", "subtotal", "tax_amount", "total",
+                "payment_method"],
+    "subscription_memo": ["memo_number", "subscription_name", "vendor_name", "account_id",
+                          "plan", "billing_cycle", "start_date", "renewal_date", "amount",
+                          "currency", "auto_renew", "status"],
+}
+def _norm_scalar(v: Any) -> str:
+    if v is None:
+        return ""
+    if isinstance(v, bool):
+        return str(v).lower()
+    if isinstance(v, (int, float)):
+        return f"{float(v):.2f}"
+    s = str(v).strip().lower()
+    s = re.sub(r"[\s,]+", " ", s)
+    # strip currency symbols for value comparison
+    s = s.replace("$", "").replace("€", "").replace("£", "")
+    return s.strip()
+def _tokens(v: Any) -> list[str]:
+    return [t for t in re.split(r"\s+", _norm_scalar(v)) if t]
+def field_scores(pred: dict, gt: dict, doc_type: str) -> dict:
+    """Per-field exact match + aggregate token F1 over scalar fields."""
+    fields = [f for f in SCALAR_FIELDS.get(doc_type, []) if f in gt]
+    exact = 0
+    tp = fp = fn = 0
+    per_field = {}
+    for f in fields:
+        p, g = pred.get(f), gt.get(f)
+        is_exact = _norm_scalar(p) == _norm_scalar(g) and _norm_scalar(g) != ""
+        if _norm_scalar(g) == "":  # gt empty/None — skip from denominator
+            continue
+        exact += int(is_exact)
+        per_field[f] = {"exact": is_exact, "pred": p, "gt": g}
+        pt, gtok = set(_tokens(p)), set(_tokens(g))
+        tp += len(pt & gtok)
+        fp += len(pt - gtok)
+        fn += len(gtok - pt)
+    n = len([f for f in fields if _norm_scalar(gt.get(f)) != ""])
+    prec = tp / (tp + fp) if (tp + fp) else 0.0
+    rec = tp / (tp + fn) if (tp + fn) else 0.0
+    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
+    return {
+        "fields_scored": n,
+        "exact_match": round(exact / n, 3) if n else 0.0,
+        "f1": round(f1, 3),
+        "precision": round(prec, 3),
+        "recall": round(rec, 3),
+        "per_field": per_field,
+    }
+def line_item_f1(pred: dict, gt: dict) -> dict:
+    gi = gt.get("line_items") or []
+    pi = pred.get("line_items") or []
+    if not gi:
+        return {"applicable": False, "f1": None}
+    def key(it):
+        return (
+            _norm_scalar(it.get("description"))[:20],
+            f"{float(it.get('quantity', 0) or 0):.1f}",
+            f"{float(it.get('unit_price', 0) or 0):.2f}",
+        )
+    gset = [key(x) for x in gi]
+    pset = [key(x) for x in pi]
+    matched = 0
+    gpool = list(gset)
+    for k in pset:
+        if k in gpool:
+            matched += 1
+            gpool.remove(k)
+    prec = matched / len(pset) if pset else 0.0
+    rec = matched / len(gset) if gset else 0.0
+    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
+    return {"applicable": True, "f1": round(f1, 3), "matched": matched,
+            "pred_n": len(pset), "gt_n": len(gset)}
+def financial_consistency(pred: dict, doc_type: str) -> bool:
+    from app.schemas import validate_financials
+    vr = validate_financials(pred, doc_type)
+    return vr.checks.get("totals_balance", True) and vr.checks.get("line_items_sum", True)
+def score_document(pred: dict, gt: dict) -> dict:
+    doc_type = gt.get("doc_type") or gt.get("_meta", {}).get("doc_type", "invoice")
+    fs = field_scores(pred, gt, doc_type)
+    li = line_item_f1(pred, gt)
+    fin = financial_consistency(pred, doc_type)
+    return {
+        "doc_type": doc_type,
+        "exact_match": fs["exact_match"],
+        "field_f1": fs["f1"],
+        "precision": fs["precision"],
+        "recall": fs["recall"],
+        "fields_scored": fs["fields_scored"],
+        "line_item_f1": li["f1"],
+        "line_items_applicable": li["applicable"],
+        "financial_consistent": fin,
+        "per_field": fs["per_field"],
+    }
+def aggregate(results: list[dict]) -> dict:
+    """Aggregate per-document scores overall + by type + by difficulty."""
+    def avg(vals):
+        vals = [v for v in vals if v is not None]
+        return round(sum(vals) / len(vals), 3) if vals else None
+    overall = {
+        "documents": len(results),
+        "exact_match": avg([r["score"]["exact_match"] for r in results]),
+        "field_f1": avg([r["score"]["field_f1"] for r in results]),
+        "line_item_f1": avg([r["score"]["line_item_f1"] for r in results
+                             if r["score"]["line_items_applicable"]]),
+        "financial_consistency_rate": avg(
+            [1.0 if r["score"]["financial_consistent"] else 0.0 for r in results]
+        ),
+        "doc_type_accuracy": avg(
+            [1.0 if r["predicted_type"] == r["score"]["doc_type"] else 0.0 for r in results]
+        ),
+    }
+    by_type: dict[str, list] = {}
+    by_diff: dict[str, list] = {}
+    for r in results:
+        by_type.setdefault(r["score"]["doc_type"], []).append(r)
+        by_diff.setdefault(r.get("difficulty", "n/a"), []).append(r)
+    def group(g):
+        return {
+            "documents": len(g),
+            "exact_match": avg([x["score"]["exact_match"] for x in g]),
+            "field_f1": avg([x["score"]["field_f1"] for x in g]),
+            "financial_consistency_rate": avg(
+                [1.0 if x["score"]["financial_consistent"] else 0.0 for x in g]),
+        }
+    return {
+        "overall": overall,
+        "by_type": {k: group(v) for k, v in by_type.items()},
+        "by_difficulty": {k: group(v) for k, v in by_diff.items()},
+    }
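
`line_item_f1` matches predicted line items against ground truth as a multiset: each ground-truth key can be consumed at most once, so a duplicated prediction does not earn credit twice. The sketch below isolates that matching step on prebuilt key tuples; `match_line_items` and the sample keys are illustrative, not part of the repository.

```python
def match_line_items(pred_keys, gt_keys):
    """Greedy one-to-one matching of predicted keys against ground-truth keys."""
    pool = list(gt_keys)
    matched = 0
    for k in pred_keys:
        if k in pool:
            matched += 1
            pool.remove(k)  # consume the ground-truth item so it cannot match again
    prec = matched / len(pred_keys) if pred_keys else 0.0
    rec = matched / len(gt_keys) if gt_keys else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return matched, round(prec, 3), round(rec, 3), round(f1, 3)

# Keys follow the (description[:20], quantity, unit_price) shape used by the scorer.
gt_keys = [
    ("widget", "2.0", "5.00"),
    ("widget", "2.0", "5.00"),
    ("bolt", "10.0", "0.25"),
]
pred_keys = [
    ("widget", "2.0", "5.00"),
    ("widget", "2.0", "5.00"),
    ("widget", "2.0", "5.00"),  # third copy has nothing left to match
]

matched, prec, rec, f1 = match_line_items(pred_keys, gt_keys)
```

Here two of the three predictions match and the `bolt` line is missed, so precision, recall and F1 all come out at 2/3. Because the key truncates the description to 20 characters and compares quantity and price as formatted strings, a single OCR error in a price is enough to turn a match into one false positive plus one false negative.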

backend/finetune/erp_finetune_report.json ADDED

	@@ -0,0 +1,106 @@

+{
+  "kind": "offline-domain-adaptation",
+  "model": "ERP-NLQ-router (softmax over hashed n-grams, numpy)",
+  "note": "Offline CPU demo of the training loop + eval on the SAME dataset the MiniCPM3-4B LoRA recipe consumes. Trains the NLQ routing head that sits in front of the small model; production fine-tune = OpenBMB MiniCPM3-4B LoRA.",
+  "dataset_size": 120,
+  "train": 96,
+  "test": 24,
+  "n_classes": 10,
+  "trainable_params": 40970,
+  "epochs": 400,
+  "before_test_accuracy": 0.083,
+  "after_test_accuracy": 0.917,
+  "accuracy_gain": 0.833,
+  "routed_sql_exec_rate": 1.0,
+  "loss_curve": [
+    2.3042,
+    2.1773,
+    2.0612,
+    1.9523,
+    1.8493,
+    1.752,
+    1.6601,
+    1.5736,
+    1.4922,
+    1.4159,
+    1.3444,
+    1.2776,
+    1.2152,
+    1.1569,
+    1.1026,
+    1.052,
+    1.0048,
+    0.9607,
+    0.9197,
+    0.8813,
+    0.8456,
+    0.8122,
+    0.7809,
+    0.7517,
+    0.7243,
+    0.6987,
+    0.6746,
+    0.6521,
+    0.6308,
+    0.6109,
+    0.5921,
+    0.5744,
+    0.5577,
+    0.542,
+    0.5271,
+    0.513,
+    0.4997,
+    0.4871,
+    0.4752,
+    0.4638
+  ],
+  "final_loss": 0.4541,
+  "labels": [
+    "spend_by_month",
+    "top_vendors",
+    "late_vendors",
+    "late_rate",
+    "spend_by_category",
+    "why_q2",
+    "below_reorder",
+    "open_invoices",
+    "return_reasons",
+    "ap_health"
+  ],
+  "backend": "local",
+  "dataset_jsonl": "backend/finetune/erp_sft.jsonl",
+  "production_recipe": {
+    "base_model": "openbmb/MiniCPM3-4B",
+    "method": "LoRA (PEFT) supervised fine-tuning (TRL SFTTrainer)",
+    "dataset": "backend/finetune/erp_sft.jsonl",
+    "prompt_template": "{instruction}\n\nERP question: {input}\nSQL:",
+    "hyperparams": {
+      "lora_r": 16,
+      "lora_alpha": 32,
+      "lora_dropout": 0.05,
+      "target_modules": [
+        "q_proj",
+        "k_proj",
+        "v_proj",
+        "o_proj"
+      ],
+      "learning_rate": 0.0002,
+      "num_train_epochs": 3,
+      "per_device_train_batch_size": 8,
+      "gradient_accumulation_steps": 2,
+      "max_seq_length": 1024,
+      "bf16": true
+    },
+    "command": "python scripts/finetune_erp.py --backend hf",
+    "requirements": [
+      "torch",
+      "transformers>=4.44",
+      "peft",
+      "trl",
+      "accelerate",
+      "datasets"
+    ]
+  },
+  "base_model_for_production": "openbmb/MiniCPM3-4B",
+  "generated_at": 1781324653.616286
+}
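
Two numbers in this report can be sanity-checked with a small sketch of a softmax router over hashed n-grams. The report does not state the feature dimension, the n-gram scheme, or the initialisation, so everything below is an assumption: 4096 hash buckets is inferred only from `trainable_params` (40970 = 10 × 4097), and word unigrams and bigrams with zero-initialised weights are chosen purely for illustration.

```python
import math
import zlib

N_CLASSES = 10    # "n_classes" in the report
N_BUCKETS = 4096  # assumption: inferred from trainable_params, not stated in the report

def hashed_ngrams(text, n_buckets=N_BUCKETS):
    """Hash word unigrams and bigrams into a sparse {bucket: count} vector."""
    words = text.lower().split()
    grams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    feats = {}
    for g in grams:
        idx = zlib.crc32(g.encode("utf-8")) % n_buckets
        feats[idx] = feats.get(idx, 0.0) + 1.0
    return feats

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(feats, weights, bias):
    """Linear layer (sparse weights per class) followed by softmax."""
    logits = [
        bias[c] + sum(weights[c].get(i, 0.0) * v for i, v in feats.items())
        for c in range(len(bias))
    ]
    return softmax(logits)

# Zero-initialised model: every class gets probability 1/10.
weights = [dict() for _ in range(N_CLASSES)]
bias = [0.0] * N_CLASSES
probs = predict_proba(hashed_ngrams("Show monthly spend."), weights, bias)

initial_loss = -math.log(probs[0])         # cross-entropy of a uniform guess: ln(10)
param_count = N_CLASSES * (N_BUCKETS + 1)  # weight matrix plus one bias per class
```

Under these assumptions the parameter count equals the reported 40970, and the uniform-start loss of ln(10) ≈ 2.3026 sits close to the first `loss_curve` value of 2.3042. The reported `before_test_accuracy` of 0.083 (2 of 24) is likewise near the 10% chance level for ten classes.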

backend/finetune/erp_sft.jsonl ADDED

	@@ -0,0 +1,120 @@

+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Why is Q2 spend up?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Summarize our AP position.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Vendors most often paid after due date?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How much do we owe in unpaid invoices?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How much did we invoice each month?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Spend grouped by category.", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What drove the Q2 spend increase?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which products are low on stock?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "State of our AP overall?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What was total invoiced spend by month?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show monthly spend.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Categories ranked by spend.", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Where does spend go by category?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Total open AP balance?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Share of late payments?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Rank vendors by spend.", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Overall on-time vs late rate?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "List our biggest suppliers.", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Who are our worst late-paying vendors?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Return reason breakdown by money?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What should we reorder?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which items risk stockout?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which vendors get the most of our money?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What are the leading causes of returns?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show spend by category.", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Late-payment ratio across all invoices?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's running low in inventory?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Biggest return reasons by refund value?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How healthy are our payables?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which suppliers have payment delays?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which return reasons cost the most?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Break spend down per month.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Category spend totals?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Percentage of overdue payments overall?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "AP health check please.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which categories cost the most?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Monthly AP spend totals?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What caused the second-quarter cost jump?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Who keeps getting paid past terms?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Account for the Q2 2026 spend surge.", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's the state of accounts payable?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Our overall late payment percentage?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Inventory below reorder point?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What is the total value of open invoices?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Top suppliers ranked by spend.", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "List vendors by late-payment count.", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Late payers among our vendors?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Total spend for each category?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What is the late-payment rate overall?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How are we doing on accounts payable?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show returns grouped by reason.", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Explain the spend spike in Q2.", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's our late payment rate?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Break down returns by reason.", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Outstanding invoice value?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "List SKUs below reorder threshold.", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Give an executive AP summary.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Total of invoices not yet paid?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Why did spend rise in Q2 2026?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's our outstanding payables total?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Worst offenders for late payment?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Our highest-spend vendors?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How much do we spend per product category?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Why are products being returned?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Reason for higher spending in Q2 2026?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Vendors with the most overdue payments?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How much is still open in payables?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Return reasons ranked by refunds?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Overall accounts payable summary?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Give the global late payment percentage.", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which vendors paid late most often?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which vendors do we spend the most with?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Largest vendors please.", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "SKUs needing replenishment?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How often do we pay late, as a rate?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Value of unpaid invoices?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How much AP is still open?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Drivers of the Q2 spend rise?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show the five largest vendors.", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What drives our refunds?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What percent of invoices are paid late?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which suppliers cost us the most?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Explain why Q2 spend went up.", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Spend by period please.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Open invoice liability?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Overview of payables health?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Summarize payables status.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Sum of open invoices?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What needs replenishing?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Top return reasons by refund amount?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Give me an AP health overview.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Unpaid invoice amount overall?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Stock positions under reorder point?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show vendors with frequent late payments.", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Top vendors by total spend?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which SKUs are below reorder point?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Summarize accounts payable health.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Spend per category please.", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Why did costs climb in Q2 2026?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Most costly return reasons?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which vendors are habitually overdue?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's our spend over the months?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Break down spend across categories.", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's our category spend mix?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How has spend trended month to month?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Monthly invoiced spend trend?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which suppliers do we pay late?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Why was Q2 so expensive?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Total spend grouped by month.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show items under their reorder level.", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How bad is our late-payment rate?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Who are the top 5 vendors by spend?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's behind the Q2 increase?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Refund totals per return reason?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Plot spend per month.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Give me the month-by-month spend.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which products fell below reorder?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Category-level spend breakdown?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Biggest vendors by invoice value?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What fraction of payments miss terms?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
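Review note: the `late_rate` and `late_vendors` outputs rely on `CAST(substr(v.payment_terms,5) AS INT)`, which only works if `payment_terms` is stored in a form like `Net 30`, so that the number starts at character 5. The sketch below runs the `late_rate` query from the dataset against an in-memory SQLite database. The schema and rows are hypothetical: they contain only the columns the query touches, plus made-up sample data.

```python
import sqlite3

# Hypothetical minimal schema and sample rows; the real warehouse is not shown in this diff.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vendors (vendor_id INTEGER PRIMARY KEY, name TEXT, payment_terms TEXT);
CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, vendor_id INTEGER,
                       total REAL, status TEXT, days_to_pay INTEGER);
INSERT INTO vendors VALUES (1, 'Acme', 'Net 30'), (2, 'Bolt', 'Net 45');
INSERT INTO invoices VALUES
  (1, 1, 100.0, 'paid', 25),
  (2, 1, 200.0, 'paid', 40),
  (3, 2,  50.0, 'paid', 50),
  (4, 2,  75.0, 'paid', 30),
  (5, 1,  10.0, 'open', NULL);
""")

# Query text copied from the "late_rate" template in erp_sft.jsonl.
LATE_RATE_SQL = (
    "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > "
    "CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, "
    "COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id "
    "WHERE i.status='paid'"
)

# Invoices 2 (40 > 30) and 3 (50 > 45) are late, so 2 of the 4 paid invoices.
late_pct, paid_invoices = conn.execute(LATE_RATE_SQL).fetchone()
print(late_pct, paid_invoices)
```

If `payment_terms` ever holds a different shape (for example `NET30` or `30 days`), the `substr(...,5)` offset would need to change, so that format is worth pinning down in the schema.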

backend/finetune/runs/hf_20260612T212346.json ADDED

	@@ -0,0 +1,120 @@

+{
+  "backend": "hf",
+  "ran": false,
+  "reason": "training stack unavailable (No module named 'torch')",
+  "recipe": {
+    "base_model": "openbmb/MiniCPM3-4B",
+    "method": "LoRA (PEFT) supervised fine-tuning (TRL SFTTrainer)",
+    "dataset": "backend/finetune/erp_sft.jsonl",
+    "prompt_template": "{instruction}\n\nERP question: {input}\nSQL:",
+    "hyperparams": {
+      "lora_r": 16,
+      "lora_alpha": 32,
+      "lora_dropout": 0.05,
+      "target_modules": [
+        "q_proj",
+        "k_proj",
+        "v_proj",
+        "o_proj"
+      ],
+      "learning_rate": 0.0002,
+      "num_train_epochs": 3,
+      "per_device_train_batch_size": 8,
+      "gradient_accumulation_steps": 2,
+      "max_seq_length": 1024,
+      "bf16": true
+    },
+    "command": "python scripts/finetune_erp.py --backend hf",
+    "requirements": [
+      "torch",
+      "transformers>=4.44",
+      "peft",
+      "trl",
+      "accelerate",
+      "datasets"
+    ]
+  },
+  "note": "Dataset + recipe are ready; launch on a GPU box to fine-tune MiniCPM3-4B.",
+  "offline_demo": {
+    "kind": "offline-domain-adaptation",
+    "model": "ERP-NLQ-router (softmax over hashed n-grams, numpy)",
+    "note": "Offline CPU demo of the training loop + eval on the SAME dataset the MiniCPM3-4B LoRA recipe consumes. Trains the NLQ routing head that sits in front of the small model; production fine-tune = OpenBMB MiniCPM3-4B LoRA.",
+    "dataset_size": 120,
+    "train": 96,
+    "test": 24,
+    "n_classes": 10,
+    "trainable_params": 40970,
+    "epochs": 50,
+    "before_test_accuracy": 0.083,
+    "after_test_accuracy": 0.583,
+    "accuracy_gain": 0.5,
+    "routed_sql_exec_rate": 1.0,
+    "loss_curve": [
+      2.3042,
+      2.2908,
+      2.2776,
+      2.2646,
+      2.2517,
+      2.239,
+      2.2264,
+      2.214,
+      2.2016,
+      2.1894,
+      2.1773,
+      2.1653,
+      2.1534,
+      2.1416,
+      2.1299,
+      2.1183,
+      2.1067,
+      2.0952,
+      2.0838,
+      2.0725,
+      2.0612,
+      2.0501,
+      2.0389,
+      2.0279,
+      2.0169,
+      2.006,
+      1.9951,
+      1.9843,
+      1.9736,
+      1.9629,
+      1.9523,
+      1.9417,
+      1.9312,
+      1.9208,
+      1.9104,
+      1.9001,
+      1.8898,
+      1.8796,
+      1.8695,
+      1.8594,
+      1.8493,
+      1.8394,
+      1.8294,
+      1.8196,
+      1.8097,
+      1.8,
+      1.7903,
+      1.7806,
+      1.771,
+      1.7615
+    ],
+    "final_loss": 1.7615,
+    "labels": [
+      "spend_by_month",
+      "top_vendors",
+      "late_vendors",
+      "late_rate",
+      "spend_by_category",
+      "why_q2",
+      "below_reorder",
+      "open_invoices",
+      "return_reasons",
+      "ap_health"
+    ]
+  },
+  "base_model_for_production": "openbmb/MiniCPM3-4B",
+  "generated_at": 1781324626.2022731
+}
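Review note: the recipe's `prompt_template` uses only the `instruction` and `input` fields of each JSONL record, leaving `output` as the target text. A minimal sketch of that rendering follows. The record is copied from `erp_sft.jsonl`; the `build_prompt` helper is hypothetical and is not taken from `scripts/finetune_erp.py`.

```python
# Template string as recorded under "recipe.prompt_template".
PROMPT_TEMPLATE = "{instruction}\n\nERP question: {input}\nSQL:"

# One record from erp_sft.jsonl (top_vendors template).
record = {
    "task": "nlq",
    "intent": "analytics",
    "template": "top_vendors",
    "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.",
    "input": "Who are the top 5 vendors by spend?",
    "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices "
              "FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id "
              "GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5",
}


def build_prompt(rec: dict) -> str:
    """Hypothetical helper: fill the template from a dataset record."""
    # str.format ignores the extra keys (task, intent, template, output).
    return PROMPT_TEMPLATE.format(**rec)


prompt = build_prompt(record)
target = record["output"]  # the text the model is trained to produce after "SQL:"
print(prompt)
print(target)
```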

backend/finetune/runs/local_20260612T212257.json ADDED

	@@ -0,0 +1,108 @@

+{
+  "kind": "offline-domain-adaptation",
+  "model": "ERP-NLQ-router (softmax over hashed n-grams, numpy)",
+  "note": "Offline CPU demo of the training loop + eval on the SAME dataset the MiniCPM3-4B LoRA recipe consumes. Trains the NLQ routing head that sits in front of the small model; production fine-tune = OpenBMB MiniCPM3-4B LoRA.",
+  "dataset_size": 120,
+  "train": 96,
+  "test": 24,
+  "n_classes": 10,
+  "trainable_params": 40970,
+  "epochs": 250,
+  "before_test_accuracy": 0.0,
+  "after_test_accuracy": 0.875,
+  "accuracy_gain": 0.875,
+  "routed_sql_exec_rate": 1.0,
+  "loss_curve": [
+    2.3026,
+    2.2249,
+    2.1519,
+    2.0824,
+    2.0155,
+    1.9509,
+    1.8884,
+    1.828,
+    1.7697,
+    1.7132,
+    1.6588,
+    1.6062,
+    1.5555,
+    1.5067,
+    1.4598,
+    1.4146,
+    1.3711,
+    1.3294,
+    1.2893,
+    1.2509,
+    1.2139,
+    1.1785,
+    1.1445,
+    1.112,
+    1.0807,
+    1.0508,
+    1.0221,
+    0.9945,
+    0.9681,
+    0.9428,
+    0.9185,
+    0.8952,
+    0.8729,
+    0.8514,
+    0.8308,
+    0.8111,
+    0.7921,
+    0.7738,
+    0.7563,
+    0.7395,
+    0.7233,
+    0.7077
+  ],
+  "final_loss": 0.7002,
+  "labels": [
+    "spend_by_month",
+    "top_vendors",
+    "late_vendors",
+    "late_rate",
+    "spend_by_category",
+    "why_q2",
+    "below_reorder",
+    "open_invoices",
+    "return_reasons",
+    "ap_health"
+  ],
+  "backend": "local",
+  "dataset_jsonl": "backend/finetune/erp_sft.jsonl",
+  "production_recipe": {
+    "base_model": "openbmb/MiniCPM3-4B",
+    "method": "LoRA (PEFT) supervised fine-tuning (TRL SFTTrainer)",
+    "dataset": "backend/finetune/erp_sft.jsonl",
+    "prompt_template": "{instruction}\n\nERP question: {input}\nSQL:",
+    "hyperparams": {
+      "lora_r": 16,
+      "lora_alpha": 32,
+      "lora_dropout": 0.05,
+      "target_modules": [
+        "q_proj",
+        "k_proj",
+        "v_proj",
+        "o_proj"
+      ],
+      "learning_rate": 0.0002,
+      "num_train_epochs": 3,
+      "per_device_train_batch_size": 8,
+      "gradient_accumulation_steps": 2,
+      "max_seq_length": 1024,
+      "bf16": true
+    },
+    "command": "python scripts/finetune_erp.py --backend hf",
+    "requirements": [
+      "torch",
+      "transformers>=4.44",
+      "peft",
+      "trl",
+      "accelerate",
+      "datasets"
+    ]
+  },
+  "base_model_for_production": "openbmb/MiniCPM3-4B",
+  "generated_at": 1781324577.1492128
+}
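Review note: two figures in this report can be sanity-checked by hand. `trainable_params: 40970` factors as 10 × (4096 + 1), which would fit a linear softmax head with 4096 hash buckets and one bias per class, and the first loss value 2.3026 equals ln(10), the cross-entropy of a uniform prediction over 10 classes. The sketch below illustrates both using only the standard library. The bucket count, the MD5-based featurizer and the zero initialisation are assumptions made for illustration; the actual router in `backend/app/erp/finetune.py` is not shown in this section and uses numpy.

```python
import hashlib
import math

N_CLASSES = 10    # "n_classes" in the report
N_BUCKETS = 4096  # assumption: inferred from 40970 = 10 * (4096 + 1)


def hashed_ngrams(text, n_buckets=N_BUCKETS):
    """Hypothetical featurizer: bucket counts for word unigrams and bigrams."""
    words = text.lower().split()
    grams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    feats = {}
    for gram in grams:
        idx = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % n_buckets
        feats[idx] = feats.get(idx, 0.0) + 1.0
    return feats


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Zero-initialised weights (one row per class) plus one bias per class.
weights = [[0.0] * N_BUCKETS for _ in range(N_CLASSES)]
bias = [0.0] * N_CLASSES
trainable_params = N_CLASSES * N_BUCKETS + N_CLASSES

feats = hashed_ngrams("Who are the top 5 vendors by spend?")
logits = [bias[c] + sum(weights[c][i] * v for i, v in feats.items())
          for c in range(N_CLASSES)]
probs = softmax(logits)

# Cross-entropy for the true class; index 1 is "top_vendors" in the labels list.
initial_loss = -math.log(probs[1])
print(trainable_params, round(initial_loss, 4))
```

The other runs start at 2.3042 rather than 2.3026, which suggests a small non-zero initialisation or a different seed there; the diff does not say which.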

backend/finetune/runs/local_20260612T212332.json ADDED

	@@ -0,0 +1,106 @@

+{
+  "kind": "offline-domain-adaptation",
+  "model": "ERP-NLQ-router (softmax over hashed n-grams, numpy)",
+  "note": "Offline CPU demo of the training loop + eval on the SAME dataset the MiniCPM3-4B LoRA recipe consumes. Trains the NLQ routing head that sits in front of the small model; production fine-tune = OpenBMB MiniCPM3-4B LoRA.",
+  "dataset_size": 120,
+  "train": 96,
+  "test": 24,
+  "n_classes": 10,
+  "trainable_params": 40970,
+  "epochs": 400,
+  "before_test_accuracy": 0.083,
+  "after_test_accuracy": 0.917,
+  "accuracy_gain": 0.833,
+  "routed_sql_exec_rate": 1.0,
+  "loss_curve": [
+    2.3042,
+    2.1773,
+    2.0612,
+    1.9523,
+    1.8493,
+    1.752,
+    1.6601,
+    1.5736,
+    1.4922,
+    1.4159,
+    1.3444,
+    1.2776,
+    1.2152,
+    1.1569,
+    1.1026,
+    1.052,
+    1.0048,
+    0.9607,
+    0.9197,
+    0.8813,
+    0.8456,
+    0.8122,
+    0.7809,
+    0.7517,
+    0.7243,
+    0.6987,
+    0.6746,
+    0.6521,
+    0.6308,
+    0.6109,
+    0.5921,
+    0.5744,
+    0.5577,
+    0.542,
+    0.5271,
+    0.513,
+    0.4997,
+    0.4871,
+    0.4752,
+    0.4638
+  ],
+  "final_loss": 0.4541,
+  "labels": [
+    "spend_by_month",
+    "top_vendors",
+    "late_vendors",
+    "late_rate",
+    "spend_by_category",
+    "why_q2",
+    "below_reorder",
+    "open_invoices",
+    "return_reasons",
+    "ap_health"
+  ],
+  "backend": "local",
+  "dataset_jsonl": "backend/finetune/erp_sft.jsonl",
+  "production_recipe": {
+    "base_model": "openbmb/MiniCPM3-4B",
+    "method": "LoRA (PEFT) supervised fine-tuning (TRL SFTTrainer)",
+    "dataset": "backend/finetune/erp_sft.jsonl",
+    "prompt_template": "{instruction}\n\nERP question: {input}\nSQL:",
+    "hyperparams": {
+      "lora_r": 16,
+      "lora_alpha": 32,
+      "lora_dropout": 0.05,
+      "target_modules": [
+        "q_proj",
+        "k_proj",
+        "v_proj",
+        "o_proj"
+      ],
+      "learning_rate": 0.0002,
+      "num_train_epochs": 3,
+      "per_device_train_batch_size": 8,
+      "gradient_accumulation_steps": 2,
+      "max_seq_length": 1024,
+      "bf16": true
+    },
+    "command": "python scripts/finetune_erp.py --backend hf",
+    "requirements": [
+      "torch",
+      "transformers>=4.44",
+      "peft",
+      "trl",
+      "accelerate",
+      "datasets"
+    ]
+  },
+  "base_model_for_production": "openbmb/MiniCPM3-4B",
+  "generated_at": 1781324612.290891
+}

backend/finetune/runs/local_20260612T212357.json ADDED

	@@ -0,0 +1,108 @@

+{
+  "kind": "offline-domain-adaptation",
+  "model": "ERP-NLQ-router (softmax over hashed n-grams, numpy)",
+  "note": "Offline CPU demo of the training loop + eval on the SAME dataset the MiniCPM3-4B LoRA recipe consumes. Trains the NLQ routing head that sits in front of the small model; production fine-tune = OpenBMB MiniCPM3-4B LoRA.",
+  "dataset_size": 120,
+  "train": 96,
+  "test": 24,
+  "n_classes": 10,
+  "trainable_params": 40970,
+  "epochs": 250,
+  "before_test_accuracy": 0.083,
+  "after_test_accuracy": 0.875,
+  "accuracy_gain": 0.792,
+  "routed_sql_exec_rate": 1.0,
+  "loss_curve": [
+    2.3042,
+    2.2264,
+    2.1534,
+    2.0838,
+    2.0169,
+    1.9523,
+    1.8898,
+    1.8294,
+    1.771,
+    1.7146,
+    1.6601,
+    1.6076,
+    1.5569,
+    1.5081,
+    1.4611,
+    1.4159,
+    1.3724,
+    1.3307,
+    1.2906,
+    1.2521,
+    1.2152,
+    1.1798,
+    1.1458,
+    1.1132,
+    1.0819,
+    1.052,
+    1.0232,
+    0.9957,
+    0.9693,
+    0.944,
+    0.9197,
+    0.8964,
+    0.874,
+    0.8525,
+    0.8319,
+    0.8122,
+    0.7932,
+    0.7749,
+    0.7574,
+    0.7405,
+    0.7243,
+    0.7087
+  ],
+  "final_loss": 0.7012,
+  "labels": [
+    "spend_by_month",
+    "top_vendors",
+    "late_vendors",
+    "late_rate",
+    "spend_by_category",
+    "why_q2",
+    "below_reorder",
+    "open_invoices",
+    "return_reasons",
+    "ap_health"
+  ],
+  "backend": "local",
+  "dataset_jsonl": "backend/finetune/erp_sft.jsonl",
+  "production_recipe": {
+    "base_model": "openbmb/MiniCPM3-4B",
+    "method": "LoRA (PEFT) supervised fine-tuning (TRL SFTTrainer)",
+    "dataset": "backend/finetune/erp_sft.jsonl",
+    "prompt_template": "{instruction}\n\nERP question: {input}\nSQL:",
+    "hyperparams": {
+      "lora_r": 16,
+      "lora_alpha": 32,
+      "lora_dropout": 0.05,
+      "target_modules": [
+        "q_proj",
+        "k_proj",
+        "v_proj",
+        "o_proj"
+      ],
+      "learning_rate": 0.0002,
+      "num_train_epochs": 3,
+      "per_device_train_batch_size": 8,
+      "gradient_accumulation_steps": 2,
+      "max_seq_length": 1024,
+      "bf16": true
+    },
+    "command": "python scripts/finetune_erp.py --backend hf",
+    "requirements": [
+      "torch",
+      "transformers>=4.44",
+      "peft",
+      "trl",
+      "accelerate",
+      "datasets"
+    ]
+  },
+  "base_model_for_production": "openbmb/MiniCPM3-4B",
+  "generated_at": 1781324637.309176
+}

backend/finetune/runs/local_20260612T212413.json ADDED

	@@ -0,0 +1,106 @@

+{
+  "kind": "offline-domain-adaptation",
+  "model": "ERP-NLQ-router (softmax over hashed n-grams, numpy)",
+  "note": "Offline CPU demo of the training loop + eval on the SAME dataset the MiniCPM3-4B LoRA recipe consumes. Trains the NLQ routing head that sits in front of the small model; production fine-tune = OpenBMB MiniCPM3-4B LoRA.",
+  "dataset_size": 120,
+  "train": 96,
+  "test": 24,
+  "n_classes": 10,
+  "trainable_params": 40970,
+  "epochs": 400,
+  "before_test_accuracy": 0.083,
+  "after_test_accuracy": 0.917,
+  "accuracy_gain": 0.833,
+  "routed_sql_exec_rate": 1.0,
+  "loss_curve": [
+    2.3042,
+    2.1773,
+    2.0612,
+    1.9523,
+    1.8493,
+    1.752,
+    1.6601,
+    1.5736,
+    1.4922,
+    1.4159,
+    1.3444,
+    1.2776,
+    1.2152,
+    1.1569,
+    1.1026,
+    1.052,
+    1.0048,
+    0.9607,
+    0.9197,
+    0.8813,
+    0.8456,
+    0.8122,
+    0.7809,
+    0.7517,
+    0.7243,
+    0.6987,
+    0.6746,
+    0.6521,
+    0.6308,
+    0.6109,
+    0.5921,
+    0.5744,
+    0.5577,
+    0.542,
+    0.5271,
+    0.513,
+    0.4997,
+    0.4871,
+    0.4752,
+    0.4638
+  ],
+  "final_loss": 0.4541,
+  "labels": [
+    "spend_by_month",
+    "top_vendors",
+    "late_vendors",
+    "late_rate",
+    "spend_by_category",
+    "why_q2",
+    "below_reorder",
+    "open_invoices",
+    "return_reasons",
+    "ap_health"
+  ],
+  "backend": "local",
+  "dataset_jsonl": "backend/finetune/erp_sft.jsonl",
+  "production_recipe": {
+    "base_model": "openbmb/MiniCPM3-4B",
+    "method": "LoRA (PEFT) supervised fine-tuning (TRL SFTTrainer)",
+    "dataset": "backend/finetune/erp_sft.jsonl",
+    "prompt_template": "{instruction}\n\nERP question: {input}\nSQL:",
+    "hyperparams": {
+      "lora_r": 16,
+      "lora_alpha": 32,
+      "lora_dropout": 0.05,
+      "target_modules": [
+        "q_proj",
+        "k_proj",
+        "v_proj",
+        "o_proj"
+      ],
+      "learning_rate": 0.0002,
+      "num_train_epochs": 3,
+      "per_device_train_batch_size": 8,
+      "gradient_accumulation_steps": 2,
+      "max_seq_length": 1024,
+      "bf16": true
+    },
+    "command": "python scripts/finetune_erp.py --backend hf",
+    "requirements": [
+      "torch",
+      "transformers>=4.44",
+      "peft",
+      "trl",
+      "accelerate",
+      "datasets"
+    ]
+  },
+  "base_model_for_production": "openbmb/MiniCPM3-4B",
+  "generated_at": 1781324653.616286
+}

gradio_app.py CHANGED

@@ -101,6 +101,35 @@ def search(query: str):
             for r in RAG.search(query, k=8)]
+def erp_ask(question: str):
+    """ERP DocIQ: NLQ / analytics / summary / 'why' over the simulated ERP knowledgebase."""
+    from app.erp import ErpChat, get_warehouse
+    if not (question or "").strip():
+        return "Ask about spend, vendors, late payments, inventory or returns.", []
+    chat = ErpChat(S, router=ROUTER, warehouse=get_warehouse(S), metrics=METRICS)
+    r = chat.answer(question)
+    md = (f"**{r['intent']}** · {r['engine']} · {r['model']} · {r['latency_ms']} ms\n\n"
+          f"{r['answer']}\n\n" + (f"```sql\n{r['sql']}\n```" if r.get("sql") else ""))
+    rows = r.get("rows") or []
+    table = [[*(str(v) for v in row)] for row in rows[:12]] if rows else []
+    return md, table
+def _erp_finetune_md() -> str:
+    import json as _json
+    from pathlib import Path
+    p = Path(__file__).resolve().parent / "backend" / "finetune" / "erp_finetune_report.json"
+    if not p.exists():
+        return "_Run `python scripts/finetune_erp.py` to populate fine-tune metrics._"
+    d = _json.loads(p.read_text()); od = d.get("offline_demo") or d
+    return ("### ERP-domain fine-tuning\n"
+            f"- **Production target:** OpenBMB **MiniCPM3-4B** (LoRA recipe emitted)\n"
+            f"- **Offline demo (CPU):** before **{od['before_test_accuracy']*100:.1f}%** → "
+            f"after **{od['after_test_accuracy']*100:.1f}%** "
+            f"(**+{od['accuracy_gain']*100:.0f} pts**) on {od['dataset_size']} examples; "
+            f"routed-SQL exec {od['routed_sql_exec_rate']*100:.0f}%")
 def run_complex_web_automation():
     """Intricate multi-step browser automation: ERP dashboard → Procurement →
     +Create Order → read the complex order-form fields."""
@@ -136,6 +165,21 @@ with gr.Blocks(title="Aperture — Retail Document Intelligence") as demo:
         kpis = gr.Markdown(_kpis_md())
         run_btn.click(run_sample, [sample, backend], [extracted, summary, kpis])
         upload_btn.click(run_upload, [upload, backend], [extracted, summary, kpis])
+    with gr.Tab("ERP DocIQ (chat)"):
+        gr.Markdown("### Ask your ERP reports — NLQ · analytics · summary · reasons\n"
+                    "Natural-language questions over a simulated retail ERP (vendors, POs, invoices, "
+                    "GL, inventory, returns). Figures come from **real SQL**; OpenBMB **MiniCPM3-4B** "
+                    "phrases summaries & explanations and never invents numbers.")
+        erp_q = gr.Textbox(label="Question",
+                           placeholder="e.g. Why did spend rise in Q2 2026?")
+        gr.Examples(["Who are the top 5 vendors by spend?", "What is the late-payment rate overall?",
+                     "Why did spend rise in Q2 2026?", "Summarize accounts payable health.",
+                     "Top return reasons by refund amount?"], inputs=erp_q)
+        erp_btn = gr.Button("💬 Ask ERP DocIQ", variant="primary")
+        erp_answer = gr.Markdown()
+        erp_rows = gr.Dataframe(label="Query result (real SQL over the warehouse)")
+        erp_btn.click(erp_ask, [erp_q], [erp_answer, erp_rows])
+        gr.Markdown(_erp_finetune_md())
     with gr.Tab("Search (RAG)"):
         q = gr.Textbox(label="Query", placeholder="e.g. POS Cloud subscription renewal")
         search_btn = gr.Button("🔍 Search")
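Review note: `_erp_finetune_md` reads metrics with `d.get("offline_demo") or d`, which lets it accept both report shapes seen in this commit: the `hf` run nests the metrics under `offline_demo`, while the `local` runs keep them at the top level. The sketch below isolates that fallback. The `pick_metrics` and `summarize` helpers are hypothetical names used only for illustration; the metric values are copied from the run files in this diff.

```python
def pick_metrics(report: dict) -> dict:
    """Return the nested offline-demo metrics if present, else the report itself."""
    return report.get("offline_demo") or report


def summarize(report: dict) -> str:
    """Format the headline numbers the same way the Gradio markdown does."""
    od = pick_metrics(report)
    return (f"before {od['before_test_accuracy'] * 100:.1f}% -> "
            f"after {od['after_test_accuracy'] * 100:.1f}% "
            f"(+{od['accuracy_gain'] * 100:.0f} pts)")


# Shape of runs/hf_20260612T212346.json: metrics nested under "offline_demo".
hf_shaped = {
    "backend": "hf",
    "ran": False,
    "offline_demo": {
        "before_test_accuracy": 0.083,
        "after_test_accuracy": 0.583,
        "accuracy_gain": 0.5,
    },
}

# Shape of the local runs and erp_finetune_report.json: metrics at top level.
local_shaped = {
    "backend": "local",
    "before_test_accuracy": 0.083,
    "after_test_accuracy": 0.917,
    "accuracy_gain": 0.833,
}

hf_line = summarize(hf_shaped)
local_line = summarize(local_shaped)
print(hf_line)
print(local_line)
```

One consequence of the `or` fallback: an empty `offline_demo` object is falsy, so it would silently fall through to the top-level keys and raise `KeyError` if those are absent.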

results/erp_finetune_report.json ADDED

	@@ -0,0 +1,106 @@

+{
+  "kind": "offline-domain-adaptation",
+  "model": "ERP-NLQ-router (softmax over hashed n-grams, numpy)",
+  "note": "Offline CPU demo of the training loop + eval on the SAME dataset the MiniCPM3-4B LoRA recipe consumes. Trains the NLQ routing head that sits in front of the small model; production fine-tune = OpenBMB MiniCPM3-4B LoRA.",
+  "dataset_size": 120,
+  "train": 96,
+  "test": 24,
+  "n_classes": 10,
+  "trainable_params": 40970,
+  "epochs": 400,
+  "before_test_accuracy": 0.083,
+  "after_test_accuracy": 0.917,
+  "accuracy_gain": 0.833,
+  "routed_sql_exec_rate": 1.0,
+  "loss_curve": [
+    2.3042,
+    2.1773,
+    2.0612,
+    1.9523,
+    1.8493,
+    1.752,
+    1.6601,
+    1.5736,
+    1.4922,
+    1.4159,
+    1.3444,
+    1.2776,
+    1.2152,
+    1.1569,
+    1.1026,
+    1.052,
+    1.0048,
+    0.9607,
+    0.9197,
+    0.8813,
+    0.8456,
+    0.8122,
+    0.7809,
+    0.7517,
+    0.7243,
+    0.6987,
+    0.6746,
+    0.6521,
+    0.6308,
+    0.6109,
+    0.5921,
+    0.5744,
+    0.5577,
+    0.542,
+    0.5271,
+    0.513,
+    0.4997,
+    0.4871,
+    0.4752,
+    0.4638
+  ],
+  "final_loss": 0.4541,
+  "labels": [
+    "spend_by_month",
+    "top_vendors",
+    "late_vendors",
+    "late_rate",
+    "spend_by_category",
+    "why_q2",
+    "below_reorder",
+    "open_invoices",
+    "return_reasons",
+    "ap_health"
+  ],
+  "backend": "local",
+  "dataset_jsonl": "backend/finetune/erp_sft.jsonl",
+  "production_recipe": {
+    "base_model": "openbmb/MiniCPM3-4B",
+    "method": "LoRA (PEFT) supervised fine-tuning (TRL SFTTrainer)",
+    "dataset": "backend/finetune/erp_sft.jsonl",
+    "prompt_template": "{instruction}\n\nERP question: {input}\nSQL:",
+    "hyperparams": {
+      "lora_r": 16,
+      "lora_alpha": 32,
+      "lora_dropout": 0.05,
+      "target_modules": [
+        "q_proj",
+        "k_proj",
+        "v_proj",
+        "o_proj"
+      ],
+      "learning_rate": 0.0002,
+      "num_train_epochs": 3,
+      "per_device_train_batch_size": 8,
+      "gradient_accumulation_steps": 2,
+      "max_seq_length": 1024,
+      "bf16": true
+    },
+    "command": "python scripts/finetune_erp.py --backend hf",
+    "requirements": [
+      "torch",
+      "transformers>=4.44",
+      "peft",
+      "trl",
+      "accelerate",
+      "datasets"
+    ]
+  },
+  "base_model_for_production": "openbmb/MiniCPM3-4B",
+  "generated_at": 1781324653.616286
+}
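Review note: `accuracy_gain` is 0.833 even though 0.917 − 0.083 is 0.834. The figures are consistent if each one is a fraction of the 24 test examples rounded to three decimals independently. The sketch below reproduces that arithmetic. The counts of correct predictions (2 before, 22 after) are inferred from the reported accuracies and do not appear in the report itself.

```python
TEST_SIZE = 24  # "test" in the report


def as_reported(correct: int) -> float:
    """Round a count-based accuracy the way the report appears to (3 decimals)."""
    return round(correct / TEST_SIZE, 3)


# Inferred counts, not present in the report: 2/24 before, 22/24 after.
before_correct, after_correct = 2, 22

before_acc = as_reported(before_correct)
after_acc = as_reported(after_correct)
gain = as_reported(after_correct - before_correct)

# Subtracting the already-rounded accuracies gives a slightly different number.
gain_from_rounded = round(0.917 - 0.083, 3)

print(before_acc, after_acc, gain, gain_from_rounded)
```

The same reading fits the other runs in this commit: 21/24 gives 0.875, 19/24 gives 0.792 and 14/24 gives 0.583.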

results/erp_sft.jsonl ADDED

	@@ -0,0 +1,120 @@

+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Why is Q2 spend up?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Summarize our AP position.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Vendors most often paid after due date?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How much do we owe in unpaid invoices?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How much did we invoice each month?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Spend grouped by category.", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What drove the Q2 spend increase?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which products are low on stock?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "State of our AP overall?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What was total invoiced spend by month?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show monthly spend.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Categories ranked by spend.", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Where does spend go by category?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Total open AP balance?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Share of late payments?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Rank vendors by spend.", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Overall on-time vs late rate?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "List our biggest suppliers.", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Who are our worst late-paying vendors?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Return reason breakdown by money?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What should we reorder?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which items risk stockout?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which vendors get the most of our money?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What are the leading causes of returns?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show spend by category.", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Late-payment ratio across all invoices?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's running low in inventory?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Biggest return reasons by refund value?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How healthy are our payables?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which suppliers have payment delays?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which return reasons cost the most?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Break spend down per month.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Category spend totals?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Percentage of overdue payments overall?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "AP health check please.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which categories cost the most?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Monthly AP spend totals?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What caused the second-quarter cost jump?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Who keeps getting paid past terms?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Account for the Q2 2026 spend surge.", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's the state of accounts payable?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Our overall late payment percentage?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Inventory below reorder point?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What is the total value of open invoices?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Top suppliers ranked by spend.", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "List vendors by late-payment count.", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Late payers among our vendors?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Total spend for each category?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What is the late-payment rate overall?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How are we doing on accounts payable?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show returns grouped by reason.", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Explain the spend spike in Q2.", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's our late payment rate?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Break down returns by reason.", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Outstanding invoice value?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "List SKUs below reorder threshold.", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Give an executive AP summary.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Total of invoices not yet paid?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Why did spend rise in Q2 2026?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's our outstanding payables total?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Worst offenders for late payment?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Our highest-spend vendors?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How much do we spend per product category?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Why are products being returned?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Reason for higher spending in Q2 2026?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Vendors with the most overdue payments?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How much is still open in payables?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Return reasons ranked by refunds?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Overall accounts payable summary?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Give the global late payment percentage.", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which vendors paid late most often?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which vendors do we spend the most with?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Largest vendors please.", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "SKUs needing replenishment?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How often do we pay late, as a rate?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Value of unpaid invoices?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How much AP is still open?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Drivers of the Q2 spend rise?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show the five largest vendors.", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What drives our refunds?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What percent of invoices are paid late?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which suppliers cost us the most?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Explain why Q2 spend went up.", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Spend by period please.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Open invoice liability?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Overview of payables health?", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Summarize payables status.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Sum of open invoices?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What needs replenishing?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Top return reasons by refund amount?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Give me an AP health overview.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "open_invoices", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Unpaid invoice amount overall?", "output": "SELECT ROUND(SUM(total),2) AS open_value, COUNT(*) AS n FROM invoices WHERE status='open'"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Stock positions under reorder point?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show vendors with frequent late payments.", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Top vendors by total spend?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which SKUs are below reorder point?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "summary", "template": "ap_health", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Summarize accounts payable health.", "output": "SELECT (SELECT COUNT(*) FROM invoices) AS invoices, (SELECT COUNT(*) FROM invoices WHERE status='open') AS open_invoices, (SELECT ROUND(AVG(days_to_pay),1) FROM invoices WHERE status='paid') AS avg_days_to_pay"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Spend per category please.", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Why did costs climb in Q2 2026?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Most costly return reasons?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which vendors are habitually overdue?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's our spend over the months?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Break down spend across categories.", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's our category spend mix?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How has spend trended month to month?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Monthly invoiced spend trend?", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "late_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which suppliers do we pay late?", "output": "SELECT v.name, SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END) AS late_invoices, COUNT(i.invoice_id) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid' GROUP BY v.vendor_id HAVING late_invoices > 0 ORDER BY late_invoices DESC LIMIT 5"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Why was Q2 so expensive?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Total spend grouped by month.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Show items under their reorder level.", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "How bad is our late-payment rate?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Who are the top 5 vendors by spend?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "reasons", "template": "why_q2", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What's behind the Q2 increase?", "output": "SELECT period, account, ROUND(SUM(amount),2) AS spend FROM gl_entries WHERE period >= '2026-04' AND period <= '2026-06' GROUP BY period, account ORDER BY period, spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "return_reasons", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Refund totals per return reason?", "output": "SELECT reason, COUNT(*) AS returns, ROUND(SUM(refund_amount),2) AS refunds FROM returns GROUP BY reason ORDER BY refunds DESC"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Plot spend per month.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_month", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Give me the month-by-month spend.", "output": "SELECT period, ROUND(SUM(amount),2) AS spend FROM gl_entries GROUP BY period ORDER BY period"}
+{"task": "nlq", "intent": "analytics", "template": "below_reorder", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Which products fell below reorder?", "output": "SELECT i.sku, p.name, i.region, i.on_hand, i.reorder_point FROM inventory i JOIN products p ON p.sku=i.sku WHERE i.on_hand < i.reorder_point ORDER BY (i.reorder_point - i.on_hand) DESC LIMIT 15"}
+{"task": "nlq", "intent": "analytics", "template": "spend_by_category", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Category-level spend breakdown?", "output": "SELECT p.category, ROUND(SUM(l.line_total),2) AS spend FROM po_lines l JOIN products p ON p.sku=l.sku JOIN purchase_orders po ON po.po_id=l.po_id WHERE po.status!='cancelled' GROUP BY p.category ORDER BY spend DESC"}
+{"task": "nlq", "intent": "analytics", "template": "top_vendors", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "Biggest vendors by invoice value?", "output": "SELECT v.name, ROUND(SUM(i.total),2) AS spend, COUNT(*) AS invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id GROUP BY v.vendor_id ORDER BY spend DESC LIMIT 5"}
+{"task": "nlq", "intent": "analytics", "template": "late_rate", "instruction": "Translate this ERP question into one SQLite SELECT over the warehouse schema.", "input": "What fraction of payments miss terms?", "output": "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v ON v.vendor_id=i.vendor_id WHERE i.status='paid'"}
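
The `late_vendors` and `late_rate` records rely on `CAST(substr(v.payment_terms,5) AS INT)` to pull the day count out of a terms string. Below is a minimal sketch of how that expression behaves, assuming terms are stored as `Net NN` (the PO fixture in this commit uses "Net 45"). The two-table schema and the rows are hypothetical stand-ins; the real warehouse schema is built in `backend/app/erp/data.py`, which is not shown here.

```python
import sqlite3

# Hypothetical miniature of the vendors/invoices tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vendors (vendor_id INTEGER PRIMARY KEY, name TEXT, payment_terms TEXT);
CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, vendor_id INTEGER,
                       status TEXT, days_to_pay INTEGER);
INSERT INTO vendors VALUES (1, 'Acme', 'Net 30'), (2, 'Globex', 'Net 45');
INSERT INTO invoices VALUES
  (1, 1, 'paid', 25),   -- within Net 30
  (2, 1, 'paid', 40),   -- late against Net 30
  (3, 2, 'paid', 44),   -- within Net 45
  (4, 2, 'paid', 60),   -- late against Net 45
  (5, 2, 'open', NULL); -- unpaid, excluded by the WHERE clause
""")

# SQLite's substr() is 1-indexed, so position 5 of 'Net 45' is where the digits start.
terms_days = conn.execute("SELECT CAST(substr('Net 45',5) AS INT)").fetchone()[0]

# The late_rate query, copied from the dataset record above.
LATE_RATE_SQL = (
    "SELECT ROUND(100.0*SUM(CASE WHEN i.days_to_pay > "
    "CAST(substr(v.payment_terms,5) AS INT) THEN 1 ELSE 0 END)/COUNT(*),1) AS late_pct, "
    "COUNT(*) AS paid_invoices FROM invoices i JOIN vendors v "
    "ON v.vendor_id=i.vendor_id WHERE i.status='paid'"
)
late_pct, paid_invoices = conn.execute(LATE_RATE_SQL).fetchone()
print(terms_days, late_pct, paid_invoices)
```

The fixed offset only works while every terms string has the `Net ` prefix; a value in another layout would not yield the day count, so a format check on `payment_terms` may be worth adding wherever the data is generated.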

results/ocr_quality_report.json ADDED

	@@ -0,0 +1,313 @@

+{
+  "generated_at": 1781326743.1110458,
+  "note": "CER/WER vs .txt sidecar reference; field accuracy vs gt.json. sidecar = reference text source (CER\u22480 by construction).",
+  "models": [
+    {
+      "lab": "OpenBMB",
+      "name": "MiniCPM-V-4.6",
+      "params_b": 8.0,
+      "size_gb_int4": 5.5,
+      "modality": "vision-language (OCR + reasoning)",
+      "role": "OCR backend + LLM extractor",
+      "available": true
+    },
+    {
+      "lab": "OpenBMB",
+      "name": "MiniCPM-o-4.5",
+      "params_b": 8.0,
+      "size_gb_int4": 5.5,
+      "modality": "omni (vision/audio) VLM",
+      "role": "alt OCR/VLM",
+      "available": true
+    },
+    {
+      "lab": "OpenBMB",
+      "name": "MiniCPM3-4B",
+      "params_b": 4.0,
+      "size_gb_int4": 2.8,
+      "modality": "text LLM (reasoning + function-calling, 32k ctx)",
+      "role": "ERP reasoning \u00b7 NLQ\u2192SQL \u00b7 report summarization (fine-tune target)",
+      "available": true
+    },
+    {
+      "lab": "Cohere",
+      "name": "Aya-Vision-8B",
+      "params_b": 8.0,
+      "size_gb_int4": 6.0,
+      "modality": "vision-language (OCR/VQA, 23 langs)",
+      "role": "OCR backend",
+      "available": false
+    },
+    {
+      "lab": "Cohere",
+      "name": "Aya-Vision-32B",
+      "params_b": 32.0,
+      "size_gb_int4": 18.0,
+      "modality": "vision-language (OCR/VQA)",
+      "role": "alt OCR backend (max-quality small)",
+      "available": false
+    },
+    {
+      "lab": "Cohere",
+      "name": "Command R7B",
+      "params_b": 7.0,
+      "size_gb_int4": 5.0,
+      "modality": "text LLM (RAG + tool-use + reasoning, 128k ctx)",
+      "role": "ERP RAG \u00b7 NLQ \u00b7 grounded reasoning",
+      "available": false
+    },
+    {
+      "lab": "Black Forest Labs",
+      "name": "FLUX.1 [dev]",
+      "params_b": 12.0,
+      "size_gb_int4": 12.0,
+      "modality": "text-to-image GENERATION",
+      "role": "synthetic test-document generator (not OCR)",
+      "available": false
+    },
+    {
+      "lab": "Black Forest Labs",
+      "name": "FLUX.1 [schnell]",
+      "params_b": 12.0,
+      "size_gb_int4": 12.0,
+      "modality": "text-to-image GENERATION (fast)",
+      "role": "synthetic test-document generator",
+      "available": false
+    }
+  ],
+  "backends": [
+    {
+      "backend": "minicpm",
+      "model": "MiniCPM-V-4.6",
+      "params_b": 8.0,
+      "size_gb": 5.5,
+      "lab": "OpenBMB",
+      "is_reference": false,
+      "cer": 0.0262,
+      "wer": 0.0876,
+      "field_exact_match": 0.907,
+      "field_f1": 0.9397,
+      "avg_latency_ms": 6524.8167,
+      "avg_cost_usd": 0.0002,
+      "samples_scored": 6,
+      "per_sample": [
+        {
+          "sample": "invoice_scanned_basic",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.889,
+          "field_f1": 0.9,
+          "latency_ms": 5560.8,
+          "cost_usd": 0.0001952,
+          "confidence": 0.7
+        },
+        {
+          "sample": "receipt_scanned",
+          "cer": 0.0942,
+          "wer": 0.3103,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 4218.7,
+          "cost_usd": 0.0001883,
+          "confidence": 0.98
+        },
+        {
+          "sample": "po_scanned",
+          "cer": 0.0368,
+          "wer": 0.1277,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 4404.9,
+          "cost_usd": 0.0001835,
+          "confidence": 0.98
+        },
+        {
+          "sample": "contract_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.636,
+          "field_f1": 0.8,
+          "latency_ms": 6532.2,
+          "cost_usd": 0.000166,
+          "confidence": 0.98
+        },
+        {
+          "sample": "subscription_memo_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.917,
+          "field_f1": 0.938,
+          "latency_ms": 5010.4,
+          "cost_usd": 0.0001924,
+          "confidence": 0.98
+        },
+        {
+          "sample": "complex_invoice_messy",
+          "cer": null,
+          "wer": null,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 13421.9,
+          "cost_usd": 0.0004414,
+          "confidence": 0.98
+        }
+      ]
+    },
+    {
+      "backend": "tesseract",
+      "model": "tesseract",
+      "params_b": null,
+      "size_gb": null,
+      "lab": "classic",
+      "is_reference": false,
+      "cer": 0.1468,
+      "wer": 0.1848,
+      "field_exact_match": 0.907,
+      "field_f1": 0.9397,
+      "avg_latency_ms": 3436.8667,
+      "avg_cost_usd": 0.0001,
+      "samples_scored": 6,
+      "per_sample": [
+        {
+          "sample": "invoice_scanned_basic",
+          "cer": 0.1225,
+          "wer": 0.1389,
+          "field_exact": 0.889,
+          "field_f1": 0.9,
+          "latency_ms": 3698.6,
+          "cost_usd": 0.0001242,
+          "confidence": 0.68
+        },
+        {
+          "sample": "receipt_scanned",
+          "cer": 0.4555,
+          "wer": 0.5172,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 2861.1,
+          "cost_usd": 0.0001207,
+          "confidence": 0.96
+        },
+        {
+          "sample": "po_scanned",
+          "cer": 0.0951,
+          "wer": 0.1489,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 3390.6,
+          "cost_usd": 0.000118,
+          "confidence": 0.96
+        },
+        {
+          "sample": "contract_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.636,
+          "field_f1": 0.8,
+          "latency_ms": 2336.4,
+          "cost_usd": 7.97e-05,
+          "confidence": 0.96
+        },
+        {
+          "sample": "subscription_memo_scanned",
+          "cer": 0.061,
+          "wer": 0.119,
+          "field_exact": 0.917,
+          "field_f1": 0.938,
+          "latency_ms": 2243.8,
+          "cost_usd": 0.0001211,
+          "confidence": 0.96
+        },
+        {
+          "sample": "complex_invoice_messy",
+          "cer": null,
+          "wer": null,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 6090.7,
+          "cost_usd": 0.0002828,
+          "confidence": 0.96
+        }
+      ]
+    },
+    {
+      "backend": "sidecar",
+      "model": "sidecar",
+      "params_b": null,
+      "size_gb": null,
+      "lab": "classic",
+      "is_reference": true,
+      "cer": 0.0,
+      "wer": 0.0,
+      "field_exact_match": 0.907,
+      "field_f1": 0.9397,
+      "avg_latency_ms": 3235.3167,
+      "avg_cost_usd": 0.0001,
+      "samples_scored": 6,
+      "per_sample": [
+        {
+          "sample": "invoice_scanned_basic",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.889,
+          "field_f1": 0.9,
+          "latency_ms": 1697.6,
+          "cost_usd": 9.45e-05,
+          "confidence": 0.66
+        },
+        {
+          "sample": "receipt_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 2126.2,
+          "cost_usd": 0.0001212,
+          "confidence": 0.94
+        },
+        {
+          "sample": "po_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 2194.4,
+          "cost_usd": 0.0001184,
+          "confidence": 0.94
+        },
+        {
+          "sample": "contract_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.636,
+          "field_f1": 0.8,
+          "latency_ms": 1522.1,
+          "cost_usd": 7.97e-05,
+          "confidence": 0.94
+        },
+        {
+          "sample": "subscription_memo_scanned",
+          "cer": 0.0,
+          "wer": 0.0,
+          "field_exact": 0.917,
+          "field_f1": 0.938,
+          "latency_ms": 1632.7,
+          "cost_usd": 9.16e-05,
+          "confidence": 0.94
+        },
+        {
+          "sample": "complex_invoice_messy",
+          "cer": null,
+          "wer": null,
+          "field_exact": 1.0,
+          "field_f1": 1.0,
+          "latency_ms": 10238.9,
+          "cost_usd": 0.0002297,
+          "confidence": 0.98
+        }
+      ]
+    }
+  ],
+  "best_ocr_quality": "minicpm",
+  "best_document_analysis": "minicpm"
+}
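
The report's note says CER/WER are measured against the `.txt` sidecar reference. The scorer itself (`backend/evals/scorers.py`) is not shown in this part of the diff, so the following is only a sketch of the usual definition, assuming plain Levenshtein distance over characters (CER) and whitespace-separated tokens (WER). The function names are illustrative, not the project's API. The last helper averages per-sample values while skipping `null` entries, which is consistent with the `minicpm` aggregate `cer` of 0.0262 in the report.

```python
from typing import Optional, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or word tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> Optional[float]:
    """Character error rate; None when there is no reference text."""
    return edit_distance(ref, hyp) / len(ref) if ref else None


def wer(ref: str, hyp: str) -> Optional[float]:
    """Word error rate over whitespace-separated tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words) if ref_words else None


def mean_non_null(values) -> Optional[float]:
    """Average the scored samples, ignoring None (JSON null) entries."""
    scored = [v for v in values if v is not None]
    return round(sum(scored) / len(scored), 4) if scored else None


# Per-sample CER values for the minicpm backend, copied from the report above.
minicpm_cer = [0.0, 0.0942, 0.0368, 0.0, 0.0, None]
print(mean_non_null(minicpm_cer))
print(cer("TOTAL 33.46", "T0TAL 33.48"), wer("TOTAL 33.46", "T0TAL 33.48"))
```

A single wrong character spoils a whole token, which is why WER runs well above CER on short, number-heavy documents such as receipts.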

scripts/finetune_erp.py ADDED

	@@ -0,0 +1,153 @@

+#!/usr/bin/env python3
+"""Fine-tune a small model on the simulated ERP domain — two backends, one dataset.
+    python scripts/finetune_erp.py                # offline CPU demo (default) — runs anywhere
+    python scripts/finetune_erp.py --backend hf   # real LoRA on OpenBMB MiniCPM3-4B (needs GPU)
+Outputs (committed/published):
+    backend/finetune/erp_sft.jsonl            instruction-tuning dataset from the ERP KB
+    backend/finetune/erp_finetune_report.json before→after metrics + loss curve  (served at
+                                              /api/erp/finetune-report and shown in the UI)
+    backend/finetune/runs/<backend>_<ts>.json per-run snapshot (hf adapter dir: runs/hf_<ts>/)
+The `hf` backend builds the exact PEFT/TRL SFTTrainer config for MiniCPM3-4B and (if
+torch+peft+trl are installed and a GPU is present) runs it; otherwise it writes the
+ready-to-run recipe so it can be launched on a GPU box / HF Space / Colab unchanged.
+"""
+from __future__ import annotations
+import argparse
+import json
+import sys
+import time
+from pathlib import Path
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT / "backend"))
+from app.config import get_settings  # noqa: E402
+from app.erp.finetune import build_dataset, run_offline_finetune  # noqa: E402
+FT_DIR = ROOT / "backend" / "finetune"
+BASE_MODEL = "openbmb/MiniCPM3-4B"
+def _lora_recipe(jsonl: Path) -> dict:
+    """The production recipe: LoRA SFT of OpenBMB MiniCPM3-4B on the ERP dataset."""
+    return {
+        "base_model": BASE_MODEL,
+        "method": "LoRA (PEFT) supervised fine-tuning (TRL SFTTrainer)",
+        "dataset": str(jsonl.relative_to(ROOT)),
+        "prompt_template": "{instruction}\n\nERP question: {input}\nSQL:",
+        "hyperparams": {
+            "lora_r": 16, "lora_alpha": 32, "lora_dropout": 0.05,
+            "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
+            "learning_rate": 2e-4, "num_train_epochs": 3, "per_device_train_batch_size": 8,
+            "gradient_accumulation_steps": 2, "max_seq_length": 1024, "bf16": True,
+        },
+        "command": "python scripts/finetune_erp.py --backend hf",
+        "requirements": ["torch", "transformers>=4.44", "peft", "trl", "accelerate", "datasets"],
+    }
+def _run_hf(jsonl: Path, settings) -> dict:
+    """Run a real LoRA SFT if the stack is present; else emit the runnable recipe."""
+    recipe = _lora_recipe(jsonl)
+    try:
+        import torch  # noqa
+        from datasets import load_dataset  # noqa
+        from peft import LoraConfig  # noqa
+        from transformers import AutoModelForCausalLM, AutoTokenizer  # noqa
+        from trl import SFTConfig, SFTTrainer  # noqa
+    except Exception as e:
+        return {"backend": "hf", "ran": False, "reason": f"training stack unavailable ({e})",
+                "recipe": recipe,
+                "note": "Dataset + recipe are ready; launch on a GPU box to fine-tune MiniCPM3-4B."}
+    import torch
+    from datasets import load_dataset
+    from peft import LoraConfig
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+    from trl import SFTConfig, SFTTrainer
+    tok = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
+    model = AutoModelForCausalLM.from_pretrained(
+        BASE_MODEL, trust_remote_code=True,
+        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
+        device_map="auto")
+    ds = load_dataset("json", data_files=str(jsonl), split="train")
+    def fmt(ex):
+        return {"text": f"{ex['instruction']}\n\nERP question: {ex['input']}\nSQL: {ex['output']}{tok.eos_token}"}
+    ds = ds.map(fmt)
+    out = FT_DIR / "runs" / f"hf_{time.strftime('%Y%m%dT%H%M%S')}"
+    trainer = SFTTrainer(
+        model=model,
+        train_dataset=ds,
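+        # PEFT's LoraConfig names the rank `r`, so map the recipe's `lora_r` key explicitly.
+        peft_config=LoraConfig(r=recipe["hyperparams"]["lora_r"],
+                               **{k: recipe["hyperparams"][k] for k in
+                                  ("lora_alpha", "lora_dropout", "target_modules")},
+                               task_type="CAUSAL_LM"),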
+        args=SFTConfig(output_dir=str(out), num_train_epochs=3, per_device_train_batch_size=8,
+                       gradient_accumulation_steps=2,  # keep in sync with the recipe hyperparams
+                       learning_rate=2e-4, logging_steps=10, max_seq_length=1024,
+                       bf16=torch.cuda.is_available()))
+    res = trainer.train()
+    trainer.save_model(str(out))
+    return {"backend": "hf", "ran": True, "adapter_dir": str(out),
+            "train_loss": float(getattr(res, "training_loss", 0.0)), "recipe": recipe}
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--backend", choices=["local", "hf"], default="local")
+    ap.add_argument("--epochs", type=int, default=400)
+    args = ap.parse_args()
+    settings = get_settings()
+    FT_DIR.mkdir(parents=True, exist_ok=True)
+    (FT_DIR / "runs").mkdir(exist_ok=True)
+    # 1) build + write the shared instruction-tuning dataset
+    data = build_dataset()
+    jsonl = FT_DIR / "erp_sft.jsonl"
+    jsonl.write_text("\n".join(json.dumps(r) for r in data) + "\n")
+    # 2) train
+    if args.backend == "local":
+        result = run_offline_finetune(settings, epochs=args.epochs)
+        result["backend"] = "local"
+        result["dataset_jsonl"] = str(jsonl.relative_to(ROOT))
+        result["production_recipe"] = _lora_recipe(jsonl)
+    else:
+        result = _run_hf(jsonl, settings)
+        # always include the offline metrics too, so the UI has a populated report
+        result["offline_demo"] = run_offline_finetune(settings, epochs=args.epochs)
+    result["base_model_for_production"] = BASE_MODEL
+    result["generated_at"] = time.time()
+    # 3) publish
+    report = FT_DIR / "erp_finetune_report.json"
+    report.write_text(json.dumps(result, indent=2))
+    snap = FT_DIR / "runs" / f"{args.backend}_{time.strftime('%Y%m%dT%H%M%S')}.json"
+    snap.write_text(json.dumps(result, indent=2))
+    # 4) print a readout
+    r = result if args.backend == "local" else result.get("offline_demo", {})
+    print("\n" + "=" * 78)
+    print(" ERP DOMAIN FINE-TUNE  (backend: %s)" % args.backend)
+    print("=" * 78)
+    print(f" dataset           : {len(data)} examples → {jsonl.relative_to(ROOT)}")
+    print(f" production target : {BASE_MODEL}  (LoRA recipe emitted)")
+    if r:
+        print(f" offline trainer   : {r['model']}")
+        print(f"   classes={r['n_classes']}  train={r['train']}  test={r['test']}  params={r['trainable_params']:,}")
+        print(f"   BEFORE test-acc : {r['before_test_accuracy']*100:5.1f}%")
+        print(f"   AFTER  test-acc : {r['after_test_accuracy']*100:5.1f}%   (+{r['accuracy_gain']*100:.1f} pts)")
+        print(f"   routed-SQL exec : {r['routed_sql_exec_rate']*100:.1f}%   final loss={r['final_loss']}")
+    print(f" published         : {report.relative_to(ROOT)}")
+    print("=" * 78 + "\n")
+if __name__ == "__main__":
+    main()
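
The recipe stores the LoRA rank under `lora_r`, while `peft.LoraConfig` takes it as `r`. A small sketch of translating the recipe's hyperparameters into keyword names `LoraConfig` accepts, plus a rough estimate of optimizer steps, may help when porting the recipe. The helper names are hypothetical, the dictionary copies values from `_lora_recipe` above, and the 120-example count is an assumption taken from the `+120` line stat of `erp_sft.jsonl`. The step estimate ignores trainer details such as last-batch handling.

```python
import math

# Values copied from the recipe's "hyperparams" block.
RECIPE_HYPERPARAMS = {
    "lora_r": 16, "lora_alpha": 32, "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "per_device_train_batch_size": 8, "gradient_accumulation_steps": 2,
    "num_train_epochs": 3,
}


def lora_kwargs(hp: dict) -> dict:
    """Translate recipe keys into peft.LoraConfig keyword names (lora_r -> r)."""
    return {
        "r": hp["lora_r"],
        "lora_alpha": hp["lora_alpha"],
        "lora_dropout": hp["lora_dropout"],
        "target_modules": list(hp["target_modules"]),
        "task_type": "CAUSAL_LM",
    }


def estimated_optimizer_steps(n_examples: int, hp: dict, n_devices: int = 1) -> int:
    """Rough step count: examples per optimizer update, rounded up, times epochs."""
    effective_batch = (hp["per_device_train_batch_size"]
                       * hp["gradient_accumulation_steps"] * n_devices)
    return math.ceil(n_examples / effective_batch) * hp["num_train_epochs"]


kwargs = lora_kwargs(RECIPE_HYPERPARAMS)
steps = estimated_optimizer_steps(120, RECIPE_HYPERPARAMS)  # 120 examples is an assumption
print(kwargs["r"], steps)
```

With the training stack installed, `LoraConfig(**kwargs)` would then build the adapter config. At a couple of dozen optimizer steps, `logging_steps=10` yields only a few loss points, so a smaller logging interval may give a more useful loss curve.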

scripts/generate_extreme_docs.py ADDED

	@@ -0,0 +1,421 @@

+#!/usr/bin/env python3
+"""Generate EXTREMELY hard OCR documents — embedded images + heavy degradation:
+  1. extreme_receipt_photo   — thermal receipt PHOTOGRAPHED on a desk: perspective
+     warp, uneven lighting, shadow, crinkle lines, faded thermal band, printed logo.
+  2. extreme_po_collage      — image-heavy purchase order: product THUMBNAIL IMAGES
+     in table rows, QR code, barcode, rotated APPROVED stamp over the table,
+     signature scribble, misaligned columns.
+  3. extreme_contract_fax    — dense two-column contract received BY FAX: low
+     contrast, salt-and-pepper noise, skew, scanline streaks, punch-hole shadows,
+     handwritten blue margin note, red RECEIVED stamp.
+Each writes <id>.png + <id>.gt.json + <id>.txt (sidecar reference text, drawn from
+the SAME strings as the image so CER/WER is fair). All are tagged skip_eval so the
+main eval harness is unchanged; the OCR quality benchmark picks them up.
+"""
+from __future__ import annotations
+import json
+import math
+import random
+from pathlib import Path
+import numpy as np
+from PIL import Image, ImageDraw, ImageFilter, ImageFont
+ROOT = Path(__file__).resolve().parent.parent
+OUT = ROOT / "backend" / "evals" / "datasets"
+rng = random.Random(42)
+def font(sz, bold=False, mono=False):
+    paths = (["/System/Library/Fonts/Supplemental/Courier New Bold.ttf",
+              "/System/Library/Fonts/Supplemental/Courier New.ttf"] if mono else []) + [
+        "/System/Library/Fonts/Supplemental/Arial Bold.ttf" if bold else "/System/Library/Fonts/Supplemental/Arial.ttf",
+        "/System/Library/Fonts/Helvetica.ttc",
+        "/Library/Fonts/Arial.ttf",
+    ]
+    for p in paths:
+        try:
+            return ImageFont.truetype(p, sz)
+        except Exception:
+            continue
+    return ImageFont.load_default()
+def _find_coeffs(dst, src):
+    """Perspective coefficients so that src corners land on dst corners."""
+    A, B = [], []
+    for (x, y), (u, v) in zip(dst, src):
+        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
+        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
+        B.extend([u, v])
+    res, *_ = np.linalg.lstsq(np.array(A, float), np.array(B, float), rcond=None)
+    return res.tolist()
+def stamp(text, color, angle, size=(360, 120), fsz=34):
+    im = Image.new("RGBA", size, (0, 0, 0, 0))
+    d = ImageDraw.Draw(im)
+    d.rounded_rectangle([4, 4, size[0] - 4, size[1] - 4], radius=16, outline=color + (190,), width=5)
+    f = font(fsz, bold=True)
+    tw = d.textlength(text, font=f)
+    d.text(((size[0] - tw) / 2, (size[1] - fsz) / 2 - 6), text, font=f, fill=color + (190,))
+    return im.rotate(angle, expand=True, resample=Image.BICUBIC)
+def signature(w=220, h=60, color=(25, 30, 120)):
+    im = Image.new("RGBA", (w, h), (0, 0, 0, 0))
+    d = ImageDraw.Draw(im)
+    pts = []
+    for i in range(60):
+        t = i / 59
+        x = 8 + t * (w - 16)
+        y = h / 2 + math.sin(t * 14 + 1.3) * (h / 3) * (1 - 0.5 * t) + rng.uniform(-2, 2)
+        pts.append((x, y))
+    d.line(pts, fill=color + (230,), width=3, joint="curve")
+    return im
+# ── 1. extreme_receipt_photo ──────────────────────────────────────────────────
+R_LINES = [
+    "BREW & BEAN COFFEE Co.",
+    "412 Harbor Lane, Portland OR",
+    "Receipt #R-88341  Reg 02",
+    "Date: 2026-06-02  14:37",
+    "Currency: USD",
+    "--------------------------------",
+    "Flat White         2 x 4.75   9.50",
+    "Butter Croissant   3 x 3.25   9.75",
+    "Cold Brew Growler  1 x 14.00 14.00",
+    "Loyalty discount             -2.50",
+    "--------------------------------",
+    "Subtotal                     30.75",
+    "Tax 8.8%                      2.71",
+    "TOTAL                        33.46",
+    "Payment: VISA ****4421",
+    "--------------------------------",
+    "Thank you! brewandbean.example",
+]
+R_GT = {
+    "doc_type": "receipt",
+    "merchant": "BREW & BEAN COFFEE Co.",
+    "date": "2026-06-02",
+    "currency": "USD",
+    "subtotal": 30.75,
+    "tax_amount": 2.71,
+    "total": 33.46,
+    "payment_method": "VISA ****4421",
+    "line_items": [
+        {"description": "Flat White", "quantity": 2, "unit_price": 4.75, "line_total": 9.50},
+        {"description": "Butter Croissant", "quantity": 3, "unit_price": 3.25, "line_total": 9.75},
+        {"description": "Cold Brew Growler", "quantity": 1, "unit_price": 14.00, "line_total": 14.00},
+    ],
+    "_meta": {"doc_type": "receipt", "channel": "photo", "difficulty": "extreme", "skip_eval": True},
+}
+def gen_receipt():
+    pw, ph = 560, 1010
+    paper = Image.new("RGBA", (pw, ph), (250, 248, 242, 255))
+    d = ImageDraw.Draw(paper)
+    # printed logo: filled coffee-cup glyph in a ring
+    cx, cy = pw // 2, 64
+    d.ellipse([cx - 44, cy - 44, cx + 44, cy + 44], outline=(60, 50, 45), width=4)
+    d.rounded_rectangle([cx - 20, cy - 14, cx + 14, cy + 22], radius=5, fill=(60, 50, 45))
+    d.arc([cx + 8, cy - 8, cx + 30, cy + 14], 270, 90, fill=(60, 50, 45), width=4)
+    fm = font(24, mono=True)
+    y = 130
+    for ln in R_LINES:
+        w = d.textlength(ln, font=fm)
+        x = (pw - w) / 2 if not ln.startswith(("Flat", "Butter", "Cold", "Loyal", "Subt", "Tax", "TOTAL", "Paym")) else 28
+        d.text((x, y), ln, font=fm, fill=(40, 38, 36))
+        y += 36
+    d.line([(0, ph - 14), (pw, ph - 6)], fill=(250, 248, 242, 0))  # keep bottom edge clean
+    # crinkle lines
+    for _ in range(7):
+        x0 = rng.randint(0, pw)
+        d.line([(x0, 0), (x0 + rng.randint(-90, 90), ph)], fill=(208, 204, 196, 90), width=2)
+    # faded thermal band (blend toward white)
+    arr = np.asarray(paper).astype(np.float32)
+    y0, y1 = 430, 560
+    fade = arr[y0:y1, :, :3]
+    arr[y0:y1, :, :3] = fade + (255 - fade) * 0.55
+    paper = Image.fromarray(arr.astype(np.uint8))
+    # desk background with wood grain + vignette
+    W, H = 1000, 1400
+    desk = Image.new("RGB", (W, H), (96, 74, 54))
+    dd = ImageDraw.Draw(desk)
+    for yy in range(0, H, 7):
+        dd.line([(0, yy), (W, yy + rng.randint(-3, 3))],
+                fill=(96 + rng.randint(-10, 8), 74 + rng.randint(-8, 6), 54 + rng.randint(-6, 6)), width=3)
+    # shadow under receipt
+    sh = Image.new("RGBA", (W, H), (0, 0, 0, 0))
+    ImageDraw.Draw(sh).polygon([(232, 152), (798, 198), (742, 1292), (172, 1232)], fill=(0, 0, 0, 110))
+    desk.paste(Image.new("RGB", (W, H), 0), (0, 0), sh.filter(ImageFilter.GaussianBlur(18)))
+    # perspective-warp the receipt onto the desk
+    dst = [(248, 138), (786, 186), (730, 1276), (188, 1218)]
+    coeffs = _find_coeffs(dst, [(0, 0), (pw, 0), (pw, ph), (0, ph)])
+    warped = paper.transform((W, H), Image.PERSPECTIVE, coeffs, Image.BICUBIC)
+    desk.paste(warped, (0, 0), warped)
+    # uneven lighting: bright top-left, dim bottom-right + vignette
+    a = np.asarray(desk).astype(np.float32)
+    yy, xx = np.mgrid[0:H, 0:W]
+    light = 1.12 - 0.32 * ((xx / W) * 0.6 + (yy / H) * 0.4)
+    r2 = ((xx - W / 2) / (W / 2)) ** 2 + ((yy - H / 2) / (H / 2)) ** 2
+    light *= 1 - 0.18 * np.clip(r2 - 0.45, 0, 1)
+    a *= light[..., None]
+    a += np.random.default_rng(7).normal(0, 4.5, a.shape)
+    img = Image.fromarray(np.clip(a, 0, 255).astype(np.uint8)).filter(ImageFilter.GaussianBlur(0.6))
+    img.save(OUT / "extreme_receipt_photo.png")
+    (OUT / "extreme_receipt_photo.txt").write_text("\n".join(R_LINES) + "\n")
+    (OUT / "extreme_receipt_photo.gt.json").write_text(json.dumps(R_GT, indent=2))
+# ── 2. extreme_po_collage ─────────────────────────────────────────────────────
+PO_ITEMS = [
+    ("SHELF UNIT S-200 heavy gauge", 24, 189.00, 4536.00),
+    ("LED STRIP 2m retail white", 60, 22.40, 1344.00),
+    ("ENDCAP DISPLAY birch finish", 12, 310.00, 3720.00),
+]
+PO_GT = {
+    "doc_type": "purchase_order",
+    "order_number": "PO-77RX-3309",
+    "order_date": "2026-05-21",
+    "delivery_date": "2026-06-15",
+    "vendor_name": "Nordic Fixture Works AB",
+    "buyer_name": "Aperture Retail Group",
+    "ship_to": "DC-7, 4420 Logistics Pkwy, Columbus OH",
+    "currency": "USD",
+    "payment_terms": "Net 45",
+    "subtotal": 9600.00,
+    "tax_amount": 792.00,
+    "total": 10392.00,
+    "line_items": [{"description": d_, "quantity": q, "unit_price": u, "line_total": t}
+                   for d_, q, u, t in PO_ITEMS],
+    "_meta": {"doc_type": "purchase_order", "channel": "scanned", "difficulty": "extreme", "skip_eval": True},
+}
+def _thumb(kind):
+    im = Image.new("RGB", (76, 76), (235, 238, 242))
+    d = ImageDraw.Draw(im)
+    if kind == 0:  # shelf unit
+        for i in range(4):
+            d.rectangle([10, 12 + i * 15, 66, 18 + i * 15], fill=(120, 128, 140))
+        d.line([(12, 12), (12, 66)], fill=(80, 86, 96), width=3)
+        d.line([(64, 12), (64, 66)], fill=(80, 86, 96), width=3)
+    elif kind == 1:  # LED strip
+        d.rounded_rectangle([8, 30, 68, 46], radius=8, fill=(60, 64, 70))
+        for x in range(14, 66, 9):
+            d.ellipse([x, 34, x + 6, 42], fill=(255, 240, 160))
+    else:  # endcap display
+        d.polygon([(14, 64), (26, 14), (50, 14), (62, 64)], fill=(196, 164, 120))
+        d.rectangle([20, 40, 56, 46], fill=(160, 128, 88))
+        d.rectangle([24, 26, 52, 32], fill=(160, 128, 88))
+    d.rectangle([0, 0, 75, 75], outline=(150, 150, 150))
+    return im
+def _qr(d, x, y, n=21, cell=5):
+    g = random.Random(9)
+    for r in range(n):
+        for c in range(n):
+            if g.random() < 0.45:
+                d.rectangle([x + c * cell, y + r * cell, x + c * cell + cell - 1, y + r * cell + cell - 1], fill=0)
+    for fx, fy in [(0, 0), (n - 7, 0), (0, n - 7)]:  # finder squares
+        d.rectangle([x + fx * cell, y + fy * cell, x + (fx + 7) * cell, y + (fy + 7) * cell], outline=0, width=3)
+        d.rectangle([x + (fx + 2) * cell, y + (fy + 2) * cell, x + (fx + 5) * cell, y + (fy + 5) * cell], fill=0)
+def gen_po():
+    W, H = 1240, 1600
+    im = Image.new("RGB", (W, H), (252, 252, 250))
+    d = ImageDraw.Draw(im)
+    h1, h2, h3, body, small = font(40, True), font(22, True), font(18, True), font(19), font(15)
+    # header: drawn logo + vendor (left), meta box (right), QR top-right corner
+    d.rectangle([40, 40, 120, 120], fill=(30, 90, 160))
+    d.polygon([(52, 108), (80, 52), (108, 108)], fill=(252, 252, 250))
+    d.text((136, 48), "Nordic Fixture Works AB", font=h2, fill=(20, 20, 30))
+    d.text((136, 80), "Industrigatan 14, Malmo SE  ·  VAT SE5566778899", font=small, fill=(90, 90, 100))
+    d.text((40, 150), "PURCHASE ORDER", font=h1, fill=(30, 90, 160))
+    _qr(d, 1060, 40)
+    meta = [("PO Number:", "PO-77RX-3309"), ("Order Date:", "2026-05-21"),
+            ("Delivery Date:", "2026-06-15"), ("Payment Terms:", "Net 45"), ("Currency:", "USD")]
+    d.rounded_rectangle([720, 150, 1200, 320], radius=10, outline=(30, 90, 160), width=2)
+    for i, (k, v) in enumerate(meta):
+        d.text((740, 165 + i * 30), k, font=h3, fill=(90, 90, 100))
+        d.text((920, 165 + i * 30), v, font=body, fill=(20, 20, 30))
+    d.text((40, 230), "Buyer: Aperture Retail Group", font=body, fill=(20, 20, 30))
+    d.text((40, 260), "Ship To: DC-7, 4420 Logistics Pkwy, Columbus OH", font=body, fill=(20, 20, 30))
+    # table with thumbnails + deliberately misaligned columns
+    d.rectangle([40, 360, 1200, 404], fill=(30, 90, 160))
+    for x, t in [(56, "IMG"), (160, "DESCRIPTION"), (700, "QTY"), (840, "UNIT USD"), (1040, "AMOUNT")]:
+        d.text((x, 370), t, font=h3, fill=(255, 255, 255))
+    y = 420
+    for i, (desc, qty, unit, tot) in enumerate(PO_ITEMS):
+        off = [-14, 22, 6][i % 3]  # column misalignment per row (cycles if PO_ITEMS grows past 3)
+        im.paste(_thumb(i), (52, y))
+        d.text((160 + off, y + 24), desc, font=body, fill=(25, 25, 30))
+        d.text((706 + off // 2, y + 24), str(qty), font=body, fill=(25, 25, 30))
+        d.text((846 - off, y + 24), f"{unit:,.2f}", font=body, fill=(25, 25, 30))
+        d.text((1042 + off, y + 24), f"{tot:,.2f}", font=body, fill=(25, 25, 30))
+        d.line([(40, y + 88), (1200, y + 88)], fill=(210, 210, 215))
+        y += 96
+    # totals (right) + barcode (left) + signature
+    d.text((840, y + 24), "Subtotal:", font=h3, fill=(90, 90, 100))
+    d.text((1042, y + 24), "9,600.00", font=body, fill=(20, 20, 30))
+    d.text((840, y + 58), "Tax 8.25%:", font=h3, fill=(90, 90, 100))
+    d.text((1042, y + 58), "792.00", font=body, fill=(20, 20, 30))
+    d.rectangle([820, y + 92, 1200, y + 134], fill=(240, 244, 250))
+    d.text((840, y + 100), "TOTAL:", font=h2, fill=(30, 90, 160))
+    d.text((1042, y + 100), "10,392.00 USD", font=h2, fill=(30, 90, 160))
+    bx = 40
+    g = random.Random(5)
+    for _ in range(60):
+        wbar = g.choice((2, 2, 3, 5))
+        d.rectangle([bx, y + 40, bx + wbar, y + 110], fill=0)
+        bx += wbar + g.choice((2, 3))
+    d.text((40, y + 116), "*PO77RX3309*", font=small, fill=(60, 60, 60))
+    sig = signature()
+    im.paste(sig, (760, H - 220), sig)
+    d.line([(740, H - 160), (1010, H - 160)], fill=(60, 60, 60), width=2)
+    d.text((740, H - 150), "Authorized — K. Lindqvist, Procurement", font=small, fill=(60, 60, 60))
+    # green APPROVED stamp overlapping the table
+    st = stamp("APPROVED · OPS DESK", (20, 130, 60), 12)
+    im.paste(st, (430, 560), st)
+    # mild scan noise + tiny skew
+    a = np.asarray(im).astype(np.float32) + np.random.default_rng(3).normal(0, 5, (H, W, 3))
+    im = Image.fromarray(np.clip(a, 0, 255).astype(np.uint8)).rotate(-0.7, expand=False, fillcolor=(252, 252, 250))
+    im.save(OUT / "extreme_po_collage.png")
+    txt = ["PURCHASE ORDER", "Nordic Fixture Works AB", "Industrigatan 14, Malmo SE",
+           "PO Number: PO-77RX-3309", "Order Date: 2026-05-21", "Delivery Date: 2026-06-15",
+           "Payment Terms: Net 45", "Currency: USD",
+           "Buyer: Aperture Retail Group", "Ship To: DC-7, 4420 Logistics Pkwy, Columbus OH",
+           "IMG DESCRIPTION QTY UNIT USD AMOUNT"] + [
+           f"{desc} {q} {u:,.2f} {t:,.2f}" for desc, q, u, t in PO_ITEMS] + [
+           "Subtotal: 9,600.00", "Tax 8.25%: 792.00", "TOTAL: 10,392.00 USD",
+           "*PO77RX3309*", "APPROVED · OPS DESK", "Authorized — K. Lindqvist, Procurement"]
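+    # The sidecar text contains non-ASCII characters ("·", "—"), so pin the encoding
+    # rather than relying on the platform default.
+    (OUT / "extreme_po_collage.txt").write_text("\n".join(txt) + "\n", encoding="utf-8")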
+    (OUT / "extreme_po_collage.gt.json").write_text(json.dumps(PO_GT, indent=2))
+# ── 3. extreme_contract_fax ───────────────────────────────────────────────────
+C_GT = {
+    "doc_type": "contract",
+    "contract_number": "MSA-2026-0481",
+    "title": "Master Services Agreement - Store Fit-Out Program",
+    "party_a": "Aperture Retail Group",
+    "party_b": "Halcyon Build Partners LLC",
+    "effective_date": "2026-03-01",
+    "expiration_date": "2029-02-28",
+    "contract_value": 1250000.00,
+    "currency": "USD",
+    "governing_law": "State of Ohio",
+    "auto_renew": False,
+    "termination_notice_days": 60,
+    "_meta": {"doc_type": "contract", "channel": "fax", "difficulty": "extreme", "skip_eval": True},
+}
+C_HEAD = [
+    "MASTER SERVICES AGREEMENT - STORE FIT-OUT PROGRAM",
+    "Contract No: MSA-2026-0481",
+    "Party A: Aperture Retail Group   Party B: Halcyon Build Partners LLC",
+    "Effective Date: 2026-03-01   Expiration Date: 2029-02-28",
+    "Total Contract Value: USD 1,250,000.00   Governing Law: State of Ohio",
+    "Auto-Renewal: NO   Termination Notice: 60 days written notice",
+]
+C_BODY = [
+    "1. SCOPE. Contractor shall furnish all labor, materials, supervision and",
+    "equipment required for the fit-out of retail premises identified in each",
+    "Statement of Work executed under this Agreement.",
+    "2. TERM. This Agreement commences on the Effective Date and continues",
+    "until the Expiration Date unless terminated earlier per Section 9.",
+    "3. COMPENSATION. Client shall pay Contractor fees not to exceed the",
+    "Total Contract Value, payable per approved milestone invoices Net 30.",
+    "4. CHANGE ORDERS. No variation is binding unless documented in a",
+    "written change order signed by both parties' authorized representatives.",
+    "5. WARRANTIES. Contractor warrants workmanship free of defects for",
+    "twenty-four (24) months following practical completion of each site.",
+    "6. INSURANCE. Contractor shall maintain commercial general liability",
+    "coverage of not less than USD 5,000,000 per occurrence.",
+    "7. CONFIDENTIALITY. Each party shall protect Confidential Information",
+    "with no less than reasonable care and use it solely for this Agreement.",
+    "8. LIABILITY. Neither party is liable for indirect or consequential",
+    "damages; aggregate liability is capped at the Total Contract Value.",
+    "9. TERMINATION. Either party may terminate for convenience upon sixty",
+    "(60) days written notice, or immediately for uncured material breach.",
+    "10. GOVERNING LAW. This Agreement is governed by the laws of the",
+    "State of Ohio, excluding its conflict of law provisions.",
+]
+def gen_contract():
+    W, H = 1240, 1600
+    im = Image.new("RGB", (W, H), (255, 255, 255))
+    d = ImageDraw.Draw(im)
+    fh, fb, fs = font(26, True), font(17), font(14)
+    d.text((30, 18), "FAX  TX 06/12/2026 14:22  FROM HALCYON BUILD +1 614 555 0188  P.01/07", font=fs, fill=(60, 60, 60))
+    d.line([(30, 44), (1210, 44)], fill=(60, 60, 60), width=2)
+    tw = d.textlength(C_HEAD[0], font=fh)
+    d.text(((W - tw) / 2, 70), C_HEAD[0], font=fh, fill=(15, 15, 15))
+    y = 130
+    for ln in C_HEAD[1:]:
+        d.text((80, y), ln, font=fb, fill=(20, 20, 20))
+        y += 30
+    d.line([(60, y + 8), (1180, y + 8)], fill=(120, 120, 120), width=2)
+    # dense two-column body
+    half = (len(C_BODY) + 1) // 2
+    for col, lines in enumerate((C_BODY[:half], C_BODY[half:])):
+        x = 70 + col * 590
+        yy = y + 34
+        for ln in lines:
+            d.text((x, yy), ln, font=fs, fill=(25, 25, 25))
+            yy += 24
+        for extra in range(14):  # filler legalese to densify
+            d.text((x, yy), "WHEREAS the parties acknowledge the recitals set forth herein;"[: 58 - (extra % 3) * 4],
+                   font=fs, fill=(45, 45, 45))
+            yy += 24
+    # signature block
+    sy = H - 300
+    for col, (name, role) in enumerate([("M. Okafor — Aperture Retail Group", "Chief Procurement Officer"),
+                                        ("D. Reyes — Halcyon Build Partners LLC", "Managing Partner")]):
+        x = 90 + col * 600
+        sig = signature(color=(20, 20, 20))
+        im.paste(sig, (x, sy), sig)
+        d.line([(x, sy + 70), (x + 420, sy + 70)], fill=(40, 40, 40), width=2)
+        d.text((x, sy + 80), name, font=fs, fill=(30, 30, 30))
+        d.text((x, sy + 102), role, font=fs, fill=(90, 90, 90))
+    # handwritten blue margin note + red stamp
+    note = Image.new("RGBA", (430, 60), (0, 0, 0, 0))
+    ImageDraw.Draw(note).text((0, 8), "legal OK -> route to CFO  (June 5)", font=font(24), fill=(28, 40, 160, 220))
+    note = note.rotate(-3, expand=True, resample=Image.BICUBIC)
+    im.paste(note, (700, 360), note)
+    st = stamp("RECEIVED JUN 05 2026", (180, 30, 30), -14, size=(420, 110), fsz=30)
+    im.paste(st, (90, 430), st)
+    # fax degradation: low contrast, salt & pepper, scanline streaks, skew, punch holes
+    g = im.convert("L")
+    a = np.asarray(g).astype(np.float32)
+    a = 255 - (255 - a) * 0.62                      # washed-out toner
+    nz = np.random.default_rng(11)
+    a += nz.normal(0, 9, a.shape)
+    pepper = nz.random(a.shape)
+    a[pepper < 0.004] = 30                          # pepper
+    a[pepper > 0.997] = 245                         # salt
+    for yy in range(0, H, 90):                      # scanline streaks
+        a[yy:yy + 2, :] = np.clip(a[yy:yy + 2, :] * 1.25, 0, 255)
+    img = Image.fromarray(np.clip(a, 0, 255).astype(np.uint8)).rotate(1.3, expand=False, fillcolor=235)
+    d2 = ImageDraw.Draw(img)
+    for hy in (H // 4, 3 * H // 4):                 # punch-hole shadows
+        d2.ellipse([18, hy - 22, 62, hy + 22], fill=246, outline=140, width=3)
+    img.convert("RGB").save(OUT / "extreme_contract_fax.png")
+    (OUT / "extreme_contract_fax.txt").write_text("\n".join(C_HEAD + C_BODY) + "\n")
+    (OUT / "extreme_contract_fax.gt.json").write_text(json.dumps(C_GT, indent=2))
+if __name__ == "__main__":
+    OUT.mkdir(parents=True, exist_ok=True)
+    gen_receipt()
+    gen_po()
+    gen_contract()
+    for sid in ("extreme_receipt_photo", "extreme_po_collage", "extreme_contract_fax"):
+        print(f"  wrote {OUT / sid}.png (+ .gt.json + .txt)")
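
Review note: `gen_po()` draws the subtotal, tax and total as hardcoded strings ("9,600.00", "792.00", "10,392.00 USD") and writes the same strings to the sidecar `.txt`. The following is a minimal sketch, assuming the tax is simply the subtotal times 8.25% rounded to cents, that confirms the three figures agree with each other. The helper name `po_totals` is hypothetical and is not part of the commit.

```python
from __future__ import annotations

from decimal import Decimal, ROUND_HALF_UP


def po_totals(subtotal: str, tax_rate_pct: str) -> tuple[Decimal, Decimal]:
    """Return (tax, total) for a subtotal and a percentage tax rate.

    Hypothetical helper for illustration; Decimal avoids binary float
    rounding surprises when comparing against printed currency amounts.
    """
    sub = Decimal(subtotal)
    tax = (sub * Decimal(tax_rate_pct) / Decimal(100)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return tax, sub + tax


tax, total = po_totals("9600.00", "8.25")
# Format the same way the generator formats line amounts (thousands separator, 2 dp).
print(f"Tax 8.25%: {tax:,.2f}")
print(f"TOTAL: {total:,.2f} USD")
```

If the line items in `PO_ITEMS` ever change, a check along these lines could flag stale hardcoded totals before the ground-truth files are regenerated.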

scripts/ocr_quality.py ADDED

	@@ -0,0 +1,67 @@

+#!/usr/bin/env python3
+"""Run the OCR output-quality + document-analysis benchmark across all available
+backends (OpenBMB MiniCPM-V, Cohere Aya-Vision, Tesseract, sidecar) and PUBLISH the
+results.
+    python scripts/ocr_quality.py
+Writes:
+  backend/evals/ocr_quality_report.json   (committed, tracked)
+  <writable>/metrics_snapshots/ocr_quality_<ts>.json  (published snapshot)
+"""
+from __future__ import annotations
+import json
+import sys
+import time
+from pathlib import Path
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT / "backend"))
+from app.config import get_settings  # noqa: E402
+from app.db import Database  # noqa: E402
+from app.metrics import MetricsStore  # noqa: E402
+from app.ocr.backends import build_ocr_registry  # noqa: E402
+from app.ocr.quality import run_ocr_quality  # noqa: E402
+from app.providers import build_registry  # noqa: E402
+from app.rag_store import VectorStore  # noqa: E402
+from app.router import ModelRouter  # noqa: E402
+REPORT = ROOT / "backend" / "evals" / "ocr_quality_report.json"
+def main() -> None:
+    s = get_settings()
+    metrics = MetricsStore(s.metrics_db_path)
+    router = ModelRouter(build_registry(s), s, metrics)
+    ocr = build_ocr_registry(s)
+    db = Database(s.app_db_path)
+    rag = VectorStore(s.rag_db_path)
+    report = run_ocr_quality(s, ocr, router, metrics, db=db, rag_store=rag)
+    REPORT.write_text(json.dumps(report, indent=2))
+    snap_dir = s.writable_dir / "metrics_snapshots"
+    snap_dir.mkdir(parents=True, exist_ok=True)
+    (snap_dir / f"ocr_quality_{time.strftime('%Y%m%dT%H%M%S')}.json").write_text(json.dumps(report, indent=2))
+    def pct(v):
+        return "n/a" if v is None else f"{v*100:.1f}%"
+    print("\n" + "=" * 90)
+    print(" OCR OUTPUT QUALITY + DOCUMENT ANALYSIS  (lower CER/WER = better; higher field-acc = better)")
+    print("=" * 90)
+    print(f" {'backend':<11}{'model':<17}{'params':>7}{'CER':>8}{'WER':>8}{'field-exact':>13}{'F1':>8}{'lat(ms)':>9}{'$/doc':>9}")
+    print("-" * 90)
+    for r in report["backends"]:
+        params = f"{r['params_b']}B" if r.get("params_b") else "—"
+        print(f" {r['backend']:<11}{(r.get('model') or '')[:16]:<17}{params:>7}"
+              f"{pct(r['cer']):>8}{pct(r['wer']):>8}{pct(r['field_exact_match']):>13}"
+              f"{pct(r['field_f1']):>8}{(r['avg_latency_ms'] or 0):>9.0f}{(r['avg_cost_usd'] or 0):>9.5f}")
+    print("-" * 90)
+    print(f" best OCR text quality   : {report['best_ocr_quality']}")
+    print(f" best document analysis  : {report['best_document_analysis']}")
+    print(f" published → {REPORT}")
+    print("=" * 90 + "\n")
+if __name__ == "__main__":
+    main()
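
Review note: the table printed above reports CER and WER, but the scorer itself lives in `app.ocr.quality` and is not shown in this part of the diff. The following is a minimal sketch of how these metrics are conventionally defined (Levenshtein edit distance divided by reference length, over characters for CER and over whitespace-separated words for WER). It is an assumption about the usual definitions, not a copy of the project's implementation; the function names are illustrative. Returning `None` for an empty reference mirrors the "n/a" case handled by `pct`.

```python
from __future__ import annotations

from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Unit-cost edit distance (insert / delete / substitute) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a reference item
                cur[j - 1] + 1,          # insert a hypothesis item
                prev[j - 1] + (r != h),  # substitute, or free match
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float | None:
    """Character error rate; None when the reference is empty."""
    if not reference:
        return None
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float | None:
    """Word error rate over whitespace-separated tokens; None when the reference is empty."""
    ref_words = reference.split()
    if not ref_words:
        return None
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


print(cer("PO-77RX-3309", "PO-77RX-3809"))                   # one substituted character out of 12
print(wer("Payment Terms: Net 45", "Payment Terms: Net 46"))  # one wrong word out of 4
```

Both metrics can exceed 1.0 when the OCR output is much longer than the reference, which is worth keeping in mind when reading the percentage columns.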

scripts/ocr_smoke.py ADDED

	@@ -0,0 +1,54 @@

+#!/usr/bin/env python3
+"""Run every available OCR backend against real scanned samples and write a
+tracked report (backend/evals/ocr_backend_report.json).
+    python scripts/ocr_smoke.py
+Reads backend/.env, so configured backends (e.g. MiniCPM) are exercised live.
+Unavailable backends (missing deps/keys) are recorded with the reason.
+"""
+from __future__ import annotations
+import json
+import sys
+from pathlib import Path
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT / "backend"))
+from app.config import get_settings  # noqa: E402
+from app.ocr.backends import build_ocr_registry  # noqa: E402
+from app.ocr.backends.healthcheck import run_ocr_backend_tests  # noqa: E402
+REPORT_PATH = ROOT / "backend" / "evals" / "ocr_backend_report.json"
+def main() -> None:
+    s = get_settings()
+    reg = build_ocr_registry(s)
+    report = run_ocr_backend_tests(s, reg)
+    REPORT_PATH.write_text(json.dumps(report, indent=2))
+    print("\n" + "=" * 78)
+    print(f" OCR BACKEND REAL-EXTRACTION REPORT   (mode={report['mode']})")
+    print("=" * 78)
+    print(f" {'backend':<12}{'tier':<8}{'available':<11}{'functional':<11}{'engine / reason'}")
+    print("-" * 78)
+    for b in report["backends"]:
+        if b["available"]:
+            case = b["cases"][0] if b["cases"] else {}
+            detail = f"{case.get('engine','')}  ({case.get('chars',0)} chars, {case.get('latency_ms',0)}ms)"
+            func = "✓ yes" if b["ok"] else "✗ no"
+        else:
+            detail = str(b.get("requires") or "")  # tolerate a missing or non-string reason
+            func = "—"
+        print(f" {b['name']:<12}{b['tier']:<8}{('yes' if b['available'] else 'no'):<11}{func:<11}{detail[:42]}")
+    print("-" * 78)
+    print(f" available : {report['available_backends']}")
+    print(f" functional: {report['functional_backends']}")
+    print(f" report → {REPORT_PATH}")
+    print("=" * 78 + "\n")
+if __name__ == "__main__":
+    main()

scripts/run_dev.sh ADDED

	@@ -0,0 +1,35 @@

+#!/usr/bin/env bash
+# Launch the Aperture backend (FastAPI) and frontend (Vite) together.
+set -euo pipefail
+ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+cd "$ROOT"
+echo "▶ Aperture dev launcher"
+# 1) ensure samples exist
+if [ ! -f backend/evals/datasets/invoice_acme_digital.gt.json ]; then
+  echo "  · generating sample corpus…"
+  python3 scripts/generate_samples.py >/dev/null
+fi
+# 2) backend
+echo "  · starting backend on :8000"
+( cd backend && exec uvicorn app.main:app --port 8000 --reload ) &  # exec: $! is uvicorn itself, so kill reaches it
+BACK=$!
+# 3) frontend
+if [ ! -d frontend/node_modules ]; then
+  echo "  · installing frontend deps (first run)…"
+  ( cd frontend && npm install --silent )
+fi
+echo "  · starting frontend on :5173"
+( cd frontend && exec npm run dev ) &  # exec: $! is the npm process, not a wrapper subshell
+FRONT=$!
+trap 'echo; echo "stopping…"; kill $BACK $FRONT 2>/dev/null || true' INT TERM
+echo
+echo "  backend:  http://localhost:8000/docs"
+echo "  frontend: http://localhost:5173"
+echo "  (Ctrl-C to stop)"
+wait

scripts/test_ocr.py ADDED

	@@ -0,0 +1,57 @@

+#!/usr/bin/env python3
+"""Quick OCR backend tester.
+    python scripts/test_ocr.py <sample_id_or_path> [--backend auto|minicpm|cohere|llamaparse|tesseract|easyocr|sidecar]
+Examples:
+    python scripts/test_ocr.py invoice_scanned_basic
+    python scripts/test_ocr.py invoice_scanned_basic --backend minicpm
+    python scripts/test_ocr.py /path/to/receipt.png --backend cohere
+"""
+from __future__ import annotations
+import argparse
+import sys
+from pathlib import Path
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT / "backend"))
+from app.config import get_settings  # noqa: E402
+from app.ocr.backends import build_ocr_registry  # noqa: E402
+def resolve(arg: str, settings) -> Path | None:
+    p = Path(arg)
+    if p.exists():
+        return p
+    for ext in (".pdf", ".png", ".jpg", ".jpeg"):
+        cand = settings.evals_dataset_dir / f"{arg}{ext}"
+        if cand.exists():
+            return cand
+    return None
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("doc")
+    ap.add_argument("--backend", default="auto")
+    args = ap.parse_args()
+    s = get_settings()
+    path = resolve(args.doc, s)
+    if not path:
+        print(f"not found: {args.doc}", file=sys.stderr)
+        sys.exit(1)
+    reg = build_ocr_registry(s)
+    print(f"available backends: {reg.available_names()}")
+    res, attempts = reg.extract(path, args.backend)
+    print(f"\nattempts: {attempts}")
+    print(f"\nengine={res.engine} tier={res.tier} pages={res.pages} "
+          f"conf={res.confidence} chars={len(res.text)} simulated={res.simulated}")
+    print("\n--- text (first 1200 chars) ---")
+    print(res.text[:1200])
+if __name__ == "__main__":
+    main()