# tau-rag — Production Deployment Guide This guide covers everything you need to run tau-rag in production: image build, secrets, deployment (Docker Compose or Kubernetes), observability, backups, and incident response. --- ## 1. What you get - **Hardened Docker image** (`Dockerfile.prod`): non-root user (UID 10001), gunicorn + uvicorn workers, tini as PID 1, HEALTHCHECK, multi-stage build. - **Docker Compose stack** (`docker-compose.prod.yml`): app + nginx TLS proxy + Prometheus + Grafana. - **Kubernetes manifests** (`deployment/k8s/`): Deployment (3 replicas, HPA 3→12), Service, Ingress with cert-manager, PDB (minAvailable=2), NetworkPolicy (default deny + explicit allows), startup/liveness/readiness probes tuned for AlephBERT warmup. - **GitHub Actions** (`.github/workflows/`): lint + 3× Python test matrix + build & push to GHCR + Trivy security scan; separate CD pipeline for tagged releases. - **Corpus loader** (`scripts/init_corpus.py`): loads LawDBHeb on first boot; idempotent via marker file. - **Prometheus alert rules** (`deployment/prometheus/rules/`): availability, latency, retrieval flakiness, calibration drift, numeric hallucination. - **Grafana dashboard** (`deployment/grafana/dashboards/`): QPS, 5xx rate, p50/p95/p99 latency by stage, answer-quality signals. - **Production preset** (`configs/hebrew_legal_prod.json`): cache on, reranker on, MMR diversity, numeric-consistency gate, rate limits, all middleware flags enabled. --- ## 2. Prerequisites - Docker Desktop / OrbStack / Colima with `docker compose v2` - (optional) `kubectl` + `helm` for K8s deploy - (optional) cert-manager + ingress-nginx if using the K8s path - LawDBHeb corpus accessible at a readable path (local disk or mount) - Anthropic (or OpenAI) API key --- ## 3. First-time setup ```bash cd tau_rag cp .env.example .env # Edit .env — at minimum set: # ANTHROPIC_API_KEY # TAU_RAG_HMAC_SECRET (openssl rand -hex 32) # LAWDBHEB_HOST_PATH (absolute path to LawDBHeb on host) # TAU_RAG_SEED_ADMIN_KEY (first-boot admin key; rotate after) # Generate TLS certs (dev). For prod use letsencrypt/cert-manager. mkdir -p deployment/nginx/certs openssl req -x509 -nodes -days 365 -newkey rsa:2048 \ -keyout deployment/nginx/certs/privkey.pem \ -out deployment/nginx/certs/fullchain.pem \ -subj "/CN=tau-rag.local" ``` --- ## 4. Deploy — Docker Compose ```bash make prod-up # build + start app + nginx + prometheus + grafana make prod-ps # status make prod-logs # tail logs make prod-down # stop everything ``` Access: - API: `https://localhost/v1/health` (self-signed cert in dev) - Grafana: `http://localhost:3000` (admin / change-me) - Prometheus: `http://localhost:9090` (scraped via docker network) First-boot behavior: `scripts/entrypoint.sh` runs `init_corpus.py` if `$TAU_RAG_RUNTIME_DIR/.corpus_loaded` is absent. Warm-up can take ~60s while AlephBERT loads. --- ## 5. Deploy — Kubernetes ```bash # 1. Build + push image (or use CI — it does this on tag) make build-prod docker tag tau-rag:prod ghcr.io//tau-rag:v1.0.0 docker push ghcr.io//tau-rag:v1.0.0 # 2. Create real secret (do NOT commit) cp deployment/k8s/secret.example.yaml deployment/k8s/secret.yaml # edit secret.yaml # 3. Apply manifests make k8s-apply # 4. Watch rollout make k8s-status make k8s-logs ``` Persistent volumes required: - `tau-rag-runtime` — 10Gi RWO — stores signals.jsonl, snapshots, dashboards. - `lawdbheb-corpus` — 20Gi ROX — mount the LawDBHeb pickle/parquet data. The HPA scales 3→12 replicas on CPU>70% or memory>80%. PDB keeps at least 2 pods during voluntary disruptions (node drain, etc). NetworkPolicy denies all by default and allows only: ingress-nginx →:8000, prometheus scraper →:8000, egress DNS + egress 443 to public (for LLM API). --- ## 6. Secrets management **Never commit `.env`, `secret.yaml`, or certs.** Recommended options: - **Local/compose**: `.env` on host, restrictive file perms. - **K8s**: External Secrets Operator → AWS Secrets Manager / GCP Secret Manager / HashiCorp Vault. - **GitHub Actions**: repository secrets, referenced as `${{ secrets.X }}`. Rotation procedure: 1. Generate new key in provider 2. Create a new k8s Secret revision or update .env 3. `kubectl rollout restart deploy/tau-rag -n tau-rag` — zero-downtime. --- ## 7. Observability **Built-in endpoints:** - `GET /v1/health` — liveness (process up?). - `GET /v1/readiness` — readiness (corpus loaded, retrievers ready?). - `GET /metrics` — Prometheus text exposition (bytes, counters, histograms). - `GET /v1/version` — version + feature flags. **Key metrics exposed** (prefix `tau_rag_`): - `requests_total{endpoint,status}` — request counter - `request_duration_seconds_bucket{endpoint}` — latency histogram - `retrieval_duration_seconds_bucket{retriever}` — per-retriever latency - `generation_duration_seconds_bucket` — LLM call latency - `answer_grounding_ratio` — v3.30 grounding KPI - `numeric_support_ratio` — v3.77 numeric-consistency KPI - `coverage_ratio` — v3.60 term coverage - `flaky_queries_count` — v3.64 rank-stability count - `calibration_ece` — v3.55 expected calibration error - `numeric_unsupported_total` — v3.77 hallucination counter **Alerts** (`deployment/prometheus/rules/tau-rag-alerts.yml`): availability, 5xx rate, p95/p99 latency, flaky retrieval, ECE drift, numeric hallucinations. Wire `alertmanagers` in `prometheus.yml` to your on-call channel. **Dashboards**: the provided Grafana JSON loads as "tau-rag · Overview". Extend via the admin UI (auto-reloaded from `/etc/grafana/provisioning`). --- ## 8. Backups `scripts/backup_runtime.sh` tars `TAU_RAG_RUNTIME_DIR` to a timestamped path. Run via cron (host) or a `CronJob` (k8s): ```bash 0 3 * * * /app/scripts/backup_runtime.sh /backups/tau-rag/ ``` Set `TAU_RAG_BACKUP_RETENTION_DAYS=14` to rotate automatically. The LawDBHeb corpus itself is mounted read-only and treated as immutable — rebuild the index via your ingestion pipeline when the upstream corpus changes. --- ## 9. Upgrading **Zero-downtime rolling upgrade:** Docker Compose: ```bash docker compose -f docker-compose.prod.yml --env-file .env up -d --build tau-rag ``` Kubernetes: ```bash kubectl -n tau-rag set image deploy/tau-rag \ tau-rag=ghcr.io//tau-rag:v1.2.0 kubectl -n tau-rag rollout status deploy/tau-rag # rollback if needed: kubectl -n tau-rag rollout undo deploy/tau-rag ``` `rolling-update.maxSurge=1, maxUnavailable=0` keeps old pods until new pods pass the readiness probe. --- ## 10. Incident response runbook ### Symptom: 5xx error rate > 5% 1. Check `TauRagHighErrorRate` alert in Prometheus for top endpoints. 2. `kubectl -n tau-rag logs -l app.kubernetes.io/name=tau-rag --tail=200`. 3. Look for stack traces or OOMKilled. 4. If LLM provider down: flip `TAU_RAG_PRESET=no_llm` via ConfigMap → `kubectl rollout restart` — extractive fallback. 5. If bad deploy: `kubectl rollout undo deploy/tau-rag`. ### Symptom: p95 latency > 2s 1. Check `TauRagP95HighLatency`; look at retrieval vs generation histograms. 2. If retrieval dominant → check `tau_rag_flaky_queries_count`, scale replicas. 3. If generation dominant → check LLM provider status, reduce `max_tokens`, or switch `llm_router` to a cheaper model. 4. Scale: `kubectl -n tau-rag scale deploy/tau-rag --replicas=8` (HPA also does this automatically on CPU/memory). ### Symptom: Numeric hallucinations spike (v3.77) 1. `TauRagAnswerNumericHallucinations` firing means the LLM is inventing section/case numbers. 2. Review recent prompt or preset changes. 3. Tighten `verify.min_support_ratio` in the preset, or force regen via the v3.73 coverage-gap `REC_REGENERATE` path. ### Symptom: Calibration drift (v3.55) 1. `TauRagLowConfidenceCalibration` — ECE > 0.2 means reported confidence doesn't match reality. 2. Run the admin endpoint `POST /v1/admin/confidence_calibrator/analyze` to see per-bucket calibration error. 3. The calibrator continuously re-maps; if persistently off, reduce LLM temperature or switch model. ### Symptom: Disk fills up on runtime volume 1. Check signals.jsonl size — should rotate but check log_rotation. 2. `kubectl -n tau-rag exec deploy/tau-rag -- du -sh /app/runtime/*` 3. Increase PVC size or tune `retention` middleware. --- ## 11. Security checklist - [x] Non-root container (UID 10001) - [x] `allowPrivilegeEscalation=false`, capabilities dropped - [x] Pinned requirements.txt (reproducible builds) - [x] Trivy scans in CI (fail on CRITICAL/HIGH) - [x] HMAC signing on admin endpoints - [x] Rate limiting (nginx + in-app TokenBucket) - [x] NetworkPolicy default-deny in K8s - [x] TLS on nginx ingress; HSTS header - [x] No secrets in image; runtime injection only - [ ] **Rotate seeded admin API key on first login** (you have to do this manually) - [ ] **Enable WAF** at your LB if exposed to internet - [ ] **Audit logs** exported to SIEM (use `/v1/audit_export` endpoint) --- ## 12. Gaps / known limitations - The in-memory middleware state (cache, stats, histories) does not persist across pod restarts; rely on `runtime/` volume for snapshots. - Corpus reload requires a pod restart (no hot-reload path yet). - Some v3.x middleware modules are admin-only (not wired into the main query pipeline). Turn them on via `configs/hebrew_legal_prod.json` → `middleware_flags.*: true`. - Local LawDBHeb pickles (~1.5 GB for bm25+chunks) must fit in the pod memory footprint (6 Gi limit default — tune via HPA + deployment). --- ## 13. Files you should know about ``` tau_rag/ ├── Dockerfile.prod # hardened image ├── docker-compose.prod.yml # full stack ├── requirements.txt # pinned prod deps ├── requirements-dev.txt # + pytest/ruff/mypy ├── .env.example # all env vars documented ├── .dockerignore # keep image slim ├── configs/hebrew_legal_prod.json # production preset ├── scripts/ │ ├── entrypoint.sh # dispatcher │ ├── healthcheck.sh # docker/k8s probe │ ├── init_corpus.py # first-boot corpus load │ └── backup_runtime.sh # runtime volume tar ├── deployment/ │ ├── nginx/ # TLS proxy + rate limiting │ ├── prometheus/ # scrape config + alert rules │ ├── grafana/ # dashboards + datasources │ └── k8s/ # Deployment, HPA, PDB, Ingress, NP ├── .github/workflows/ │ ├── ci.yml # test + build + scan │ └── cd.yml # deploy on tag └── PRODUCTION.md # this doc ``` Quick commands: `make help`