HearthNet-Nemotron

Running on Zero

File size: 8,649 Bytes

70650b7

# M24 — Reranking Service

**Spec version:** v2.0
**Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [M01 Identity](../../modules/M01-identity.md), [X03 Observability](../../cross-cutting/X03-observability.md)
**Depended on by:** [M05 RAG](../../modules/M05-rag.md) (extension), [M06 Marketplace](../../modules/M06-marketplace.md) (extension)

---

## 1. Responsibility

Re-score a candidate list of documents against a query using a cross-encoder, producing a higher-precision ordering than dense retrieval alone can deliver.

The capability is intentionally narrow: take query + N short docs, return ranked list. The service does **not** retrieve documents, does **not** fetch from blobs, does **not** know about corpora. Callers (typically `rag.query` and `market.search`) do retrieval first, then ask the reranker to refine the top 100.

This is the smallest service in Phase 2 — one model, one method, no streaming — and the most underrated. Adding it to the RAG pipeline lifts answer quality more than any other Phase 2 module.

---

## 2. File layout

```
hearthnet/services/rerank/
├── __init__.py
├── service.py            # RerankService — capability registration
├── selection.py          # Picks a backend; loads on demand
└── backends/
    ├── base.py           # RerankBackend ABC
    ├── bge_reranker.py   # BGE-reranker-v2-m3 (default)
    └── cross_encoder.py  # Sentence-transformers fallback
```

---

## 3. Public API

### 3.1 `RerankBackend` (ABC)

```python
class RerankBackend(Protocol):
    name: str            # e.g. "BAAI/bge-reranker-v2-m3"
    max_doc_chars: int   # truncate longer docs

    async def score(self, query: str, documents: list[str]) -> list[float]:
        """Return one score per document, same length and order."""
        ...

    async def health(self) -> dict[str, Any]:
        """Return {"ok": bool, "loaded": bool, "model_id": ..., ...}."""
        ...
```

### 3.2 `BgeRerankerBackend`

```python
class BgeRerankerBackend:
    def __init__(self, model_id: str = "BAAI/bge-reranker-v2-m3", device: str = "auto", max_batch: int = 32):
        ...

    async def score(self, query: str, documents: list[str]) -> list[float]:
        # Tokenise (query, doc) pairs in batches of max_batch
        # Forward pass; pooled logit becomes the score
        # Higher = more relevant
        ...
```

### 3.3 `RerankService`

```python
class RerankService:
    """Bus-facing facade.  Picks backend, enforces RERANK_MAX_DOCS, emits metrics."""

    def __init__(self, bus: CapabilityBus, settings: RerankSettings, observability: Observability):
        ...

    async def start(self) -> None:
        # Register `rerank.text@1.0` on the bus
        ...

    async def rerank_text(self, body: RerankRequest) -> RerankResponse:
        # 1. Validate len(documents) <= RERANK_MAX_DOCS (else bad_request)
        # 2. Pick backend per `body.params.model` or default
        # 3. Truncate docs to backend.max_doc_chars
        # 4. Call backend.score
        # 5. Sort descending, take top_k (or all)
        # 6. Emit `rerank.latency_ms` metric
        ...
```

### 3.4 Request / response dataclasses

```python
@dataclass
class RerankDoc:
    id: str
    text: str

@dataclass
class RerankRequest:
    query:      str
    documents:  list[RerankDoc]
    top_k:      int = 10
    params:     dict[str, Any] = field(default_factory=dict)   # {"model": "..."}

@dataclass
class RerankedDoc:
    id:    str
    score: float

@dataclass
class RerankResponse:
    ranked: list[RerankedDoc]
    meta:   dict[str, Any]
```

---

## 4. Behaviour

### 4.1 Backend selection

`params.model` is matched against installed backends (key = HuggingFace model id). Default is `BAAI/bge-reranker-v2-m3` because it handles ≥100 languages including German and Latin (relevant for the OCR'd historical doc corpus).

If `params.model` is supplied but unknown → return `bad_request` with the list of installed backends.

### 4.2 Cold start

Backend is loaded lazily on first call. First call latency budget: ≤ 60s on the RTX 5090 (model ~2 GB on disk). Subsequent calls: ≤ 200ms for 50 docs at ~512 chars each.

The service publishes `model_loaded` and `model_loading` health states; `rerank.text` calls during loading wait up to `RERANK_LOAD_TIMEOUT_SECONDS` (default 60) then return `unavailable`.

### 4.3 Score semantics

Scores are **raw logits**, not normalised probabilities. They are comparable within a single call but not across calls or backends. Callers MUST NOT compare a 0.91 score from BGE to a 0.91 from cross-encoder/ms-marco — different scales.

### 4.4 Truncation

Documents longer than `backend.max_doc_chars` (default 2048) are truncated. The service logs `rerank.docs_truncated` counter. Truncation is from the right; callers who care about specific spans should pre-summarise or chunk before passing in.

### 4.5 No streaming

`rerank.text@1.0` is non-streaming. Even at 100 docs the latency is well under 1s on GPU. If a Phase-3 use case demands streaming (e.g. 1000-doc reranks for academic search), introduce `rerank.text@2.0` with `progress` frames; do not retrofit v1.

### 4.6 Integration with RAG (M05 extension)

`rag.query` in Phase 2 grows an internal pipeline:

```
1. Hybrid retrieval (dense + BM25) → top 100 candidates
2. Optional call to rerank.text@1.0 → top 10
3. Pass top 10 to llm.chat as context
```

The hop to `rerank.text` is done via the bus, not via direct import. This keeps the policy ("which model?", "is reranking available?") in the service and out of the RAG core.

If `rerank.text@1.0` is unavailable in the local mesh, RAG falls back to dense scores alone and logs `rag.rerank_skipped` counter (not an error).

### 4.7 Integration with Marketplace (M06 extension)

`market.search` follows the same pattern when the query is natural-language. For tag-based queries it skips reranking.

---

## 5. Errors

| Code | Cause |
|------|-------|
| `bad_request` | `len(documents) > RERANK_MAX_DOCS`, empty query, malformed payload |
| `unavailable` | Backend loading or hardware unavailable |
| `model_not_found` | Requested `params.model` is not installed |

`unavailable` is retryable; the other two are not.

---

## 6. Configuration

```toml
[services.rerank]
enabled                  = true
default_model            = "BAAI/bge-reranker-v2-m3"
device                   = "auto"             # "auto" | "cuda" | "cpu"
max_batch                = 32
max_doc_chars            = 2048
load_timeout_seconds     = 60
trust_required           = "member"
```

Behind a feature flag: when `enabled=false`, the capability simply does not register and RAG falls back to dense-only.

---

## 7. Tests

### 7.1 Unit
- Sorting: scores `[0.1, 0.9, 0.5]` produce ranked order `[1, 2, 0]`
- Truncation: 4000-char doc gets truncated to 2048 before scoring
- `top_k` honoured; returns at most `top_k` results
- Bad request when `documents=[]` or `len > RERANK_MAX_DOCS`

### 7.2 Integration
- End-to-end: `rag.query` with reranking vs without, on the niederrhein-emergency corpus, asserts at least one expected document moves into top 3 with rerank that wasn't there without
- Cross-language: German query, mixed German/English candidates, BGE reranker should put the German candidate first when relevance is equal

### 7.3 Performance
- 100 docs @ 1024 chars: p50 ≤ 300ms on RTX 5090; p95 ≤ 600ms
- CPU fallback (no GPU): p50 ≤ 4s for 50 docs (acceptable; degraded)

### 7.4 Failure-mode
- Backend crash mid-call: caller receives `unavailable`; service self-heals on next call
- Concurrent calls: 20 parallel reranks should not deadlock; backend serialises behind a single semaphore

---

## 8. Cross-references

- Capability spec: [CAPABILITY_CONTRACT_v2 §4.15](../CAPABILITY_CONTRACT_v2.md#415-reranktext10)
- Used by: M05 RAG extension, M06 Marketplace extension
- Observability: emits `rerank.calls_total`, `rerank.latency_ms`, `rerank.docs_truncated`, `rerank.errors_total{code}`

---

## 9. Open questions

1. **Reciprocal rank fusion** with dense scores as the alternative when rerank is unavailable — worth implementing in M05 as the fallback path?
2. **ColBERT-style late interaction** — heavier model, higher quality. Worth a second backend, or wait for Phase 3 to evaluate?
3. **Reranker for code/diff content** — different model family (e.g. `BAAI/bge-code-reranker`). Should `params.model` selection be auto-inferred from query/doc content?
4. **Caching** — query+doc-pair hash → score, evict LRU. Worth it for repeated queries in chat-driven RAG sessions, or premature optimisation?