Spaces:
Running on Zero
Running on Zero
File size: 8,649 Bytes
70650b7 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 | # M24 β Reranking Service
**Spec version:** v2.0
**Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [M01 Identity](../../modules/M01-identity.md), [X03 Observability](../../cross-cutting/X03-observability.md)
**Depended on by:** [M05 RAG](../../modules/M05-rag.md) (extension), [M06 Marketplace](../../modules/M06-marketplace.md) (extension)
---
## 1. Responsibility
Re-score a candidate list of documents against a query using a cross-encoder, producing a higher-precision ordering than dense retrieval alone can deliver.
The capability is intentionally narrow: take query + N short docs, return ranked list. The service does **not** retrieve documents, does **not** fetch from blobs, does **not** know about corpora. Callers (typically `rag.query` and `market.search`) do retrieval first, then ask the reranker to refine the top 100.
This is the smallest service in Phase 2 β one model, one method, no streaming β and the most underrated. Adding it to the RAG pipeline lifts answer quality more than any other Phase 2 module.
---
## 2. File layout
```
hearthnet/services/rerank/
βββ __init__.py
βββ service.py # RerankService β capability registration
βββ selection.py # Picks a backend; loads on demand
βββ backends/
βββ base.py # RerankBackend ABC
βββ bge_reranker.py # BGE-reranker-v2-m3 (default)
βββ cross_encoder.py # Sentence-transformers fallback
```
---
## 3. Public API
### 3.1 `RerankBackend` (ABC)
```python
class RerankBackend(Protocol):
name: str # e.g. "BAAI/bge-reranker-v2-m3"
max_doc_chars: int # truncate longer docs
async def score(self, query: str, documents: list[str]) -> list[float]:
"""Return one score per document, same length and order."""
...
async def health(self) -> dict[str, Any]:
"""Return {"ok": bool, "loaded": bool, "model_id": ..., ...}."""
...
```
### 3.2 `BgeRerankerBackend`
```python
class BgeRerankerBackend:
def __init__(self, model_id: str = "BAAI/bge-reranker-v2-m3", device: str = "auto", max_batch: int = 32):
...
async def score(self, query: str, documents: list[str]) -> list[float]:
# Tokenise (query, doc) pairs in batches of max_batch
# Forward pass; pooled logit becomes the score
# Higher = more relevant
...
```
### 3.3 `RerankService`
```python
class RerankService:
"""Bus-facing facade. Picks backend, enforces RERANK_MAX_DOCS, emits metrics."""
def __init__(self, bus: CapabilityBus, settings: RerankSettings, observability: Observability):
...
async def start(self) -> None:
# Register `rerank.text@1.0` on the bus
...
async def rerank_text(self, body: RerankRequest) -> RerankResponse:
# 1. Validate len(documents) <= RERANK_MAX_DOCS (else bad_request)
# 2. Pick backend per `body.params.model` or default
# 3. Truncate docs to backend.max_doc_chars
# 4. Call backend.score
# 5. Sort descending, take top_k (or all)
# 6. Emit `rerank.latency_ms` metric
...
```
### 3.4 Request / response dataclasses
```python
@dataclass
class RerankDoc:
id: str
text: str
@dataclass
class RerankRequest:
query: str
documents: list[RerankDoc]
top_k: int = 10
params: dict[str, Any] = field(default_factory=dict) # {"model": "..."}
@dataclass
class RerankedDoc:
id: str
score: float
@dataclass
class RerankResponse:
ranked: list[RerankedDoc]
meta: dict[str, Any]
```
---
## 4. Behaviour
### 4.1 Backend selection
`params.model` is matched against installed backends (key = HuggingFace model id). Default is `BAAI/bge-reranker-v2-m3` because it handles β₯100 languages including German and Latin (relevant for the OCR'd historical doc corpus).
If `params.model` is supplied but unknown β return `bad_request` with the list of installed backends.
### 4.2 Cold start
Backend is loaded lazily on first call. First call latency budget: β€ 60s on the RTX 5090 (model ~2 GB on disk). Subsequent calls: β€ 200ms for 50 docs at ~512 chars each.
The service publishes `model_loaded` and `model_loading` health states; `rerank.text` calls during loading wait up to `RERANK_LOAD_TIMEOUT_SECONDS` (default 60) then return `unavailable`.
### 4.3 Score semantics
Scores are **raw logits**, not normalised probabilities. They are comparable within a single call but not across calls or backends. Callers MUST NOT compare a 0.91 score from BGE to a 0.91 from cross-encoder/ms-marco β different scales.
### 4.4 Truncation
Documents longer than `backend.max_doc_chars` (default 2048) are truncated. The service logs `rerank.docs_truncated` counter. Truncation is from the right; callers who care about specific spans should pre-summarise or chunk before passing in.
### 4.5 No streaming
`rerank.text@1.0` is non-streaming. Even at 100 docs the latency is well under 1s on GPU. If a Phase-3 use case demands streaming (e.g. 1000-doc reranks for academic search), introduce `rerank.text@2.0` with `progress` frames; do not retrofit v1.
### 4.6 Integration with RAG (M05 extension)
`rag.query` in Phase 2 grows an internal pipeline:
```
1. Hybrid retrieval (dense + BM25) β top 100 candidates
2. Optional call to rerank.text@1.0 β top 10
3. Pass top 10 to llm.chat as context
```
The hop to `rerank.text` is done via the bus, not via direct import. This keeps the policy ("which model?", "is reranking available?") in the service and out of the RAG core.
If `rerank.text@1.0` is unavailable in the local mesh, RAG falls back to dense scores alone and logs `rag.rerank_skipped` counter (not an error).
### 4.7 Integration with Marketplace (M06 extension)
`market.search` follows the same pattern when the query is natural-language. For tag-based queries it skips reranking.
---
## 5. Errors
| Code | Cause |
|------|-------|
| `bad_request` | `len(documents) > RERANK_MAX_DOCS`, empty query, malformed payload |
| `unavailable` | Backend loading or hardware unavailable |
| `model_not_found` | Requested `params.model` is not installed |
`unavailable` is retryable; the other two are not.
---
## 6. Configuration
```toml
[services.rerank]
enabled = true
default_model = "BAAI/bge-reranker-v2-m3"
device = "auto" # "auto" | "cuda" | "cpu"
max_batch = 32
max_doc_chars = 2048
load_timeout_seconds = 60
trust_required = "member"
```
Behind a feature flag: when `enabled=false`, the capability simply does not register and RAG falls back to dense-only.
---
## 7. Tests
### 7.1 Unit
- Sorting: scores `[0.1, 0.9, 0.5]` produce ranked order `[1, 2, 0]`
- Truncation: 4000-char doc gets truncated to 2048 before scoring
- `top_k` honoured; returns at most `top_k` results
- Bad request when `documents=[]` or `len > RERANK_MAX_DOCS`
### 7.2 Integration
- End-to-end: `rag.query` with reranking vs without, on the niederrhein-emergency corpus, asserts at least one expected document moves into top 3 with rerank that wasn't there without
- Cross-language: German query, mixed German/English candidates, BGE reranker should put the German candidate first when relevance is equal
### 7.3 Performance
- 100 docs @ 1024 chars: p50 β€ 300ms on RTX 5090; p95 β€ 600ms
- CPU fallback (no GPU): p50 β€ 4s for 50 docs (acceptable; degraded)
### 7.4 Failure-mode
- Backend crash mid-call: caller receives `unavailable`; service self-heals on next call
- Concurrent calls: 20 parallel reranks should not deadlock; backend serialises behind a single semaphore
---
## 8. Cross-references
- Capability spec: [CAPABILITY_CONTRACT_v2 Β§4.15](../CAPABILITY_CONTRACT_v2.md#415-reranktext10)
- Used by: M05 RAG extension, M06 Marketplace extension
- Observability: emits `rerank.calls_total`, `rerank.latency_ms`, `rerank.docs_truncated`, `rerank.errors_total{code}`
---
## 9. Open questions
1. **Reciprocal rank fusion** with dense scores as the alternative when rerank is unavailable β worth implementing in M05 as the fallback path?
2. **ColBERT-style late interaction** β heavier model, higher quality. Worth a second backend, or wait for Phase 3 to evaluate?
3. **Reranker for code/diff content** β different model family (e.g. `BAAI/bge-code-reranker`). Should `params.model` selection be auto-inferred from query/doc content?
4. **Caching** β query+doc-pair hash β score, evict LRU. Worth it for repeated queries in chat-driven RAG sessions, or premature optimisation?
|