---
language:
- id
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: text-ranking
tags:
- reranker
- cross-encoder
- text-ranking
- indonesian
- bahasa-indonesia
- knowledge-distillation
- flashrank
- onnx
base_model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
datasets:
- google-research-datasets/tydiqa
- miracl/miracl
metrics:
- mrr
- ndcg
---
# rerank-indonesia
A lightweight **Indonesian (Bahasa Indonesia) cross-encoder reranker**, small
enough to serve on a cheap CPU VPS yet competitive with a 17× larger model.
It is built by **Margin-MSE knowledge distillation**: a strong multilingual
teacher, [`BAAI/bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3)
(568M params), supervises the tiny student
[`cross-encoder/mmarco-mMiniLMv2-L12-H384-v1`](https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1)
on in-domain Indonesian (query, positive, negative) triplets from **TyDi QA** and
**MIRACL-id** (with BM25 + dense hard-negative mining). The student learns the
teacher's score *margin* between relevant and non-relevant passages.
Built as part of [flashIndorank](https://github.com/madebyaris/flashIndorank).
## Evaluation
**MIRACL-id** official retrieve-then-rerank protocol (BM25 top-100 → rerank,
960 dev queries, `pytrec_eval`):
| model | params | nDCG@10 | MRR@10 | Recall@100 |
| --- | --- | --- | --- | --- |
| `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` (base) | tiny | 0.656 | 0.623 | 0.760 |
| **this model** (in-domain distillation) | **tiny** | **0.701** | **0.677** | 0.760 |
| `BAAI/bge-reranker-v2-m3` (teacher) | 568M | 0.712 | 0.689 | 0.760 |

The distilled student improves nDCG@10 by **+4.5 points** over the base while
staying within **~1 point of the 568M teacher**, roughly 98% of the teacher's
ranking quality at a fraction of the size and latency. (Recall@100 is set by the
BM25 first stage, so it is identical for every reranker and caps what reranking
can recover.)
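
For readers who want to sanity-check the two ranking metrics, here is a minimal pure-Python sketch of MRR@10 and nDCG@10 for a single query. It assumes linear gain and a log2 rank discount (the convention `trec_eval` uses, as far as we know); the function names are illustrative and this is not the harness that produced the table.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with linear gain; `relevance` maps doc id -> graded label."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
    )
    # The ideal ranking uses every judged document, including ones the
    # first stage never retrieved, which is why low recall caps nDCG.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Per-query values are averaged over all queries to get the reported numbers. A relevant document that BM25 never retrieved still counts in the ideal ranking, so no reranker can recover its gain.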
### How it compares to hosted commercial rerankers
An independent cross-system check on **300 MIRACL-id `dev` queries** (BM25 top-100
→ rerank). Every reranker is scored with the **same** harness, the **same** BM25
candidates, and the **same** metric implementation, so the comparison is
apples-to-apples. NVIDIA and Cohere were called through the OpenRouter rerank API.
| reranker | nDCG@10 | MRR@10 | cost / availability |
| --- | --- | --- | --- |
| BM25 (no rerank) | 0.393 | 0.330 | — |
| **this model** (int8 ONNX, CPU) | 0.655 | 0.633 | **free · local · offline** |
| `nvidia/llama-nemotron-rerank-vl-1b-v2` | 0.656 | 0.632 | hosted API |
| `cohere/rerank-v3.5` | 0.664 | 0.636 | paid API |
| `cohere/rerank-4-pro` | 0.665 | 0.640 | paid API |

Takeaways:
- **Effectively tied with NVIDIA's hosted reranker** (nDCG@10 0.655 vs 0.656, and
marginally *ahead* on MRR@10 at 0.633 vs 0.632), while running free and offline on CPU.
- Within **~0.01 nDCG (~1.5%)** of Cohere's strongest commercial reranker
(0.655 vs 0.665 for `cohere/rerank-4-pro`).
> Honesty note: the absolute scores in this comparison are lower than the
> **0.701** reported above (0.655 vs 0.701 nDCG@10) because this is a 300-query
> slice scored with flashIndorank's own metric harness, not the full 960-query
> `pytrec_eval` run. The **relative** standing (≈ NVIDIA, just under Cohere) is
> the point. A smaller 30-query slice was even noisier and is not a reliable
> signal; prefer these 300-query (or the full 960) numbers.
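
Gaps of 0.001 to 0.01 on 300 queries are small enough that sampling noise matters. One way to quantify that, sketched here as a suggestion rather than something the comparison above did, is a paired bootstrap over per-query scores. The function name and parameters are illustrative, and the per-query score lists have to come from your own evaluation run.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, seed=0):
    """Approximate 95% CI for mean(scores_a - scores_b).

    Both lists hold one metric value per query, in the same query order.
    Queries are resampled with replacement so the pairing is preserved.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same queries")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return low, high
```

If the returned interval contains 0, the difference between the two rerankers is within resampling noise for that query set.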
## Usage
### sentence-transformers
```python
from sentence_transformers import CrossEncoder

# Load the reranker from the Hugging Face Hub.
model = CrossEncoder("madebyaris/rerank-indonesia")

query = "Bagaimana cara menurunkan berat badan?"
passages = [
    "Olahraga teratur dan pola makan sehat membantu mengurangi bobot tubuh.",
    "Harga emas global naik tajam dalam sepekan terakhir.",
]

# Score each (query, passage) pair; a higher score means more relevant.
scores = model.predict([[query, p] for p in passages])
print(scores)
```
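
`predict` returns one score per (query, passage) pair in input order, so turning scores into a ranking is a sort. A small helper along these lines would do it; `rank_passages` is a hypothetical name, and the scores in the example are made-up placeholders rather than real model output.

```python
def rank_passages(passages, scores):
    """Pair each passage with its score and sort from most to least relevant."""
    return sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)

# Placeholder scores standing in for the output of model.predict(...).
example_scores = [0.2, 4.1]
ranked = rank_passages(["passage A", "passage B"], example_scores)
for passage, score in ranked:
    print(f"{score:.2f}  {passage}")
```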
### Lightweight ONNX (int8) via flashIndorank
```python
from huggingface_hub import snapshot_download
from flashindorank import CustomReranker
from flashrank import RerankRequest

# Download only the ONNX files from the model repo.
path = snapshot_download("madebyaris/rerank-indonesia", allow_patterns=["onnx/*"])
ranker = CustomReranker(f"{path}/onnx")

out = ranker.rerank(RerankRequest(
    query="Bagaimana cara menurunkan berat badan?",
    passages=[{"id": 1, "text": "Olahraga teratur dan pola makan sehat membantu mengurangi bobot tubuh."}],
))
print(out)
```
## Training
- Student / base: `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1`
- Teacher: `BAAI/bge-reranker-v2-m3`
- Method: Margin-MSE knowledge distillation (Hofstätter et al., 2020) —
`label = teacher(q, pos) - teacher(q, neg)`
- Data: in-domain Indonesian triplets from TyDi QA + MIRACL-id train,
BM25 + dense hard negatives
- Training setup: 3 epochs, lr 8e-6, bf16, `MarginMSELoss` (`CrossEncoderTrainer`)

See [TRAINING.md](https://github.com/madebyaris/flashIndorank/blob/main/TRAINING.md).
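
To make the Margin-MSE objective concrete, here is a minimal pure-Python sketch of the loss for a batch of triplets. It only illustrates the formula `label = teacher(q, pos) - teacher(q, neg)`; the actual training uses `MarginMSELoss` from sentence-transformers, and the function name below is an assumption for illustration.

```python
def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list with one score per (query, pos, neg) triplet.
    """
    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        label = tp - tn  # teacher margin, the regression target
        pred = sp - sn   # student margin
        total += (pred - label) ** 2
    return total / len(student_pos)
```

Because only the margin is matched, the student's scores may sit on a different scale from the teacher's as long as the gap between positive and negative is reproduced.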
## Roadmap — what's next to improve
The model is already quality-competitive with hosted rerankers; the remaining wins,
highest-leverage first:
1. **Close the small gap to Cohere (quality).** Re-distill on **combined** data —
mMARCO-id (~400k triplets) + in-domain TyDi/MIRACL-id upsampled ~3× so in-domain
signal isn't diluted — for a few more epochs. Targets pushing nDCG@10 past 0.70
toward the teacher ceiling.
2. **Stronger / ensemble teacher.** Distill from a larger teacher (e.g.
`BAAI/bge-reranker-v2-gemma`) or an ensemble of teacher margins to raise the
distillation ceiling above the current ~0.712.
3. **Harder negatives.** Re-mine negatives with a stronger *dense* retriever than the
one in the current BM25 + dense mix; cross-encoders learn most from hard negatives.
4. **Lift the real ceiling = better first-stage retrieval.** MIRACL nDCG is capped by
`Recall@100` (~0.71–0.76 here). A better retriever (multilingual-e5 / BGE-M3 dense,
or hybrid BM25+dense) raises the candidates the reranker sees — likely a bigger
end-to-end win than any reranker tweak.
5. **Faster CPU serving.** The int8 ONNX is quality-ready; latency is the lever.
Length-sorted mini-batching (cut padding waste), an optional multi-threaded ONNX
mode, and a lower default `max_length` (256) materially reduce CPU latency and RAM.
6. **Broaden evaluation.** Run the cross-system comparison on the full 960-query
MIRACL-id set and add other domains (e-commerce, news) so the quality claim
generalizes beyond Wikipedia QA.
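
The length-sorted mini-batching idea from item 5 can be sketched as follows. This is an illustration under stated assumptions, not flashIndorank's implementation: character length stands in for token length, and `score_batch` is a hypothetical callable that scores a list of (query, passage) pairs.

```python
def length_sorted_batches(pairs, batch_size=16):
    """Yield (original_indices, batch) with similarly sized pairs grouped together."""
    # Character count is a cheap proxy for token count (assumption).
    order = sorted(range(len(pairs)), key=lambda i: len(pairs[i][0]) + len(pairs[i][1]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [pairs[i] for i in idx]

def score_in_original_order(pairs, score_batch, batch_size=16):
    """Score length-sorted batches, then restore the caller's input order."""
    scores = [None] * len(pairs)
    for idx, batch in length_sorted_batches(pairs, batch_size):
        for i, score in zip(idx, score_batch(batch)):
            scores[i] = score
    return scores
```

Grouping pairs of similar length means each batch is padded only to its own longest member, so short passages no longer pay for the longest passage in the whole candidate list.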
## License
Apache-2.0, inherited from the base model. TyDi QA and MIRACL are Apache-2.0.