	---
	language:
	- id
	license: apache-2.0
	library_name: sentence-transformers
	pipeline_tag: text-ranking
	tags:
	- reranker
	- cross-encoder
	- text-ranking
	- indonesian
	- bahasa-indonesia
	- knowledge-distillation
	- flashrank
	- onnx
	base_model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
	datasets:
	- google-research-datasets/tydiqa
	- miracl/miracl
	metrics:
	- mrr
	- ndcg
	---

	# rerank-indonesia

	A lightweight Indonesian (Bahasa Indonesia) cross-encoder reranker, small
	enough to serve on a cheap CPU VPS yet competitive with a 17× larger model.

	It is built by Margin-MSE knowledge distillation: a strong multilingual
	teacher, [`BAAI/bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3)
	(568M params), supervises the tiny student
	[`cross-encoder/mmarco-mMiniLMv2-L12-H384-v1`](https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1)
	on in-domain Indonesian (query, positive, negative) triplets from TyDi QA and
	MIRACL-id (with BM25 + dense hard-negative mining). The student learns the
	teacher's score margin between relevant and non-relevant passages.
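
As a rough illustration of how such triplets might be assembled, the sketch below takes a first-stage ranking for one query (from BM25 or a dense retriever), sets the labeled positives aside, and pairs each positive with the highest-ranked remaining passages as hard negatives. The function name `mine_triplets`, the `negatives_per_positive` parameter, and the toy passages are hypothetical and are not taken from the flashIndorank training code.

```python
from typing import Dict, List, Set, Tuple


def mine_triplets(
    query: str,
    ranked_ids: List[str],
    positive_ids: Set[str],
    corpus: Dict[str, str],
    negatives_per_positive: int = 2,
) -> List[Tuple[str, str, str]]:
    """Build (query, positive, negative) text triplets from one ranked list."""
    # Hard negatives: passages the retriever ranks highly but that are not labeled relevant.
    hard_negatives = [pid for pid in ranked_ids if pid not in positive_ids]
    hard_negatives = hard_negatives[:negatives_per_positive]

    triplets = []
    for pos_id in sorted(positive_ids):
        for neg_id in hard_negatives:
            triplets.append((query, corpus[pos_id], corpus[neg_id]))
    return triplets


# Toy data (hypothetical), standing in for a BM25 or dense top-k list.
corpus = {
    "d1": "Jakarta adalah ibu kota Indonesia.",
    "d2": "Jakarta memiliki banyak pusat perbelanjaan.",
    "d3": "Bandung terkenal dengan kulinernya.",
    "d4": "Harga emas naik tajam.",
}
query = "Apa ibu kota Indonesia?"
triplets = mine_triplets(
    query=query,
    ranked_ids=["d2", "d1", "d3", "d4"],
    positive_ids={"d1"},
    corpus=corpus,
)
for triplet in triplets:
    print(triplet)
```

A highly ranked unlabeled passage can in fact be relevant. Because the teacher scores both passages of every triplet, such a pair would simply receive a small or negative target margin rather than a hard "irrelevant" label.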

	Built as part of [flashIndorank](https://github.com/madebyaris/flashIndorank).

	## Evaluation

	MIRACL-id official retrieve-then-rerank protocol (BM25 top-100 → rerank,
	960 dev queries, `pytrec_eval`):

	| model | params | nDCG@10 | MRR@10 | Recall@100 |
	| --- | --- | --- | --- | --- |
	| `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` (base) | tiny | 0.656 | 0.623 | 0.760 |
	| this model (in-domain distillation) | tiny | 0.701 | 0.677 | 0.760 |
	| `BAAI/bge-reranker-v2-m3` (teacher) | 568M | 0.712 | 0.689 | 0.760 |

	The distilled student improves nDCG@10 by 4.5 points over the base (from 0.656 to
	0.701) and stays within about 1 point of the 568M teacher (0.712), roughly 98% of
	the teacher's ranking quality at a fraction of the size and latency. Recall@100 is
	identical across rows because every reranker reorders the same BM25 top-100
	candidates; it is the first-stage ceiling and bounds all rerankers.
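
To make the two ranking metrics and the recall ceiling concrete, here is a simplified per-query sketch of nDCG@10 and MRR@10. It is a stand-in for illustration, not the `pytrec_eval` implementation used for the table, and the helper names and toy labels are assumptions.

```python
import math
from typing import List


def dcg_at_k(relevances: List[int], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_rels: List[int], all_rels: List[int], k: int = 10) -> float:
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering.

    `all_rels` holds the labels of every judged passage for the query,
    including relevant ones the first stage failed to retrieve.
    """
    ideal = dcg_at_k(sorted(all_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


def mrr_at_k(ranked_rels: List[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, rel in enumerate(ranked_rels[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


# Toy query: two relevant passages exist, but BM25 retrieved only one of them.
# The reranker places that one at rank 1, which is the best it can do.
ranked_rels = [1, 0, 0]
all_rels = [1, 1]
ndcg = ndcg_at_k(ranked_rels, all_rels)
mrr = mrr_at_k(ranked_rels)
print(f"nDCG@10 = {ndcg:.3f}, MRR@10 = {mrr:.3f}")
```

Even with a perfect reordering, nDCG@10 for this toy query stays below 1 because the missing relevant passage still counts in the ideal ranking. That is the sense in which first-stage recall caps the reranked score.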

	### How it compares to hosted commercial rerankers

	An independent cross-system check on 300 MIRACL-id `dev` queries (BM25 top-100
	→ rerank). Every reranker is scored with the same harness, the same BM25
	candidates, and the same metric implementation, so the comparison is
	apples-to-apples. NVIDIA and Cohere were called through the OpenRouter rerank API.

	| reranker | nDCG@10 | MRR@10 | cost / availability |
	| --- | --- | --- | --- |
	| BM25 (no rerank) | 0.393 | 0.330 | — |
	| this model (int8 ONNX, CPU) | 0.655 | 0.633 | free · local · offline |
	| `nvidia/llama-nemotron-rerank-vl-1b-v2` | 0.656 | 0.632 | hosted API |
	| `cohere/rerank-v3.5` | 0.664 | 0.636 | paid API |
	| `cohere/rerank-4-pro` | 0.665 | 0.640 | paid API |

	Takeaways:

	- Effectively tied with NVIDIA's hosted reranker (nDCG@10 0.655 vs 0.656, MRR@10
	0.633 vs 0.632), a gap too small to distinguish on 300 queries, while running
	free and offline on CPU.
	- Within ~0.01 nDCG@10 (~1.5% relative) of `cohere/rerank-4-pro`, the strongest
	Cohere reranker in this comparison.

	> Honesty note: the absolute scores in this comparison are slightly lower than the
	> 0.701 reported above because this is a 300-query slice scored with
	> flashIndorank's own metric harness, not the full 960-query `pytrec_eval` run. The
	> relative standing (≈ NVIDIA, just under Cohere) is the point. A smaller
	> 30-query slice was even noisier and is not a reliable signal — prefer these
	> 300-query (or the full 960) numbers.
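
One way to judge how much a small gap means on a sample of this size is a paired bootstrap over per-query scores. The sketch below is an assumption about how such a check could be done, not part of the flashIndorank harness. The function name `paired_bootstrap_ci` and the synthetic scores are hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    scores_a: List[float],
    scores_b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower, upper) for per-query scores of A minus B."""
    assert len(scores_a) == len(scores_b), "scores must be aligned per query"
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)

    # Resample queries with replacement and record the mean difference each time.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    # Percentile interval at the requested confidence level.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Synthetic per-query nDCG@10 values for two rerankers on the same 300 queries.
rng = random.Random(42)
system_a = [rng.random() for _ in range(300)]
system_b = [min(1.0, max(0.0, s + rng.uniform(-0.2, 0.2))) for s in system_a]

mean_diff, lower, upper = paired_bootstrap_ci(system_a, system_b)
print(f"mean diff = {mean_diff:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

If the interval contains 0, the sample does not separate the two systems. Running this requires per-query scores for both rerankers on the same queries.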

	## Usage

	### sentence-transformers

	```python
	from sentence_transformers import CrossEncoder

	model = CrossEncoder("madebyaris/rerank-indonesia")
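	query = "Bagaimana cara menurunkan berat badan?"  # "How do I lose weight?"
	passages = [
	    # Relevant: regular exercise and a healthy diet help reduce body weight.
	    "Olahraga teratur dan pola makan sehat membantu mengurangi bobot tubuh.",
	    # Irrelevant: global gold prices rose sharply over the past week.
	    "Harga emas global naik tajam dalam sepekan terakhir.",
	]
	# Score each (query, passage) pair; a higher score means more relevant.
	scores = model.predict([[query, p] for p in passages])
	print(scores)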
	```

	### Lightweight ONNX (int8) via flashIndorank

	```python
	from huggingface_hub import snapshot_download
	from flashindorank import CustomReranker
	from flashrank import RerankRequest

	path = snapshot_download("madebyaris/rerank-indonesia", allow_patterns=["onnx/*"])
	ranker = CustomReranker(f"{path}/onnx")
	out = ranker.rerank(RerankRequest(
	    query="Bagaimana cara menurunkan berat badan?",
	    passages=[{"id": 1, "text": "Olahraga teratur dan pola makan sehat membantu mengurangi bobot tubuh."}],
	))
	print(out)
	```

	## Training

	- Student / base: `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1`
	- Teacher: `BAAI/bge-reranker-v2-m3`
	- Method: Margin-MSE knowledge distillation (Hofstätter et al., 2020) —
	`label = teacher(q, pos) - teacher(q, neg)`
	- Data: in-domain Indonesian triplets from TyDi QA + MIRACL-id train,
	BM25 + dense hard negatives
	- Hyperparameters: 3 epochs, learning rate 8e-6, bf16, `MarginMSELoss` via `CrossEncoderTrainer`

	See [TRAINING.md](https://github.com/madebyaris/flashIndorank/blob/main/TRAINING.md).
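
For readers unfamiliar with the objective, the Margin-MSE label formula above can be sketched in plain Python as follows. This is a minimal illustration with made-up scores, not the `MarginMSELoss` implementation used for training. The function name `margin_mse_loss` is hypothetical.

```python
from typing import Sequence


def margin_mse_loss(
    student_pos: Sequence[float],
    student_neg: Sequence[float],
    teacher_pos: Sequence[float],
    teacher_neg: Sequence[float],
) -> float:
    """Mean squared error between student and teacher score margins."""
    squared_errors = []
    for s_pos, s_neg, t_pos, t_neg in zip(
        student_pos, student_neg, teacher_pos, teacher_neg
    ):
        label = t_pos - t_neg        # teacher margin: teacher(q, pos) - teacher(q, neg)
        prediction = s_pos - s_neg   # student margin for the same triplet
        squared_errors.append((prediction - label) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Two toy triplets with made-up scores.
loss = margin_mse_loss(
    student_pos=[2.0, 1.5],
    student_neg=[1.0, 1.0],
    teacher_pos=[3.0, 2.0],
    teacher_neg=[0.0, 1.5],
)
print(loss)  # → 2.0
```

Only the difference between the two scores is supervised, so the student is free to use a different score scale from the teacher as long as the gaps match.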

	## Roadmap — what's next to improve

	The model is already quality-competitive with hosted rerankers; the remaining wins,
	highest-leverage first:

	1. Close the small gap to Cohere (quality). Re-distill for a few more epochs on
	combined data: mMARCO-id (~400k triplets) plus in-domain TyDi/MIRACL-id upsampled
	~3× so the in-domain signal isn't diluted. The goal is to push nDCG@10 past 0.70
	toward the teacher ceiling.
	2. Stronger / ensemble teacher. Distill from a larger teacher (e.g.
	`BAAI/bge-reranker-v2-gemma`) or an ensemble of teacher margins to raise the
	distillation ceiling above the current ~0.712.
	3. Harder negatives. Re-mine negatives with a strong dense retriever (not just
	BM25); cross-encoders learn most from hard negatives.
	4. Lift the real ceiling with better first-stage retrieval. Reranked nDCG on MIRACL
	is capped by first-stage `Recall@100` (~0.71–0.76 here). A better retriever
	(multilingual-e5 or BGE-M3 dense, or hybrid BM25+dense) improves the candidate set
	the reranker sees, which is likely a bigger end-to-end win than any reranker tweak.
	5. Faster CPU serving. The int8 ONNX is quality-ready; latency is the lever.
	Length-sorted mini-batching (cut padding waste), an optional multi-threaded ONNX
	mode, and a lower default `max_length` (256) materially reduce CPU latency and RAM.
	6. Broaden evaluation. Extend the cross-system comparison to the full 960-query
	MIRACL-id dev set and add other domains (e-commerce, news) so the quality claim
	generalizes beyond Wikipedia QA.
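
The length-sorted mini-batching idea from item 5 can be sketched as below: group (query, passage) pairs of similar token length so that each batch needs less padding. This is an illustrative sketch under assumed token counts, not the flashIndorank implementation. `length_sorted_batches` and `padded_tokens` are hypothetical helpers.

```python
from typing import Iterator, List, Sequence


def length_sorted_batches(
    lengths: Sequence[int], batch_size: int
) -> Iterator[List[int]]:
    """Yield batches of original indices, grouped so similar lengths share a batch."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


def padded_tokens(lengths: Sequence[int], batches: List[List[int]]) -> int:
    """Total tokens processed when each batch is padded to its longest member."""
    return sum(max(lengths[i] for i in batch) * len(batch) for batch in batches)


# Hypothetical token counts for six (query, passage) pairs.
lengths = [12, 240, 15, 230, 18, 250]
naive_batches = [[0, 1], [2, 3], [4, 5]]  # arrival order, batch size 2
sorted_batches = list(length_sorted_batches(lengths, batch_size=2))

naive_cost = padded_tokens(lengths, naive_batches)
sorted_cost = padded_tokens(lengths, sorted_batches)
print(naive_cost, sorted_cost)  # → 1440 990
```

Because the batches carry original indices, scores computed per batch can be written back to their original positions before the final sort by relevance.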

	## License

	Apache-2.0, inherited from the base model. TyDi QA and MIRACL are Apache-2.0.