---
language:
- id
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: text-ranking
tags:
- reranker
- cross-encoder
- text-ranking
- indonesian
- bahasa-indonesia
- knowledge-distillation
- flashrank
- onnx
base_model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
datasets:
- google-research-datasets/tydiqa
- miracl/miracl
metrics:
- mrr
- ndcg
---

# rerank-indonesia

A lightweight **Indonesian (Bahasa Indonesia) cross-encoder reranker**, small enough to serve on a cheap CPU VPS yet competitive with a 17× larger model.

It is built by **Margin-MSE knowledge distillation**: a strong multilingual teacher, [`BAAI/bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3) (568M params), supervises the tiny student [`cross-encoder/mmarco-mMiniLMv2-L12-H384-v1`](https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1) on in-domain Indonesian (query, positive, negative) triplets from **TyDi QA** and **MIRACL-id** (with BM25 + dense hard-negative mining). The student learns the teacher's score *margin* between relevant and non-relevant passages.

Built as part of [flashIndorank](https://github.com/madebyaris/flashIndorank).

## Evaluation

**MIRACL-id** official retrieve-then-rerank protocol (BM25 top-100 → rerank, 960 dev queries, `pytrec_eval`):

| model | params | nDCG@10 | MRR@10 | Recall@100 |
| --- | --- | --- | --- | --- |
| `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` (base) | tiny | 0.656 | 0.623 | 0.760 |
| **this model** (in-domain distillation) | **tiny** | **0.701** | **0.677** | 0.760 |
| `BAAI/bge-reranker-v2-m3` (teacher) | 568M | 0.712 | 0.689 | 0.760 |

The distilled student improves nDCG@10 by **+4.5 points** over the base while staying within **~1 point of the 568M teacher**, which is roughly 98% of the teacher's ranking quality at a fraction of the size and latency. (Recall@100 is the BM25 first-stage ceiling and bounds all rerankers.)
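For readers unfamiliar with the two ranking metrics in the table, here is a minimal single-query sketch of MRR@10 and nDCG@10. It assumes binary relevance labels, and the passage IDs are hypothetical. The reported numbers come from `pytrec_eval`, not from this code, so treat it as an illustration of the definitions only.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage in the top k (0.0 if none)."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary gains and a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked_ids[:k], start=1)
        if pid in relevant_ids
    )
    # Ideal ranking: all relevant passages placed first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical reranked list for one query; p2 and p4 are the relevant passages.
ranked = ["p7", "p2", "p9", "p4"]
relevant = {"p2", "p4"}

print(mrr_at_k(ranked, relevant))             # 0.5 (first relevant hit at rank 2)
print(round(ndcg_at_k(ranked, relevant), 3))  # 0.651
```

Corpus-level scores are the mean of these per-query values over all evaluated queries.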
### How it compares to hosted commercial rerankers

An independent cross-system check on **300 MIRACL-id `dev` queries** (BM25 top-100 → rerank). Every reranker is scored with the **same** harness, the **same** BM25 candidates, and the **same** metric implementation, so the comparison is apples-to-apples. NVIDIA and Cohere were called through the OpenRouter rerank API.

| reranker | nDCG@10 | MRR@10 | cost / availability |
| --- | --- | --- | --- |
| BM25 (no rerank) | 0.393 | 0.330 | n/a |
| **this model** (int8 ONNX, CPU) | 0.655 | 0.633 | **free · local · offline** |
| `nvidia/llama-nemotron-rerank-vl-1b-v2` | 0.656 | 0.632 | hosted API |
| `cohere/rerank-v3.5` | 0.664 | 0.636 | paid API |
| `cohere/rerank-4-pro` | 0.665 | 0.640 | paid API |

Takeaways:

- **Effectively tied with NVIDIA's hosted reranker** (nDCG@10 0.655 vs 0.656; it is marginally *ahead* on MRR@10, 0.633 vs 0.632), while running free and offline on CPU. No significance test was run, so treat differences of this size as noise.
- Within **~0.01 nDCG (~1.5%)** of Cohere's strongest commercial reranker.

> Honesty note: the absolute scores in this comparison are slightly lower than the
> **0.701** reported above because this is a 300-query slice scored with
> flashIndorank's own metric harness, not the full 960-query `pytrec_eval` run. The
> **relative** standing (≈ NVIDIA, just under Cohere) is the point. A smaller
> 30-query slice was even noisier and is not a reliable signal; prefer these
> 300-query (or the full 960) numbers.

## Usage

### sentence-transformers

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("madebyaris/rerank-indonesia")

# Query: "How do I lose weight?"
query = "Bagaimana cara menurunkan berat badan?"
passages = [
    # Relevant: "Regular exercise and a healthy diet help reduce body weight."
    "Olahraga teratur dan pola makan sehat membantu mengurangi bobot tubuh.",
    # Not relevant: "Global gold prices rose sharply over the past week."
    "Harga emas global naik tajam dalam sepekan terakhir.",
]

# One relevance score per (query, passage) pair; higher means more relevant.
scores = model.predict([[query, p] for p in passages])
print(scores)
```

### Lightweight ONNX (int8) via flashIndorank

```python
from huggingface_hub import snapshot_download
from flashindorank import CustomReranker
from flashrank import RerankRequest

# Download only the ONNX files from the model repository.
path = snapshot_download("madebyaris/rerank-indonesia", allow_patterns=["onnx/*"])
ranker = CustomReranker(f"{path}/onnx")

out = ranker.rerank(RerankRequest(
    query="Bagaimana cara menurunkan berat badan?",
    passages=[{"id": 1, "text": "Olahraga teratur dan pola makan sehat membantu mengurangi bobot tubuh."}],
))
print(out)
```

## Training

- Student / base: `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1`
- Teacher: `BAAI/bge-reranker-v2-m3`
- Method: Margin-MSE knowledge distillation (Hofstätter et al., 2020), with `label = teacher(q, pos) - teacher(q, neg)`
- Data: in-domain Indonesian triplets from TyDi QA + MIRACL-id train, BM25 + dense hard negatives
- Training setup: 3 epochs, lr 8e-6, bf16, `MarginMSELoss` (`CrossEncoderTrainer`)

See [TRAINING.md](https://github.com/madebyaris/flashIndorank/blob/main/TRAINING.md).

## Roadmap: what's next to improve

The model is already quality-competitive with hosted rerankers; the remaining wins, highest-leverage first:

1. **Close the small gap to Cohere (quality).** Re-distill for a few more epochs on **combined** data: mMARCO-id (~400k triplets) plus in-domain TyDi/MIRACL-id, upsampled ~3× so the in-domain signal isn't diluted. The target is to push nDCG@10 past 0.70 toward the teacher ceiling.
2. **Stronger / ensemble teacher.** Distill from a larger teacher (e.g. `BAAI/bge-reranker-v2-gemma`) or an ensemble of teacher margins to raise the distillation ceiling above the current ~0.712.
3. **Harder negatives.** Re-mine negatives with a strong *dense* retriever (not just BM25); cross-encoders learn most from hard negatives.
4. **Lift the real ceiling with better first-stage retrieval.** MIRACL nDCG is capped by `Recall@100` (~0.71–0.76 here). A better retriever (multilingual-e5 / BGE-M3 dense, or hybrid BM25+dense) improves the candidate pool the reranker sees, which is likely a bigger end-to-end win than any reranker tweak.
5. **Faster CPU serving.** The int8 ONNX is quality-ready; latency is the lever. Length-sorted mini-batching (to cut padding waste), an optional multi-threaded ONNX mode, and a lower default `max_length` (256) materially reduce CPU latency and RAM.
6. **Broaden evaluation.** Report the full 960-query MIRACL-id run and add other domains (e-commerce, news) so the quality claim generalizes beyond Wikipedia QA.

## License

Apache-2.0, inherited from the base model. TyDi QA and MIRACL are Apache-2.0.
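
## Appendix: the Margin-MSE objective in a few lines

The Training section gives the distillation target as `label = teacher(q, pos) - teacher(q, neg)`. The sketch below shows how that target turns into a loss: the student's own margin is regressed onto the teacher's margin with mean squared error. It is a dependency-free illustration with made-up scores, and the function name `margin_mse` is hypothetical. It is not the project's training code, which per this card uses `MarginMSELoss` with `CrossEncoderTrainer`.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list of scores, one entry per (query, pos, neg) triplet.
    """
    n = len(student_pos)
    if n == 0 or not (n == len(student_neg) == len(teacher_pos) == len(teacher_neg)):
        raise ValueError("all score lists must be non-empty and of equal length")

    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        label = tp - tn   # teacher margin: teacher(q, pos) - teacher(q, neg)
        pred = sp - sn    # student margin on the same triplet
        total += (pred - label) ** 2
    return total / n


# Made-up scores for two triplets (illustrative only).
loss = margin_mse(
    student_pos=[2.0, 0.5],
    student_neg=[1.0, 0.0],
    teacher_pos=[4.0, 1.5],
    teacher_neg=[1.0, 1.0],
)
print(loss)  # 2.0: margins differ by 2.0 on the first triplet and match on the second
```

Because only the difference between scores is matched, the student is free to use a different score scale from the teacher as long as it reproduces how far apart relevant and non-relevant passages should be.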