rerank-indonesia

A lightweight Indonesian (Bahasa Indonesia) cross-encoder reranker, small enough to serve on a cheap CPU VPS yet competitive with a 17× larger model.

It is built by Margin-MSE knowledge distillation: a strong multilingual teacher, BAAI/bge-reranker-v2-m3 (568M params), supervises the tiny student cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 on in-domain Indonesian (query, positive, negative) triplets from TyDi QA and MIRACL-id (with BM25 + dense hard-negative mining). The student learns the teacher's score margin between relevant and non-relevant passages.
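The margin objective can be illustrated in a few lines of plain Python. This is a minimal sketch rather than the training code: the function name `margin_mse` and the example scores are hypothetical, and the actual run uses batched tensors.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list of floats, one entry per (query, pos, neg) triplet.
    """
    lengths = {len(student_pos), len(student_neg), len(teacher_pos), len(teacher_neg)}
    if len(lengths) != 1 or not student_pos:
        raise ValueError("need four non-empty score lists of equal length")

    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        student_margin = sp - sn
        teacher_margin = tp - tn  # the distillation label for this triplet
        total += (student_margin - teacher_margin) ** 2
    return total / len(student_pos)


# Hypothetical scores for two triplets (illustrative values only).
loss = margin_mse(
    student_pos=[2.0, 1.0],
    student_neg=[0.5, 1.5],
    teacher_pos=[4.0, 3.0],
    teacher_neg=[1.0, 0.0],
)
print(loss)  # 7.25
```

Only the difference between the positive and negative score enters the loss, so the student does not need to reproduce the teacher's absolute score scale, just its ordering and separation.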

Built as part of flashIndorank.

Evaluation

MIRACL-id official retrieve-then-rerank protocol (BM25 top-100 → rerank, 960 dev queries, pytrec_eval):

| model | params | nDCG@10 | MRR@10 | Recall@100 |
|---|---|---|---|---|
| cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 (base) | tiny | 0.656 | 0.623 | 0.760 |
| this model (in-domain distillation) | tiny | 0.701 | 0.677 | 0.760 |
| BAAI/bge-reranker-v2-m3 (teacher) | 568M | 0.712 | 0.689 | 0.760 |

The distilled student improves nDCG@10 by 4.5 points over the base model (0.656 → 0.701) and lands about 1 point below the 568M teacher (0.712), which is roughly 98% of the teacher's nDCG@10 at a fraction of the size and latency. Recall@100 is identical in all three rows because reranking only reorders the same BM25 top-100 candidates; it is the first-stage ceiling that bounds every reranker.
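For readers unfamiliar with the three metrics, here is a minimal pure-Python sketch of how they are computed for a single query, assuming binary relevance labels. The function names and document ids are hypothetical; the reported numbers come from pytrec_eval, not from this code.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary relevance (gain 1 if the id is relevant, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking: every relevant passage placed at the top.
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of all relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in set(ranked_ids[:k]) if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical query: "d9" is relevant but was never retrieved by the first stage.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(ndcg_at_k(ranked, relevant))
print(mrr_at_k(ranked, relevant))  # 0.5, first relevant passage is at rank 2
print(recall_at_k(ranked, relevant))
print(recall_at_k(list(reversed(ranked)), relevant))  # same set, same recall
```

The missing passage still counts in the ideal ranking, so a first-stage miss lowers nDCG no matter how well the reranker orders the candidates it does receive.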

How it compares to hosted commercial rerankers

An independent cross-system check on 300 MIRACL-id dev queries (BM25 top-100 → rerank). Every reranker is scored with the same harness, the same BM25 candidates, and the same metric implementation, so the comparison is apples-to-apples. NVIDIA and Cohere were called through the OpenRouter rerank API.

| reranker | nDCG@10 | MRR@10 | cost / availability |
|---|---|---|---|
| BM25 (no rerank) | 0.393 | 0.330 | |
| this model (int8 ONNX, CPU) | 0.655 | 0.633 | free · local · offline |
| nvidia/llama-nemotron-rerank-vl-1b-v2 | 0.656 | 0.632 | hosted API |
| cohere/rerank-v3.5 | 0.664 | 0.636 | paid API |
| cohere/rerank-4-pro | 0.665 | 0.640 | paid API |

Takeaways:

  • Effectively tied with NVIDIA's hosted reranker (nDCG@10 0.655 vs 0.656, and marginally ahead on MRR@10 at 0.633 vs 0.632), while running free and offline on CPU.
  • Within 0.01 nDCG@10 (1.5%) of Cohere's strongest commercial reranker.

Honesty note: the absolute scores in this comparison are lower than the 0.701 reported above (0.655 nDCG@10 for this model) because this is a 300-query slice scored with flashIndorank's own metric harness, not the full 960-query pytrec_eval run. What matters here is the relative standing: on par with NVIDIA, just under Cohere. A smaller 30-query slice was even noisier and is not a reliable signal; prefer these 300-query numbers or the full 960-query run.
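One way to judge whether a small gap on a query slice is meaningful is a paired bootstrap over per-query scores. The sketch below is an assumption about how such a check could be done, not something this comparison reports; `paired_bootstrap_ci` and the synthetic scores are hypothetical.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_a) - mean(scores_b).

    scores_a and scores_b hold per-query metric values (e.g. nDCG@10) for two
    rerankers on the same queries, in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    # Resample queries with replacement and record the mean difference each time.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi


# Synthetic per-query scores for 300 queries (illustrative, not real results).
rng = random.Random(42)
system_a = [rng.random() for _ in range(300)]
system_b = [min(1.0, max(0.0, a + rng.uniform(-0.2, 0.2))) for a in system_a]

mean_diff, lo, hi = paired_bootstrap_ci(system_a, system_b)
print(f"mean diff {mean_diff:+.4f}, 95% interval [{lo:+.4f}, {hi:+.4f}]")
```

If the interval contains 0, the slice does not separate the two systems. Smaller slices produce wider intervals, which is why a 30-query run is a weak signal.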

Usage

sentence-transformers

from sentence_transformers import CrossEncoder

model = CrossEncoder("madebyaris/rerank-indonesia")
query = "Bagaimana cara menurunkan berat badan?"
passages = [
    "Olahraga teratur dan pola makan sehat membantu mengurangi bobot tubuh.",
    "Harga emas global naik tajam dalam sepekan terakhir.",
]
# predict() returns one relevance score per (query, passage) pair; higher means more relevant.
scores = model.predict([[query, p] for p in passages])
print(scores)
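The scores come back in input order, so a reranking step still has to sort the passages. A minimal sketch, assuming `scores` is any sequence of numbers aligned with `passages`; the helper name `rank_by_score` and the example scores are hypothetical.

```python
def rank_by_score(passages, scores, top_k=None):
    """Return (passage, score) pairs sorted from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    ranked = sorted(
        zip(passages, (float(s) for s in scores)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical scores standing in for the output of model.predict(...).
example_passages = ["passage A", "passage B", "passage C"]
example_scores = [0.1, 0.9, 0.5]
for passage, score in rank_by_score(example_passages, example_scores, top_k=2):
    print(f"{score:.2f}  {passage}")
```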

Lightweight ONNX (int8) via flashIndorank

from huggingface_hub import snapshot_download
from flashindorank import CustomReranker
from flashrank import RerankRequest

path = snapshot_download("madebyaris/rerank-indonesia", allow_patterns=["onnx/*"])
ranker = CustomReranker(f"{path}/onnx")
out = ranker.rerank(RerankRequest(
    query="Bagaimana cara menurunkan berat badan?",
    passages=[{"id": 1, "text": "Olahraga teratur dan pola makan sehat membantu mengurangi bobot tubuh."}],
))
print(out)

Training

  • Student / base: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
  • Teacher: BAAI/bge-reranker-v2-m3
  • Method: Margin-MSE knowledge distillation (Hofstätter et al., 2020) — label = teacher(q, pos) - teacher(q, neg)
  • Data: in-domain Indonesian triplets from TyDi QA + MIRACL-id train, BM25 + dense hard negatives
  • Hyperparameters: 3 epochs, lr 8e-6, bf16, MarginMSELoss (CrossEncoderTrainer)
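The data step can be pictured as follows: take the top-ranked passages from each retriever that are not labeled positive and pair them with the positives. This is a sketch under assumptions, since the list above does not say how the BM25 and dense candidates were combined; the alternating merge, the function names, and the passage ids are all hypothetical.

```python
def mine_hard_negatives(bm25_ranked, dense_ranked, positive_ids, n_negatives=4):
    """Pick top-ranked non-positive passage ids, alternating between two rankings."""
    negatives = []
    seen = set(positive_ids)  # never return a labeled positive or a duplicate
    for i in range(max(len(bm25_ranked), len(dense_ranked))):
        for ranking in (bm25_ranked, dense_ranked):
            if i < len(ranking) and ranking[i] not in seen:
                seen.add(ranking[i])
                negatives.append(ranking[i])
                if len(negatives) == n_negatives:
                    return negatives
    return negatives


def build_triplets(query, positive_ids, negative_ids):
    """Cross every positive with every mined negative."""
    return [(query, pos, neg) for pos in positive_ids for neg in negative_ids]


# Hypothetical retriever outputs for one query.
bm25 = ["p7", "p1", "p4", "p9"]
dense = ["p1", "p3", "p7", "p8"]
positives = ["p1"]

negatives = mine_hard_negatives(bm25, dense, positives, n_negatives=3)
print(negatives)  # ['p7', 'p3', 'p4']
print(build_triplets("contoh kueri", positives, negatives))
```

Highly ranked non-positives are sometimes unlabeled relevant passages. With margin distillation the teacher tends to assign such pairs a small margin, which softens the effect of these false negatives compared with hard 0/1 labels.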

See TRAINING.md.

Roadmap — what's next to improve

The model is already quality-competitive with hosted rerankers; the remaining wins, highest-leverage first:

  1. Close the small gap to Cohere (quality). Re-distill for a few more epochs on combined data: mMARCO-id (~400k triplets) plus the in-domain TyDi/MIRACL-id triplets, upsampled ~3× so the in-domain signal isn't diluted. The goal is to push nDCG@10 past 0.70, toward the teacher ceiling.
  2. Stronger / ensemble teacher. Distill from a larger teacher (e.g. BAAI/bge-reranker-v2-gemma) or an ensemble of teacher margins to raise the distillation ceiling above the current ~0.712.
  3. Harder negatives. Re-mine negatives with a strong dense retriever (not just BM25); cross-encoders learn most from hard negatives.
  4. Lift the real ceiling with better first-stage retrieval. Reranking can only reorder what the first stage returns, so MIRACL nDCG is capped by Recall@100 (~0.71–0.76 here). A better retriever (multilingual-e5 or BGE-M3 dense, or hybrid BM25+dense) improves the candidate set the reranker sees, which is likely a bigger end-to-end win than any reranker tweak.
  5. Faster CPU serving. The int8 ONNX is quality-ready; latency is the lever. Length-sorted mini-batching (cut padding waste), an optional multi-threaded ONNX mode, and a lower default max_length (256) materially reduce CPU latency and RAM.
  6. Broaden evaluation. Rerun the cross-system comparison on the full 960-query MIRACL-id set and add other domains (e-commerce, news) so the quality claim generalizes beyond Wikipedia QA.
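The length-sorted mini-batching idea from item 5 can be sketched without any model code. The example assumes token lengths are already known for each (query, passage) pair; the function names and lengths are hypothetical, and this is not flashIndorank's implementation.

```python
def length_sorted_batches(token_lengths, batch_size):
    """Group item indices into batches of similar length.

    Returns a list of batches; each batch is a list of indices into the input,
    so scores can be written back to their original positions afterwards.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    order = sorted(range(len(token_lengths)), key=lambda i: token_lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padded_tokens(token_lengths, batches):
    """Total tokens processed when every batch is padded to its longest item."""
    return sum(max(token_lengths[i] for i in batch) * len(batch) for batch in batches)


# Hypothetical token lengths for six (query, passage) pairs.
lengths = [12, 240, 15, 230, 18, 250]

arrival_order = [[0, 1, 2], [3, 4, 5]]
sorted_order = length_sorted_batches(lengths, batch_size=3)

print(padded_tokens(lengths, arrival_order))  # 1470
print(padded_tokens(lengths, sorted_order))   # 804
print(sum(lengths))                           # 765 real tokens
```

In this toy case sorting cuts the padded token count from 1470 to 804 against 765 real tokens. The actual saving depends on how varied the passage lengths are.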

License

Apache-2.0, inherited from the base model. TyDi QA and MIRACL are Apache-2.0.

Model size: 0.1B params (Safetensors, tensor type F32)