bangla-embed-e5-small (118M)

A 118M-parameter Bangla/English text embedding model, re-based on intfloat/multilingual-e5-small and trained in three stages (cross-lingual distillation with a retrieval-retention anchor → supervised contrastive with gold hard-negatives → NLI polish). It targets Bangla retrieval as the headline use case.

Requires e5-style prompts. Use prompt_name="query" for queries and prompt_name="passage" for documents — accuracy drops sharply without them.
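
For toolchains that do not support prompt_name, the same effect can presumably be obtained by prepending the prefix text by hand. The sketch below assumes the prompts resolve to the upstream e5 strings "query: " and "passage: " (with a trailing space), which should be checked against the model's own prompt configuration. The helper name with_e5_prefix is hypothetical and not part of any library.

```python
# Assumed prefix strings, following the upstream e5 convention; verify them
# against the model's prompt configuration before relying on this.
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}


def with_e5_prefix(texts, kind):
    """Return copies of `texts` with the e5 prefix for `kind` prepended."""
    if kind not in E5_PREFIXES:
        raise ValueError(f"kind must be one of {sorted(E5_PREFIXES)}, got {kind!r}")
    prefix = E5_PREFIXES[kind]
    # Leave text untouched if it already carries the prefix.
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = with_e5_prefix(["ঢাকা কোথায়?"], "query")
passages = with_e5_prefix(["ঢাকা বাংলাদেশের রাজধানী।"], "passage")
print(queries[0])
print(passages[0])
```

Strings prefixed this way would be encoded without prompt_name, so that the prefix is not applied twice.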

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kazalbrur/bangla-embed-e5-small")
q = model.encode(["ঢাকা কোথায়?"], prompt_name="query", normalize_embeddings=True)
d = model.encode(["ঢাকা বাংলাদেশের রাজধানী।"], prompt_name="passage", normalize_embeddings=True)
print(q @ d.T)  # cosine similarity
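
With more than one candidate document, the same dot product yields a ranking. The sketch below is plain Python and assumes the embeddings are L2-normalized as above, so that the dot product equals cosine similarity. rank_passages is a hypothetical helper, and the toy vectors stand in for real model output.

```python
def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (index, score) pairs for the top_k passages, best first.

    Assumes all vectors are L2-normalized, so dot product == cosine similarity.
    """
    scores = [
        sum(q * p for q, p in zip(query_vec, vec))
        for vec in passage_vecs
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy 2-d unit vectors in place of real embeddings.
query = [1.0, 0.0]
passages = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
ranking = rank_passages(query, passages, top_k=2)
print(ranking)
```

With the real model, query_vec would be q[0] and passage_vecs the rows of d from the snippet above.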

Evaluation

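All numbers below are measured under one harness against strong multilingual / Bangla baselines; rows for this model are marked "(ours)". Among models with ≤120M parameters it has the best MTEB mean score, though not the best score on every task: LaBSE and the e5-small base score higher on the bitext tasks, and LaBSE on STS.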

MTEB(Indic, v1) — Bengali subset (official mteb, 7 tasks, main score)

| Model | Params | Retrieval | Classif. | Bitext-Gen | Bitext-Conv | Rerank | STS | Cluster | Mean |
|---|---|---|---|---|---|---|---|---|---|
| BAAI/bge-m3 | 568M | 0.644 | 0.879 | 0.874 | 0.722 | 0.852 | 0.593 | 0.340 | 0.700 |
| multilingual-e5-large | 560M | 0.631 | 0.847 | 0.876 | 0.748 | 0.852 | 0.540 | 0.339 | 0.690 |
| bangla-embed-e5-small (ours) | 118M | 0.572 | 0.848 | 0.832 | 0.668 | 0.840 | 0.554 | 0.349 | 0.666 |
| multilingual-e5-small (base) | 118M | 0.535 | 0.832 | 0.848 | 0.699 | 0.835 | 0.538 | 0.309 | 0.656 |
| LaBSE | 109M | 0.442 | 0.804 | 0.849 | 0.705 | 0.791 | 0.583 | 0.239 | 0.631 |
| paraphrase-mpnet-base-v2 | 278M | 0.337 | 0.749 | 0.618 | 0.426 | 0.701 | 0.355 | 0.370 | 0.508 |

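Third of six by mean score, behind only bge-m3 and multilingual-e5-large (both about 4.8× larger), and first among models ≤120M. It improves over its e5-small base on 5 of 7 tasks (Retrieval +0.037, Clustering +0.040, Classification and STS +0.016 each, Reranking +0.005) and regresses on the two bitext tasks (Bitext-Gen -0.016, Bitext-Conv -0.031).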
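
The summary figures can be rechecked from the table rows. The sketch below copies the two 118M rows and recomputes the mean and the per-task deltas. Because the table values are already rounded to three decimals, a recomputed mean may differ from the printed one in the last digit.

```python
TASKS = ["Retrieval", "Classif.", "Bitext-Gen", "Bitext-Conv", "Rerank", "STS", "Cluster"]
ours = [0.572, 0.848, 0.832, 0.668, 0.840, 0.554, 0.349]  # bangla-embed-e5-small
base = [0.535, 0.832, 0.848, 0.699, 0.835, 0.538, 0.309]  # multilingual-e5-small

# Unweighted mean over the seven task scores.
mean_ours = sum(ours) / len(ours)

# Per-task difference (ours minus base), rounded to the table's precision.
deltas = {t: round(o - b, 3) for t, o, b in zip(TASKS, ours, base)}
improved = [t for t, d in deltas.items() if d > 0]

print(f"mean (ours): {mean_ours:.3f}")
for task, delta in deltas.items():
    print(f"{task:12s} {delta:+.3f}")
print(f"{len(improved)} of {len(TASKS)} tasks improved")
```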

Cross-lingual bn↔en (bitext mining)

| Model | Params | FLORES R@1 | Tatoeba acc@1 |
|---|---|---|---|
| LaBSE | 109M | 1.000 | 0.915 |
| multilingual-e5-large | 560M | 0.999 | 0.900 |
| BAAI/bge-m3 | 568M | 0.999 | 0.882 |
| bangla-embed-e5-small (ours) | 118M | 0.999 | 0.877 |
| multilingual-e5-small (base) | 118M | 0.997 | 0.875 |
| paraphrase-mpnet-base-v2 | 278M | 0.930 | 0.732 |

Within 0.005 of bge-m3 on Tatoeba at roughly one fifth of the parameters; FLORES is saturated (all but one model score ≥0.997). LaBSE, a bitext specialist, leads both columns.
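
Both columns are top-1 matching scores: each source sentence is matched to its nearest target sentence and counted as correct when that target is its true translation. The sketch below illustrates the computation, assuming normalized embeddings in which row i on each side forms a translation pair. The harness behind the table may differ in details such as search direction or tie handling, and top1_accuracy is a hypothetical name.

```python
def top1_accuracy(src_vecs, tgt_vecs):
    """Fraction of source rows whose nearest target row has the same index."""
    hits = 0
    for i, s in enumerate(src_vecs):
        # Dot product against every target; equals cosine for unit vectors.
        scores = [sum(a * b for a, b in zip(s, t)) for t in tgt_vecs]
        best = max(range(len(scores)), key=lambda j: scores[j])
        hits += int(best == i)
    return hits / len(src_vecs)


# Toy unit vectors: the third source sentence is matched to the wrong target.
src = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
tgt = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(top1_accuracy(src, tgt))
```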

Bangla retrieval (this model)

| Benchmark | Metric | Score |
|---|---|---|
| NanoBEIR-bn (zero-shot) | nDCG@10 | 0.454 |
| MIRACL-bn | nDCG@10 / R@100 | 0.614 / 0.927 |
| Mr.TyDi-bn | nDCG@10 / R@100 | 0.614 / 0.905 |
| IndicMSMARCO / MSMARCO-bn | R@10 | 0.907 / 0.881 |
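
For reference, the two metric families in the table can be sketched as follows for a single query with binary relevance labels. This is a textbook formulation rather than the harness's code, and the function names are hypothetical. Reported scores are averages over all queries of a benchmark.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    found = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return found / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3))
print(ndcg_at_k(ranked, relevant, k=10))
```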

Training

  • Base: intfloat/multilingual-e5-small (12L / H384 / vocab 250k).
  • Stage 1: two-view cross-lingual distillation from BAAI/bge-m3 (matryoshka MSE) + InfoNCE alignment + a functional self-distillation anchor to the frozen e5 base that preserves e5's retrieval pretraining.
  • Stage 2: supervised contrastive (MNR) with gold + GPU-mined hard negatives, e5 query: / passage: prefixes, max sequence length 256.
  • Stage 3: NLI triplet polish.
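
The InfoNCE term in Stage 1 and the MNR objective in Stage 2 share one form: each anchor's paired positive is scored against every other candidate in the batch, and the loss is the cross-entropy of selecting the positive. The sketch below is a framework-free illustration of that form, not the training code. It assumes cosine similarity and a scale of 20.0 (the value used in training is not stated), and the function names are hypothetical. Hard negatives are appended as extra candidates.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, hard_negatives=(), scale=20.0):
    """Mean cross-entropy of selecting positives[i] for anchors[i].

    Candidates are all positives in the batch plus any hard negatives, so the
    positives of other anchors act as in-batch negatives.
    `scale` is an assumed value, not taken from the model card.
    """
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negs = [[0.7, 0.7]]
loss_plain = in_batch_contrastive_loss(anchors, positives)
loss_hard = in_batch_contrastive_loss(anchors, positives, hard_negs)
print(loss_plain, loss_hard)
```

Hard negatives enter only through the candidate list, so for the same anchor-positive pairs a harder negative raises the loss and gives the model more to learn from.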

Limitations

  • STS and bitext mining trail bitext specialists (e.g. LaBSE) — this is a retrieval / cross-lingual model, not a graded-similarity model.
  • No transliteration ("Banglish") coverage — romanized Bangla input is out of domain.
  • Requires the query: / passage: prompts.

License & data

Released under apache-2.0. Built on multilingual-e5-small (MIT) and distilled from bge-m3 (MIT); trained on public Bangla / parallel corpora. Users should confirm downstream data-license compatibility for their use case.
