bangla-embed-e5-small (118M)

A 118M-parameter Bangla/English text embedding model, re-based on intfloat/multilingual-e5-small and trained in three stages (cross-lingual distillation with a retrieval-retention anchor → supervised contrastive with gold hard-negatives → NLI polish). It targets Bangla retrieval as the headline use case.

Requires e5-style prompts. Use prompt_name="query" for queries and prompt_name="passage" for documents — accuracy drops sharply without them.

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kazalbrur/bangla-embed-e5-small")
q = model.encode(["ঢাকা কোথায়?"], prompt_name="query", normalize_embeddings=True)
d = model.encode(["ঢাকা বাংলাদেশের রাজধানী।"], prompt_name="passage", normalize_embeddings=True)
print(q @ d.T)  # cosine similarity
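When there are several candidate passages, the same similarity can be used to rank them. The following is a minimal sketch in plain Python, using toy 3-dimensional vectors in place of real model output; the helper names `cosine` and `top_k` are illustrative and are not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, passage_vecs, k=3):
    """Return (index, score) pairs for the k most similar passages."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors (assumption): stand-ins for one query and three passage embeddings.
query = [0.9, 0.1, 0.0]
passages = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.7, 0.7, 0.0]]

ranking = top_k(query, passages, k=2)
print(ranking)  # passage indices ordered from most to least similar
```

With real embeddings, `q[0]` and the rows of `d` from the snippet above would take the place of the toy vectors; because they are already normalized, the dot product alone gives the same ordering.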

Evaluation

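All numbers below were measured under a single harness against strong multilingual and Bangla baselines; this model's rows are marked "(ours)". By mean MTEB score it is the best model at ≤120M parameters, although LaBSE and the e5-small base still lead on individual bitext and STS columns.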

MTEB(Indic, v1) — Bengali subset (official `mteb`, 7 tasks, main score)

Model	Params	Retrieval	Classif.	Bitext-Gen	Bitext-Conv	Rerank	STS	Cluster	Mean
BAAI/bge-m3	568M	0.644	0.879	0.874	0.722	0.852	0.593	0.340	0.700
multilingual-e5-large	560M	0.631	0.847	0.876	0.748	0.852	0.540	0.339	0.690
bangla-embed-e5-small (ours)	118M	0.572	0.848	0.832	0.668	0.840	0.554	0.349	0.666
multilingual-e5-small (base)	118M	0.535	0.832	0.848	0.699	0.835	0.538	0.309	0.656
LaBSE	109M	0.442	0.804	0.849	0.705	0.791	0.583	0.239	0.631
paraphrase-mpnet-base-v2	278M	0.337	0.749	0.618	0.426	0.701	0.355	0.370	0.508

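3rd of 6 by mean score, behind only bge-m3 and multilingual-e5-large (both ~4.8× larger), and #1 among models ≤120M. It improves over its e5-small base on 5/7 tasks (Retrieval +0.037, Clustering +0.040, Classif. and STS +0.016 each, Rerank +0.005) and regresses on the two bitext tasks (Bitext-Gen −0.016, Bitext-Conv −0.031).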
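The per-task deltas against the e5-small base can be re-derived from the two 118M rows of the table. A small sketch that copies those rows verbatim and recomputes the differences and the mean (variable names are illustrative):

```python
TASKS = ["Retrieval", "Classif.", "Bitext-Gen", "Bitext-Conv", "Rerank", "STS", "Cluster"]

# Scores copied from the MTEB(Indic, v1) Bengali table.
ours = [0.572, 0.848, 0.832, 0.668, 0.840, 0.554, 0.349]
base = [0.535, 0.832, 0.848, 0.699, 0.835, 0.538, 0.309]

# Per-task difference, rounded to the table's three decimals.
deltas = {task: round(o - b, 3) for task, o, b in zip(TASKS, ours, base)}
improved = [task for task, delta in deltas.items() if delta > 0]
mean_ours = round(sum(ours) / len(ours), 3)

for task, delta in deltas.items():
    print(f"{task:12s} {delta:+.3f}")
print(f"improved on {len(improved)}/{len(TASKS)} tasks, mean = {mean_ours}")
```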

Cross-lingual bn↔en (bitext mining)

Model	Params	FLORES R@1	Tatoeba acc@1
LaBSE	109M	1.000	0.915
multilingual-e5-large	560M	0.999	0.900
BAAI/bge-m3	568M	0.999	0.882
bangla-embed-e5-small (ours)	118M	0.999	0.877
multilingual-e5-small (base)	118M	0.997	0.875
paraphrase-mpnet-base-v2	278M	0.930	0.732

Comes within 0.005 of bge-m3 on Tatoeba (0.877 vs 0.882) at ~⅕ the parameters; FLORES is saturated. LaBSE (a bitext specialist) leads on both.
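For readers unfamiliar with bitext-mining metrics, top-1 accuracy over aligned sentence pairs can be sketched as follows. This assumes a square similarity matrix in which the gold translation of source sentence `i` is target sentence `i`; the actual FLORES and Tatoeba evaluation code may differ in details such as scoring direction or tie handling.

```python
def accuracy_at_1(sim):
    """Fraction of source sentences whose most similar target is the gold one.

    sim[i][j] is the similarity between source i and target j;
    the gold target for source i is assumed to be j == i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)

# Toy similarity matrix (assumption): the third source sentence is mismatched.
toy_sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.6, 0.5, 0.4],
]
acc = accuracy_at_1(toy_sim)
print(round(acc, 3))
```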

Bangla retrieval (this model)

Benchmark	Metric	Score
NanoBEIR-bn (zero-shot)	nDCG@10	0.454
MIRACL-bn	nDCG@10 / R@100	0.614 / 0.927
Mr.TyDi-bn	nDCG@10 / R@100	0.614 / 0.905
IndicMSMARCO / MSMARCO-bn	R@10	0.907 / 0.881
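The retrieval metrics in this table follow standard definitions. A minimal sketch of nDCG@k (linear gain, log2 discount) and Recall@k for a single query is shown below; the function names and the toy ranking are illustrative, and the harness used for the reported numbers may handle edge cases differently.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevant, k=10):
    """relevant maps doc_id -> graded relevance (> 0) for one query."""
    gains = [relevant.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevant.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant, k=100):
    """Share of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    found = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return found / len(relevant)

# Toy example (assumption): two relevant documents, retrieved at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
qrels = {"d1": 1, "d2": 1}

ndcg = ndcg_at_k(ranked, qrels, k=10)
print(round(ndcg, 4), recall_at_k(ranked, qrels, k=2))
```

Benchmark scores are the average of these per-query values over all queries.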

Training

Base: intfloat/multilingual-e5-small (12L / H384 / vocab 250k).
Stage 1: two-view cross-lingual distillation from BAAI/bge-m3 (matryoshka MSE), combined with InfoNCE alignment and a functional self-distillation anchor to the frozen e5 base that preserves e5's retrieval pretraining.
Stage 2: supervised contrastive training (multiple negatives ranking, MNR) with gold + GPU-mined hard negatives, e5 query:/passage: prefixes, max sequence length 256.
Stage 3: NLI triplet polish.
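To make the Stage 2 objective concrete, here is a dependency-free sketch of a multiple-negatives-ranking loss with optional hard negatives. It is not the training code used for this model: the `scale` value of 20.0 and the helper names are assumptions for illustration, and inputs are assumed to be L2-normalized so that the dot product equals cosine similarity.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mnr_loss(queries, positives, hard_negatives=(), scale=20.0):
    """Mean cross-entropy of each query against all candidate passages.

    Candidate i is the gold passage for query i; every other positive in the
    batch and every hard negative acts as a negative for that query.
    """
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * dot(q, c) for c in candidates]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)

# Toy batch (assumption): two orthogonal unit vectors.
q_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(q_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped = mnr_loss(q_batch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)  # near zero when pairs match, large when they do not
```

Adding a hard negative that is close to a query raises that query's loss, which is what pushes the model to separate near-miss passages from the gold one.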

Limitations

STS and bitext mining trail bitext specialists (e.g. LaBSE) — this is a retrieval / cross-lingual model, not a graded-similarity model.
No transliteration ("Banglish") coverage — romanized Bangla input is out of domain.
Requires the query: / passage: prompts.
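When embedding through a stack that has no `prompt_name` argument, the prefixes can be added to the text manually. The sketch below assumes the model's prompts are the literal e5 strings `"query: "` and `"passage: "`; the model card does not show the exact strings, so check the model's prompt configuration before relying on this. `with_prefix` is an illustrative helper, not a library function.

```python
# Assumed prefix strings, following the e5 convention.
PREFIXES = {"query": "query: ", "passage": "passage: "}

def with_prefix(texts, kind):
    """Prepend the e5-style prefix for the given kind to each text."""
    if kind not in PREFIXES:
        raise ValueError(f"kind must be one of {sorted(PREFIXES)}, got {kind!r}")
    return [PREFIXES[kind] + text for text in texts]

queries = with_prefix(["ঢাকা কোথায়?"], "query")
documents = with_prefix(["ঢাকা বাংলাদেশের রাজধানী।"], "passage")
print(queries[0])
print(documents[0])
```

Text prefixed this way should be encoded without `prompt_name`, so that the prefix is not applied twice.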

License & data

Released under apache-2.0. Built on multilingual-e5-small (MIT) and distilled from bge-m3 (MIT); trained on public Bangla / parallel corpora. Users should confirm downstream data-license compatibility for their use case.
