bangla-embed-e5-small (118M)
A 118M-parameter Bangla/English text embedding model, built on
intfloat/multilingual-e5-small and trained in three stages: cross-lingual
distillation with a retrieval-retention anchor, then supervised contrastive training
with gold hard negatives, then an NLI polish. Bangla retrieval is the primary use case.
Requires e5-style prompts. Use
prompt_name="query" for queries and prompt_name="passage" for documents; accuracy drops sharply without them.
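In sentence-transformers, `prompt_name` looks up a prompt string in the model's configuration and prepends it to every input before tokenization. The sketch below shows the equivalent manual step. It assumes the configured strings follow the standard e5 format (`"query: "` and `"passage: "`, each with a trailing space); the helper name `add_e5_prefix` is hypothetical, and the exact strings should be checked against the model's own config.

```python
# Assumed prompt strings (standard e5 format); verify against the model config.
E5_PROMPTS = {"query": "query: ", "passage": "passage: "}


def add_e5_prefix(texts, kind):
    """Prepend the e5-style prompt for `kind` ("query" or "passage") to each text."""
    if kind not in E5_PROMPTS:
        raise ValueError(f"unknown prompt kind: {kind!r}")
    prefix = E5_PROMPTS[kind]
    return [prefix + text for text in texts]


queries = add_e5_prefix(["ঢাকা কোথায়?"], "query")
passages = add_e5_prefix(["ঢাকা বাংলাদেশের রাজধানী।"], "passage")
print(queries[0])
print(passages[0])
```

If the assumption about the strings holds, encoding these prefixed texts without `prompt_name` should be equivalent to encoding the raw texts with it.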
Usage
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("kazalbrur/bangla-embed-e5-small")
q = model.encode(["ঢাকা কোথায়?"], prompt_name="query", normalize_embeddings=True)
d = model.encode(["ঢাকা বাংলাদেশের রাজধানী।"], prompt_name="passage", normalize_embeddings=True)
print(q @ d.T) # cosine similarity
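The same pattern extends to ranking several passages for one query: encode each side with its prompt, then sort passages by dot product, which equals cosine similarity when embeddings are normalized. The following is a minimal sketch; the helper `rank_passages`, the `demo` function, and the example passages are illustrative and not part of the model's API.

```python
def rank_passages(query_vec, passage_vecs):
    """Return (index, score) pairs sorted by descending dot product.

    With L2-normalized embeddings the dot product equals cosine similarity.
    """
    scores = [
        sum(float(q) * float(p) for q, p in zip(query_vec, vec))
        for vec in passage_vecs
    ]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


def demo():
    """Rank example passages with the real model (downloads weights on first use)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("kazalbrur/bangla-embed-e5-small")
    passages = [
        "ঢাকা বাংলাদেশের রাজধানী।",
        "আজ একটি রৌদ্রোজ্জ্বল দিন",
    ]
    q = model.encode(["ঢাকা কোথায়?"], prompt_name="query", normalize_embeddings=True)
    d = model.encode(passages, prompt_name="passage", normalize_embeddings=True)
    for idx, score in rank_passages(q[0], d):
        print(f"{score:.3f}  {passages[idx]}")


# Toy check with hand-made unit vectors (no model needed).
print(rank_passages([1.0, 0.0], [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]))
```

Calling `demo()` runs the ranking against the actual model; the toy call at the bottom only exercises the sorting logic.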
Evaluation
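All numbers below are measured under one harness against strong multilingual and Bangla baselines; this model's rows are labeled "(ours)". It has the highest mean score among the models at or below 120M parameters in the MTEB table, although it does not lead every individual task.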
MTEB(Indic, v1) — Bengali subset (official mteb, 7 tasks, main score)
| Model | Params | Retrieval | Classif. | Bitext-Gen | Bitext-Conv | Rerank | STS | Cluster | Mean |
|---|---|---|---|---|---|---|---|---|---|
| BAAI/bge-m3 | 568M | 0.644 | 0.879 | 0.874 | 0.722 | 0.852 | 0.593 | 0.340 | 0.700 |
| multilingual-e5-large | 560M | 0.631 | 0.847 | 0.876 | 0.748 | 0.852 | 0.540 | 0.339 | 0.690 |
| bangla-embed-e5-small (ours) | 118M | 0.572 | 0.848 | 0.832 | 0.668 | 0.840 | 0.554 | 0.349 | 0.666 |
| multilingual-e5-small (base) | 118M | 0.535 | 0.832 | 0.848 | 0.699 | 0.835 | 0.538 | 0.309 | 0.656 |
| LaBSE | 109M | 0.442 | 0.804 | 0.849 | 0.705 | 0.791 | 0.583 | 0.239 | 0.631 |
| paraphrase-mpnet-base-v2 | 278M | 0.337 | 0.749 | 0.618 | 0.426 | 0.701 | 0.355 | 0.370 | 0.508 |
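By mean score it ranks 3rd of 6, behind only bge-m3 and multilingual-e5-large (both ~4.8× larger), and 1st among models ≤120M. It improves over its e5-small base on 5 of 7 tasks (Retrieval +0.037, Clustering +0.040, Classification and STS +0.016 each, Reranking +0.005) and regresses on the two bitext tasks (Bitext-Gen −0.016, Bitext-Conv −0.031).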
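The per-task deltas against the e5-small base can be recomputed directly from the two table rows. The snippet below is only a convenience check on the published three-decimal figures; it does not rerun any evaluation.

```python
TASKS = ["Retrieval", "Classif.", "Bitext-Gen", "Bitext-Conv", "Rerank", "STS", "Cluster"]

# Scores copied from the table above (three-decimal rounded values).
ours = [0.572, 0.848, 0.832, 0.668, 0.840, 0.554, 0.349]
base = [0.535, 0.832, 0.848, 0.699, 0.835, 0.538, 0.309]

deltas = {task: round(o - b, 3) for task, o, b in zip(TASKS, ours, base)}
improved = [task for task, delta in deltas.items() if delta > 0]

for task, delta in deltas.items():
    print(f"{task:12s} {delta:+.3f}")
print(f"improved on {len(improved)}/{len(TASKS)} tasks")
print(f"mean (ours): {sum(ours) / len(ours):.3f}")
```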
Cross-lingual bn↔en (bitext mining)
| Model | Params | FLORES R@1 | Tatoeba acc@1 |
|---|---|---|---|
| LaBSE | 109M | 1.000 | 0.915 |
| multilingual-e5-large | 560M | 0.999 | 0.900 |
| BAAI/bge-m3 | 568M | 0.999 | 0.882 |
| bangla-embed-e5-small (ours) | 118M | 0.999 | 0.877 |
| multilingual-e5-small (base) | 118M | 0.997 | 0.875 |
| paraphrase-mpnet-base-v2 | 278M | 0.930 | 0.732 |
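Within 0.005 of bge-m3 on Tatoeba (0.877 vs. 0.882) at roughly one fifth of the parameters. FLORES is saturated, with every model except paraphrase-mpnet-base-v2 at 0.997 or higher. LaBSE, a bitext specialist, leads both columns.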
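Both columns are top-1 scores for bitext mining: each source sentence is matched to its nearest target sentence by cosine similarity, and a match counts as a hit when it is the aligned translation. The sketch below illustrates that idea on toy vectors. It assumes the two lists are index-aligned (pair `i` is a translation pair) and is not the harness's actual code; `top1_accuracy` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top1_accuracy(src_vecs, tgt_vecs):
    """Fraction of source rows whose nearest target row (by cosine) has the same index."""
    hits = 0
    for i, src in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        hits += int(best == i)
    return hits / len(src_vecs)


# Toy embeddings: rows 0 and 1 find their translation, row 2 is pulled to row 0.
bn = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
en = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
print(top1_accuracy(bn, en))
```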
Bangla retrieval (this model)
| Benchmark | Metric | Score |
|---|---|---|
| NanoBEIR-bn (zero-shot) | nDCG@10 | 0.454 |
| MIRACL-bn | nDCG@10 / R@100 | 0.614 / 0.927 |
| Mr.TyDi-bn | nDCG@10 / R@100 | 0.614 / 0.905 |
| IndicMSMARCO / MSMARCO-bn | R@10 | 0.907 / 0.881 |
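For readers unfamiliar with the metrics in this table, the sketch below gives the common textbook definitions of nDCG@k (with linear gain and a log2 rank discount) and Recall@k. It is an illustration under those assumptions, not the evaluation harness used for the numbers above; the function names and the toy ranking are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k; `relevance` maps doc id -> graded relevance (missing ids count as 0)."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


# Toy query: two relevant documents, only one of them retrieved (at rank 2).
ranking = ["d3", "d1", "d7"]
qrels = {"d1": 1, "d2": 1}
print(round(ndcg_at_k(ranking, qrels, k=10), 3))
print(recall_at_k(ranking, qrels.keys(), k=100))
```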
Training
- Base: intfloat/multilingual-e5-small (12L / H384 / vocab 250k).
- Stage 1: two-view cross-lingual distillation from BAAI/bge-m3 (matryoshka MSE) + InfoNCE alignment + a functional self-distillation anchor to the frozen e5 base that preserves e5's retrieval pretraining.
- Stage 2: supervised contrastive (MNR) with gold + GPU-mined hard negatives, e5 query:/passage: prefixes, seq 256.
- Stage 3: NLI triplet polish.
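The InfoNCE term in Stage 1 and the MNR loss in Stage 2 are both commonly implemented as cross-entropy over a batch similarity matrix, where the diagonal holds the positive pairs and the other entries serve as in-batch negatives. The sketch below shows that general form in plain Python. It is not the training code for this model, and the temperature of 0.05 is an illustrative assumption rather than a reported hyperparameter.

```python
import math


def info_nce(sim, temperature=0.05):
    """Mean cross-entropy over rows of a square similarity matrix.

    sim[i][j] is the similarity between query i and passage j; the diagonal
    holds the positive pairs and every other column acts as a negative.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(sim)


aligned = [[1.0, 0.1, 0.0], [0.2, 1.0, 0.1], [0.0, 0.1, 1.0]]
uninformative = [[0.5] * 3 for _ in range(3)]
print(info_nce(aligned))        # close to zero: positives dominate each row
print(info_nce(uninformative))  # ln(3): every passage looks equally likely
```

Mined hard negatives fit the same form by adding extra columns to each row, so the matrix no longer needs to be square; this sketch keeps the square case for brevity.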
Limitations
- STS and bitext mining trail bitext specialists (e.g. LaBSE) — this is a retrieval / cross-lingual model, not a graded-similarity model.
- No transliteration ("Banglish") coverage — romanized Bangla input is out of domain.
- Requires the query:/passage: prompts.
License & data
Released under apache-2.0. Built on multilingual-e5-small (MIT) and distilled from
bge-m3 (MIT); trained on public Bangla / parallel corpora. Users should confirm
downstream data-license compatibility for their use case.