uzbek-e5-small

A small multilingual sentence-embedding model fine-tuned from intfloat/multilingual-e5-small for Uzbek semantic search and retrieval, including cross-lingual uz↔en.

This is the flagship model of a two-base study. Its base, multilingual-e5-small, is already strong at Uzbek, so fine-tuning yields only a small gain: no metric regresses, and all but one improve. That is the honest result for a near-ceiling starting point. The companion sukhrobnurali/uzbek-minilm starts from a much weaker base and shows the large delta.

Intended use

  • Uzbek semantic search / passage retrieval (RAG over Uzbek documents).
  • Cross-lingual uz↔en retrieval and bitext alignment.
  • General-purpose Uzbek sentence similarity and clustering.

Prefixes (important)

Like the base e5 model, this model expects task prefixes:

  • Retrieval: prefix queries with "query: " and documents with "passage: ".
  • Symmetric tasks (similarity, bitext): use "query: " on both sides.

Embeddings are L2-normalized; compare with cosine similarity (dot product on normalized vectors).
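
As a minimal sketch of the prefix rules above, the helper below prepends the expected prefix to raw strings. The function name add_prefix and its role argument are illustrative assumptions; they are not part of the model or of sentence-transformers.

```python
def add_prefix(texts, role):
    """Prepend the e5-style prefix for the given role.

    role: "query" or "passage" (hypothetical argument for this sketch).
    """
    if role not in ("query", "passage"):
        raise ValueError(f"unknown role: {role!r}")
    prefix = f"{role}: "
    # Skip texts that already carry the prefix so it is never doubled.
    return [t if t.startswith(prefix) else prefix + t for t in texts]


# Retrieval: queries and documents get different prefixes.
queries = add_prefix(["O'zbekistonning poytaxti qaysi shahar?"], "query")
passages = add_prefix(["Toshkent O'zbekistonning poytaxti."], "passage")

# Symmetric tasks (similarity, bitext): "query: " on both sides.
left = add_prefix(["Salom, dunyo!"], "query")
right = add_prefix(["Hello, world!"], "query")
```

Keeping the prefix logic in one place makes it harder to forget the prefix on one side, which would silently degrade similarity scores rather than raise an error.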

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sukhrobnurali/uzbek-e5-small")

queries = ["query: O'zbekistonning poytaxti qaysi shahar?"]
passages = [
    "passage: Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri.",
    "passage: Samarqand — O'zbekistondagi qadimiy shaharlardan biri.",
]

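# Normalized embeddings make the dot product equal to cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T  # shape: (num_queries, num_passages)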
print(scores)  # highest score on the Tashkent passage

Requires sentence-transformers>=5.5.1, the version the model was saved with. Older versions cannot load it and fail with ModuleNotFoundError: No module named 'sentence_transformers.base'.
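
One way to surface the version requirement early is a small guard that runs before the model is loaded. The sketch below uses only the standard library. The names MIN_VERSION, parse_version, is_supported and check_sentence_transformers are hypothetical, and the parser is a simplification: it reads only the leading numeric components, so a pre-release such as 5.5.1rc1 is treated like 5.5.1.

```python
import re
from importlib import metadata

MIN_VERSION = (5, 5, 1)  # minimum stated in this model card


def parse_version(text):
    """Return the leading numeric components, e.g. "5.5.1.dev0" -> (5, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_supported(installed, minimum=MIN_VERSION):
    """True if the installed version string is at least the minimum."""
    parts = parse_version(installed)
    # Pad short versions so that "5.6" compares as (5, 6, 0).
    parts += (0,) * (len(minimum) - len(parts))
    return parts >= minimum


def check_sentence_transformers():
    """Raise a readable error instead of a ModuleNotFoundError at load time."""
    try:
        installed = metadata.version("sentence-transformers")
    except metadata.PackageNotFoundError:
        raise RuntimeError("sentence-transformers is not installed") from None
    if not is_supported(installed):
        raise RuntimeError(
            f"sentence-transformers {installed} found, but >=5.5.1 is required"
        )
```

Calling check_sentence_transformers() once at startup turns the load failure into an explicit message about the required version.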

Evaluation

The same protocol is applied to the base and fine-tuned models, so the delta is a fair comparison. Both held-out sets are disjoint from training. The retrieval set is the dataset's wiki_retrieval_eval/test split. For FLORES+, training only ever sees dev (via the dataset's validation split), so devtest stays clean.

Monolingual Uzbek retrieval — wiki_retrieval_eval/test (5,000 title→paragraph)

Metric       Base      Fine-tuned   Δ
Recall@1     0.9868    0.9914       +0.0046
Recall@5     0.9948    0.9964       +0.0016
Recall@10    0.9962    0.9972       +0.0010
MRR@10       0.9904    0.9936       +0.0032
nDCG@10      0.9918    0.9945       +0.0027
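
For readers who want to recompute metrics like these, here is a minimal pure-Python sketch of Recall@k, MRR@k and nDCG@k. It assumes each query has exactly one relevant passage, as in a title→paragraph setup. It is not the evaluation script from the training repository; the function names are hypothetical, and ties are resolved in favor of the gold passage.

```python
import math


def gold_rank(scores, gold_index):
    """1-based rank of the gold passage given one query's similarity scores."""
    gold_score = scores[gold_index]
    # Count passages scored strictly higher than the gold one.
    return 1 + sum(1 for s in scores if s > gold_score)


def retrieval_metrics(score_matrix, gold_indices, k):
    """Recall@k, MRR@k and nDCG@k with exactly one relevant passage per query."""
    ranks = [gold_rank(row, g) for row, g in zip(score_matrix, gold_indices)]
    n = len(ranks)
    recall = sum(r <= k for r in ranks) / n
    mrr = sum(1.0 / r for r in ranks if r <= k) / n
    # With a single relevant item the ideal DCG is 1, so nDCG = 1 / log2(1 + rank).
    ndcg = sum(1.0 / math.log2(1 + r) for r in ranks if r <= k) / n
    return {"recall": recall, "mrr": mrr, "ndcg": ndcg}


# Toy example: 3 queries x 3 passages, gold passage i for query i.
scores = [
    [0.9, 0.2, 0.1],  # gold ranked 1st
    [0.8, 0.7, 0.1],  # gold ranked 2nd
    [0.1, 0.2, 0.9],  # gold ranked 1st
]
print(retrieval_metrics(scores, gold_indices=[0, 1, 2], k=1))
print(retrieval_metrics(scores, gold_indices=[0, 1, 2], k=10))
```

In practice score_matrix would be the q_emb @ p_emb.T product over the full evaluation set; plain lists are used here only to keep the sketch dependency-free.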

Cross-lingual uz↔en bitext — FLORES+ devtest (1,012 pairs)

Metric            Base      Fine-tuned   Δ
uz→en accuracy    0.9713    0.9901       +0.0188
en→uz accuracy    0.9852    0.9852       +0.0000
Mean accuracy     0.9783    0.9876       +0.0094
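
Bitext accuracy of this kind is commonly computed as top-1 matching over a similarity matrix, once from each side. The sketch below assumes that sentence i on the source side translates to sentence i on the target side. The function names are hypothetical, and this is not claimed to be the exact evaluation code used here.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def bitext_accuracy(sim):
    """Top-1 matching accuracy in both directions for a square similarity matrix.

    sim[i][j] is the similarity between source sentence i and target sentence j;
    the correct translation of source i is assumed to be target i.
    """
    n = len(sim)
    forward = sum(argmax(sim[i]) == i for i in range(n)) / n
    columns = list(zip(*sim))  # transpose to search from the target side
    backward = sum(argmax(columns[j]) == j for j in range(n)) / n
    return forward, backward, (forward + backward) / 2


# Toy example: source 1 prefers the wrong target, but every target
# still finds its own source, so the two directions differ.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.4, 0.8],
    [0.1, 0.2, 0.9],
]
forward, backward, mean = bitext_accuracy(sim)
print(forward, backward, mean)
```

The toy matrix shows why the two directions can disagree: one direction takes the maximum over rows and the other over columns of the same matrix.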

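The base is already near the ceiling for Uzbek, so the gains are small. No metric regresses: every retrieval metric and uz→en accuracy improve, while en→uz accuracy is unchanged. The fine-tuned model therefore matches or beats the base on every metric and is the one shipped.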

Training data

sukhrobnurali/uzbek-embedding-pairs — 356,278 (anchor, positive) pairs in the default/train split, used as-is:

Source             Share    Pair type
Uzbek Wikipedia    ~56%     title ↔ paragraph
OPUS-100 uz↔en     ~34%     parallel sentence ↔ translation
Latin↔Cyrillic     ~10%     same sentence, two scripts

Anchors are prefixed with "query: " and positives with "passage: " at training time, matching the retrieval framing used at inference.

Limitations

  • Low-resource language: coverage is thinner than for high-resource languages, and rare domains (legal, medical, dialectal) are under-represented.
  • Part of the data is OPUS-100, which carries machine-translation noise.
  • The monolingual eval is a title→paragraph proxy for retrieval, not a curated query set.
  • Trained on Latin-script Uzbek; Cyrillic appears only via the Latin↔Cyrillic script-pair source (~10% of pairs), so Cyrillic retrieval quality is less thoroughly evaluated.

Reproducibility

Training used a fixed seed (42) and 1 epoch of MultipleNegativesRankingLoss with in-batch negatives: batch size 192, learning rate 2e-5, 10% warmup, max_seq_length=192, and bf16 precision on an Ampere GPU. All hyperparameters live in config.py; training and evaluation scripts are in the training repository.
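
To illustrate what an in-batch-negatives ranking loss computes, here is a small pure-Python sketch: for each anchor, its own positive is the target class and every other positive in the batch serves as a negative. With a batch size of 192, that gives each anchor 191 in-batch negatives. The scale of 20 is an assumption (believed to match the sentence-transformers default); the card does not state the value used. The function names are hypothetical and this is not the library implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positive i is the target for anchor i.

    Every other positive in the batch acts as a negative for anchor i.
    scale=20.0 is an assumption for this sketch, not a value from the card.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch of 2 pairs in 2 dimensions.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives paired with the wrong anchor
print(in_batch_ranking_loss(anchors, aligned))  # small loss
print(in_batch_ranking_loss(anchors, swapped))  # much larger loss
```

Because the negatives come from the batch itself, a larger batch gives each anchor more candidates to be ranked against.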

License

MIT, inherited from the base model intfloat/multilingual-e5-small.
