metadata
language:
  - uz
  - en
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: intfloat/multilingual-e5-small
datasets:
  - sukhrobnurali/uzbek-embedding-pairs
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - embeddings
  - uzbek
  - retrieval
  - e5
model-index:
  - name: uzbek-e5-small
    results:
      - task:
          type: information-retrieval
          name: Monolingual Uzbek retrieval (Wikipedia title to paragraph)
        dataset:
          type: sukhrobnurali/uzbek-embedding-pairs
          name: uzbek-embedding-pairs (wiki_retrieval_eval/test, 5k held-out)
        metrics:
          - type: recall_at_1
            value: 0.9914
            name: Recall@1
          - type: recall_at_5
            value: 0.9964
            name: Recall@5
          - type: recall_at_10
            value: 0.9972
            name: Recall@10
          - type: mrr_at_10
            value: 0.9936
            name: MRR@10
          - type: ndcg_at_10
            value: 0.9945
            name: nDCG@10
      - task:
          type: bitext-mining
          name: Cross-lingual uz-en bitext retrieval (FLORES+ devtest)
        dataset:
          type: openlanguagedata/flores_plus
          name: FLORES+ devtest (uzn_Latn / eng_Latn, 1012 pairs)
        metrics:
          - type: accuracy
            value: 0.9876
            name: Mean accuracy (both directions)

uzbek-e5-small

A small multilingual sentence-embedding model fine-tuned from intfloat/multilingual-e5-small for Uzbek semantic search and retrieval, including cross-lingual uz↔en retrieval.

This is the flagship model of a two-base study. Its base, multilingual-e5-small, is already strong at Uzbek, so fine-tuning yields only a small gain: every reported metric improves except en→uz bitext accuracy, which is unchanged. That is the honest result for a near-ceiling starting point. The companion sukhrobnurali/uzbek-minilm starts from a much weaker base and shows the larger delta.

Intended use

  • Uzbek semantic search / passage retrieval (RAG over Uzbek documents).
  • Cross-lingual uz↔en retrieval and bitext alignment.
  • General-purpose Uzbek sentence similarity and clustering.

Prefixes (important)

Like the base e5 model, this model expects task prefixes:

  • Retrieval: prefix queries with "query: " and documents with "passage: " (each prefix ends with a single space).
  • Symmetric tasks (similarity, bitext): use "query: " on both sides.

Embeddings are L2-normalized; compare with cosine similarity (dot product on normalized vectors).
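The prefix rules above are easy to apply inconsistently, so a small helper can be useful. The following is a minimal sketch; `with_prefix`, `prepare_retrieval` and `prepare_symmetric` are hypothetical names introduced here for illustration and are not part of sentence-transformers.

```python
from typing import Iterable, List, Tuple

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(texts: Iterable[str], prefix: str) -> List[str]:
    """Prepend `prefix` to each text unless the text already starts with it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def prepare_retrieval(
    queries: Iterable[str], documents: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Asymmetric retrieval: queries get "query: ", documents get "passage: "."""
    return with_prefix(queries, QUERY_PREFIX), with_prefix(documents, PASSAGE_PREFIX)


def prepare_symmetric(
    left: Iterable[str], right: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Symmetric tasks (similarity, bitext): "query: " on both sides."""
    return with_prefix(left, QUERY_PREFIX), with_prefix(right, QUERY_PREFIX)
```

The returned lists can be passed directly to `model.encode(...)`.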

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sukhrobnurali/uzbek-e5-small")

queries = ["query: O'zbekistonning poytaxti qaysi shahar?"]
passages = [
    "passage: Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri.",
    "passage: Samarqand — O'zbekistondagi qadimiy shaharlardan biri.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T
print(scores)  # highest score on the Tashkent passage

Requires sentence-transformers>=5.5.1 — the version the model was saved with. Older versions cannot load it (ModuleNotFoundError: No module named 'sentence_transformers.base').
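To fail early with a clearer message than the ModuleNotFoundError, the installed version can be checked before loading. This is a rough sketch using only the standard library: `parse_release`, `is_supported` and `installed_is_supported` are hypothetical helpers, and the comparison looks only at the leading numeric release (pre-release suffixes such as "rc1" are ignored).

```python
import re
from importlib import metadata
from typing import Tuple

MIN_VERSION = (5, 5, 1)  # from the requirement stated above


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. "5.5.1.dev0" -> (5, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_supported(version: str, minimum: Tuple[int, ...] = MIN_VERSION) -> bool:
    """True if `version` is at least `minimum`, comparing numeric parts only."""
    release = parse_release(version)
    width = max(len(release), len(minimum))
    # Pad with zeros so that "6.0" compares as (6, 0, 0).
    padded_release = release + (0,) * (width - len(release))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_release >= padded_minimum


def installed_is_supported() -> bool:
    """Check the installed sentence-transformers; False if it is not installed."""
    try:
        return is_supported(metadata.version("sentence-transformers"))
    except metadata.PackageNotFoundError:
        return False
```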

Evaluation

The same protocol is applied to the base and fine-tuned models, so the delta is a fair comparison. Both held-out sets are disjoint from training: the retrieval evaluation uses the dataset's wiki_retrieval_eval/test split, and for FLORES+, training only ever sees dev (via the dataset's validation split), so devtest stays clean.

Monolingual Uzbek retrieval — wiki_retrieval_eval/test (5,000 title→paragraph)

| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| Recall@1 | 0.9868 | 0.9914 | +0.0046 |
| Recall@5 | 0.9948 | 0.9964 | +0.0016 |
| Recall@10 | 0.9962 | 0.9972 | +0.0010 |
| MRR@10 | 0.9904 | 0.9936 | +0.0032 |
| nDCG@10 | 0.9918 | 0.9945 | +0.0027 |
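For readers who want to recompute these metrics from their own rankings, here is a minimal sketch. It assumes exactly one relevant paragraph per title query, which matches the title→paragraph framing but is an assumption about the evaluation script, not something taken from it. `rank_of_gold` and `retrieval_metrics` are hypothetical names.

```python
import math
from typing import Dict, Optional, Sequence


def rank_of_gold(ranked_ids: Sequence[str], gold_id: str) -> Optional[int]:
    """Return the 1-based rank of the gold document, or None if it is absent."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return position
    return None


def retrieval_metrics(
    rankings: Sequence[Sequence[str]],
    gold: Sequence[str],
    ks: Sequence[int] = (1, 5, 10),
    cutoff: int = 10,
) -> Dict[str, float]:
    """Recall@k, MRR@cutoff and nDCG@cutoff with one relevant doc per query."""
    n = len(gold)
    ranks = [rank_of_gold(r, g) for r, g in zip(rankings, gold)]
    found = [r for r in ranks if r is not None and r <= cutoff]
    metrics = {
        f"recall@{k}": sum(1 for r in ranks if r is not None and r <= k) / n
        for k in ks
    }
    # Reciprocal rank of the gold document, 0 if it is outside the cutoff.
    metrics[f"mrr@{cutoff}"] = sum(1.0 / r for r in found) / n
    # With a single relevant document the ideal DCG is 1,
    # so nDCG reduces to 1 / log2(1 + rank).
    metrics[f"ndcg@{cutoff}"] = sum(1.0 / math.log2(1 + r) for r in found) / n
    return metrics
```

Each entry of `rankings` is the list of retrieved document ids for one query, best first, and `gold` holds the matching correct id.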

Cross-lingual uz↔en bitext — FLORES+ devtest (1,012 pairs)

| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| uz→en accuracy | 0.9713 | 0.9901 | +0.0188 |
| en→uz accuracy | 0.9852 | 0.9852 | +0.0000 |
| Mean accuracy | 0.9783 | 0.9876 | +0.0094 |

The base is already near the ceiling for Uzbek, so the gains are small. Every metric improves except en→uz accuracy, which is unchanged, and none regresses, so the fine-tuned model is the one shipped.
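The bitext accuracy reported above can be read as nearest-neighbor matching: a source sentence counts as correct when its highest-scoring target is the aligned one. The sketch below illustrates that definition on plain lists of already-normalized vectors. It is an assumed reading of the metric, and `direction_accuracy` and `bitext_accuracy` are hypothetical names rather than the evaluation script's own.

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product, equal to cosine similarity for L2-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


def direction_accuracy(src: Sequence[Vector], tgt: Sequence[Vector]) -> float:
    """Fraction of source rows whose best-scoring target has the same index."""
    hits = 0
    for i, s in enumerate(src):
        scores = [dot(s, t) for t in tgt]
        best = max(range(len(tgt)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(src)


def bitext_accuracy(uz: Sequence[Vector], en: Sequence[Vector]) -> Dict[str, float]:
    """Accuracy in both directions plus their mean; row i of uz aligns with row i of en."""
    uz_en = direction_accuracy(uz, en)
    en_uz = direction_accuracy(en, uz)
    return {"uz->en": uz_en, "en->uz": en_uz, "mean": (uz_en + en_uz) / 2}
```

In practice both sides would be encoded with the "query: " prefix, as described under Prefixes, and a matrix product would replace the Python loops.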

Training data

sukhrobnurali/uzbek-embedding-pairs — 356,278 (anchor, positive) pairs in the default/train split, used as-is:

| Source | Share | Pair type |
|---|---|---|
| Uzbek Wikipedia | ~56% | title ↔ paragraph |
| OPUS-100 uz↔en | ~34% | parallel sentence ↔ translation |
| Latin↔Cyrillic | ~10% | same sentence, two scripts |

Anchors are prefixed query: and positives passage: at training time, matching the retrieval framing used at inference.

Limitations

  • Low-resource language: coverage is thinner than for high-resource languages, and rare domains (legal, medical, dialectal) are under-represented.
  • Part of the data is OPUS-100, which carries machine-translation noise.
  • The monolingual eval is a title→paragraph proxy for retrieval, not a curated query set.
  • Trained on Latin-script Uzbek; Cyrillic appears only via the script-pair source, so Cyrillic retrieval quality is less thoroughly evaluated.

Reproducibility

Fixed seed (42); 1 epoch of MultipleNegativesRankingLoss with in-batch negatives, batch size 192, lr 2e-5, 10% warmup, max_seq_length=192, bf16 on Ampere. All hyperparameters live in config.py; training and evaluation scripts are in the training repository.
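To make the objective concrete, the following sketch computes a multiple-negatives ranking loss with in-batch negatives on plain Python lists: each anchor is scored against every positive in the batch, and cross-entropy pushes the aligned pair above the rest. It illustrates the idea only and is not the training code. The `scale` of 20 is an assumption (the card does not state it), and `mnrl_loss` is a hypothetical name; real training would use `sentence_transformers.losses.MultipleNegativesRankingLoss`.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product, equal to cosine similarity for L2-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


def mnrl_loss(
    anchors: Sequence[Vector],
    positives: Sequence[Vector],
    scale: float = 20.0,  # assumed value, not taken from the model card
) -> float:
    """Mean cross-entropy where positives[i] is the target for anchors[i].

    All other positives in the batch act as negatives for anchors[i].
    Inputs are expected to be L2-normalized.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * dot(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)
```

Because every other positive in the batch serves as a negative, a larger batch (192 here) gives each anchor more negatives per step.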

License

MIT, inherited from the base model intfloat/multilingual-e5-small.