uzbek-minilm

A multilingual sentence-embedding model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 for Uzbek semantic search and retrieval, including cross-lingual uz↔en retrieval.

This model demonstrates the large gain achievable when the base model is weak at the target language. The base MiniLM handles Uzbek poorly (Recall@1 = 0.26); after one epoch on Uzbek pairs it reaches Recall@1 = 0.97. The companion flagship sukhrobnurali/uzbek-e5-small starts from a stronger base and is the recommended model for cross-lingual work.

Intended use

  • Uzbek semantic search / passage retrieval (RAG over Uzbek documents).
  • Cross-lingual uz↔en retrieval and bitext alignment.
  • General-purpose Uzbek sentence similarity and clustering.

Unlike e5, this model uses no input prefixes: encode queries and documents as plain text. Embeddings are L2-normalized, so compare them with cosine similarity (for unit-length vectors this is the same as a dot product).
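
To make the comparison step concrete, here is a minimal sketch of cosine-based top-k retrieval over precomputed embeddings. The helper names `l2_normalize` and `top_k` are hypothetical, and the toy 3-dimensional vectors only stand in for real model output; the sketch normalizes defensively, so it also works on embeddings that are not yet unit length.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against all-zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def top_k(query_emb: np.ndarray, passage_embs: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k passages most similar to one query."""
    q = l2_normalize(query_emb.reshape(1, -1))
    p = l2_normalize(passage_embs)
    scores = (q @ p.T)[0]            # dot product of unit vectors == cosine similarity
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]


# Toy vectors (assumption: real embeddings would come from model.encode).
passages = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 3.0, 0.0]])
query = np.array([2.0, 0.1, 0.0])
idx, sims = top_k(query, passages, k=2)
print(idx, sims)
```

In practice the passage matrix would be encoded once and cached, and only the query would be encoded at search time.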

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sukhrobnurali/uzbek-minilm")

query = ["O'zbekistonning poytaxti qaysi shahar?"]
passages = [
    "Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri.",
    "Samarqand — O'zbekistondagi qadimiy shaharlardan biri.",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T
print(scores)  # highest score on the Tashkent passage

Requires sentence-transformers>=5.5.1 — the version the model was saved with. Older versions cannot load it (ModuleNotFoundError: No module named 'sentence_transformers.base').
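
A quick way to catch this before loading is to compare the installed version against the stated minimum. The sketch below uses only the standard library; the helper names (`release_tuple`, `meets_minimum`, `check_installed`) are hypothetical, and the parser deliberately handles only plain `X.Y.Z` release numbers, ignoring any suffix such as `.dev0`.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = "5.5.1"  # minimum stated in this model card


def release_tuple(v: str) -> tuple:
    """Turn '5.5.1' or '5.5.1.dev0' into (5, 5, 1), ignoring non-numeric suffixes."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_VERSION) -> bool:
    """True if the installed release is at least the required one."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_installed() -> str:
    """Return a human-readable status for the local environment."""
    try:
        installed = version("sentence-transformers")
    except PackageNotFoundError:
        return "sentence-transformers is not installed"
    if meets_minimum(installed):
        return f"OK: sentence-transformers {installed}"
    return f"Too old: {installed} < {MIN_VERSION}; upgrade before loading the model"


print(check_installed())
```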

Evaluation

The same protocol is applied to the base and fine-tuned models so the delta is a fair comparison. Both held-out sets are disjoint from training: the retrieval set is the dataset's wiki_retrieval_eval/test split, and the only FLORES+ data seen during training is dev (via the dataset's validation split), so devtest stays clean.

Monolingual Uzbek retrieval — wiki_retrieval_eval/test (5,000 title→paragraph)

Metric      Base     Fine-tuned   Δ
Recall@1    0.2564   0.9692       +0.7128
Recall@5    0.3742   0.9854       +0.6112
Recall@10   0.4314   0.9892       +0.5578
MRR@10      0.3072   0.9765       +0.6693
nDCG@10     0.3367   0.9796       +0.6429
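
For readers who want to see how such metrics are typically computed, here is a minimal sketch. It assumes exactly one relevant paragraph per title (which fits a title→paragraph setup but is an assumption about this evaluation, not a copy of the author's script), and the function name `retrieval_metrics` is hypothetical. With a single relevant document, nDCG@k reduces to 1/log2(rank + 1) because the ideal DCG is 1.

```python
import numpy as np


def retrieval_metrics(scores: np.ndarray, gold: np.ndarray, k: int = 10) -> dict:
    """scores[i, j]: similarity of query i to document j.
    gold[i]: index of the single relevant document for query i."""
    n = scores.shape[0]
    gold_scores = scores[np.arange(n), gold]
    # 1-based rank of the gold document = 1 + number of documents scoring strictly higher.
    # Ties are resolved in favour of the gold document (an optimistic choice).
    ranks = 1 + (scores > gold_scores[:, None]).sum(axis=1)
    hit = ranks <= k
    return {
        "recall@1": float((ranks <= 1).mean()),
        f"recall@{k}": float(hit.mean()),
        f"mrr@{k}": float(np.where(hit, 1.0 / ranks, 0.0).mean()),
        f"ndcg@{k}": float(np.where(hit, 1.0 / np.log2(ranks + 1), 0.0).mean()),
    }


# Toy example: 3 queries, 4 documents; gold ranks are 1, 3 and 4.
toy_scores = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.5, 0.4, 0.6, 0.1],
    [0.1, 0.2, 0.3, 0.8],
])
toy_gold = np.array([0, 1, 0])
metrics = retrieval_metrics(toy_scores, toy_gold, k=3)
print(metrics)
```

Under the one-relevant-document assumption, Recall@1 ≤ MRR@10 ≤ nDCG@10 ≤ Recall@10 must hold, which the table above satisfies for both models.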

Cross-lingual uz↔en bitext — FLORES+ devtest (1,012 pairs)

Metric           Base     Fine-tuned   Δ
uz→en accuracy   0.4575   0.8745       +0.4170
en→uz accuracy   0.4872   0.8251       +0.3379
Mean accuracy    0.4723   0.8498       +0.3775
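
Bitext accuracy of this kind is commonly computed as nearest-neighbour matching: a source sentence counts as correct when its most similar target sentence is its own translation. The sketch below illustrates that idea under the assumption that row i of both matrices is the same sentence pair; `bitext_accuracy` is a hypothetical helper, not the author's evaluation code.

```python
import numpy as np


def bitext_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source rows whose nearest target row (by cosine) has the same index."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    return float((nearest == np.arange(len(src))).mean())


# Toy data: 4 "sentences"; the last two target rows are swapped,
# so half of the pairs are mismatched in each direction.
uz = np.eye(4)
en = np.eye(4)[[0, 1, 3, 2]]
uz_to_en = bitext_accuracy(uz, en)
en_to_uz = bitext_accuracy(en, uz)
mean_acc = (uz_to_en + en_to_uz) / 2
print(uz_to_en, en_to_uz, mean_acc)
```

The two directions can differ on real data because the nearest neighbour of a sentence is not a symmetric relation, which is why the table reports both and their mean.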

Fine-tuning nearly closes the monolingual gap: the fine-tuned MiniLM (R@1 = 0.969) almost matches the e5 baseline (0.987). On the cross-lingual FLORES+ benchmark it still trails the e5 family (0.85 vs 0.99), so prefer uzbek-e5-small when uz↔en accuracy matters most.

Training data

sukhrobnurali/uzbek-embedding-pairs — 356,278 (anchor, positive) pairs in the default/train split, used as-is:

Source            Share   Pair type
Uzbek Wikipedia   ~56%    title ↔ paragraph
OPUS-100 uz↔en    ~34%    parallel sentence ↔ translation
Latin↔Cyrillic    ~10%    same sentence, two scripts

No prefixes are applied for this model family.

Limitations

  • Low-resource language: coverage is thinner than for high-resource languages, and rare domains (legal, medical, dialectal) are under-represented.
  • Roughly a third of the training pairs (~34%) come from OPUS-100, which carries machine-translation noise.
  • The monolingual eval is a title→paragraph proxy for retrieval, not a curated query set.
  • Trained mostly on Latin-script Uzbek; Cyrillic appears only via the Latin↔Cyrillic script-pair source (~10% of pairs), so Cyrillic retrieval quality is less thoroughly evaluated.
  • Cross-lingual uz↔en accuracy trails the e5-based model; use uzbek-e5-small for that.

Reproducibility

Fixed seed (42); 1 epoch of MultipleNegativesRankingLoss with in-batch negatives, batch size 192, lr 2e-5, 10% warmup, max_seq_length=192, bf16 on Ampere. All hyperparameters live in config.py; training and evaluation scripts are in the training repository.
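
As a rough illustration of what MultipleNegativesRankingLoss optimizes, the sketch below computes the in-batch-negatives objective in NumPy: each anchor should score its own positive higher than every other positive in the batch, expressed as a cross-entropy over scaled cosine similarities. The function name `mnr_loss` is hypothetical, and `scale=20.0` is an assumption based on the sentence-transformers default; the value actually used lives in config.py.

```python
import numpy as np


def mnr_loss(anchor_emb: np.ndarray, positive_emb: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over in-batch similarities; row i's correct 'class' is column i."""
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    p = positive_emb / np.linalg.norm(positive_emb, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                    # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())


# Toy batches of 4 pairs (assumption: real batches here hold 192 pairs).
aligned = np.eye(4)
loss_aligned = mnr_loss(aligned, aligned)                    # positives match anchors
loss_shuffled = mnr_loss(aligned, np.eye(4)[[1, 0, 3, 2]])   # positives are mispaired
loss_uniform = mnr_loss(np.ones((4, 3)), np.ones((4, 3)))    # no signal: log(batch size)
print(loss_aligned, loss_shuffled, loss_uniform)
```

With batch size 192, each anchor is contrasted against 191 in-batch negatives per step, which is one reason larger batches tend to help this loss.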

License

Apache 2.0, inherited from the base model paraphrase-multilingual-MiniLM-L12-v2.
