uzbek-minilm

A multilingual sentence-embedding model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 for Uzbek semantic search and retrieval, including cross-lingual uz↔en.

This model shows how much fine-tuning can gain when the base model is weak in the target language. The base MiniLM handles Uzbek poorly (Recall@1 = 0.26); after one epoch on Uzbek pairs it reaches Recall@1 = 0.97. The companion flagship sukhrobnurali/uzbek-e5-small starts from a stronger base and is the recommended model for cross-lingual work.

Base model: paraphrase-multilingual-MiniLM-L12-v2 (118M params, 384-dim, Apache 2.0)
Language: Uzbek (Latin), with English for cross-lingual pairs
Training data: sukhrobnurali/uzbek-embedding-pairs
Objective: MultipleNegativesRankingLoss, 1 epoch
Training code: https://github.com/sukhrobnurali/uz-sentance-embedding

Intended use

Uzbek semantic search / passage retrieval (RAG over Uzbek documents).
Cross-lingual uz↔en retrieval and bitext alignment.
General-purpose Uzbek sentence similarity and clustering.

Unlike e5, this model uses no input prefixes — encode queries and documents as plain text. Embeddings are L2-normalized; compare with cosine similarity.
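
As a minimal sketch of why cosine similarity reduces to a dot product on L2-normalized vectors, the snippet below ranks passages for a query using NumPy only. The helper names (l2_normalize, top_k) and the 3-dimensional toy vectors are illustrative assumptions, not part of this model or its training code; real embeddings from this model are 384-dimensional.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (eps guards against all-zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k(query_emb, passage_emb, k: int = 3):
    """Return (indices, scores) of the k most similar passages for each query."""
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    p = l2_normalize(np.asarray(passage_emb, dtype=np.float64))
    scores = q @ p.T  # equals cosine similarity because every row has unit length
    k = min(k, scores.shape[1])
    idx = np.argsort(-scores, axis=1)[:, :k]  # highest score first
    return idx, np.take_along_axis(scores, idx, axis=1)


# Toy vectors standing in for real embeddings (hypothetical values).
toy_query = np.array([[1.0, 0.0, 0.0]])
toy_passages = np.array([[2.0, 0.1, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]])
top_idx, top_scores = top_k(toy_query, toy_passages, k=2)
print(top_idx, top_scores)
```

If the embeddings are already normalized, the normalization step is a no-op and the matrix product alone is enough.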

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sukhrobnurali/uzbek-minilm")

query = ["O'zbekistonning poytaxti qaysi shahar?"]
passages = [
    "Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri.",
    "Samarqand — O'zbekistondagi qadimiy shaharlardan biri.",
]

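# normalize_embeddings=True returns unit-length vectors,
# so the dot product below is the cosine similarity.
q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T  # shape: (num_queries, num_passages)
print(scores)  # highest score on the Tashkent passage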

Requires sentence-transformers>=5.5.1 — the version the model was saved with. Older versions cannot load it (ModuleNotFoundError: No module named 'sentence_transformers.base').
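
One way to fail early with a readable message, rather than hitting the ModuleNotFoundError at load time, is to compare the installed version against the stated minimum before loading. The sketch below uses only the standard library; the helper names (release_tuple, meets_minimum, check_installed) are hypothetical, and the parser is deliberately simplified: it reads only the leading numeric components, so a pre-release such as 5.5.1rc1 is conservatively treated as older than 5.5.1.

```python
from importlib import metadata

MIN_VERSION = "5.5.1"  # minimum version stated in this card


def release_tuple(version: str) -> tuple:
    """Parse leading numeric components, e.g. '5.5.1' -> (5, 5, 1).

    Simplified on purpose: parsing stops at the first non-numeric piece.
    """
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, required: str = MIN_VERSION) -> bool:
    """True if the installed release is at least the required release."""
    return release_tuple(installed) >= release_tuple(required)


def check_installed(package: str = "sentence-transformers") -> str:
    """Return a short status message for the installed package."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    if meets_minimum(installed):
        return f"{package} {installed} satisfies >={MIN_VERSION}"
    return f"{package} {installed} is older than {MIN_VERSION}; upgrade before loading"


print(check_installed())
```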

Evaluation

The same protocol is applied to the base and fine-tuned models, so the delta is a fair comparison. Both held-out sets are disjoint from training. The retrieval set is the dataset's wiki_retrieval_eval/test split. For FLORES+, training only ever sees the dev split (via the dataset's validation split), so devtest stays clean.

Monolingual Uzbek retrieval — `wiki_retrieval_eval/test` (5,000 title→paragraph)

Metric	Base	Fine-tuned	Δ
Recall@1	0.2564	0.9692	+0.7128
Recall@5	0.3742	0.9854	+0.6112
Recall@10	0.4314	0.9892	+0.5578
MRR@10	0.3072	0.9765	+0.6693
nDCG@10	0.3367	0.9796	+0.6429
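
The metrics in this table can be computed from a query-by-passage similarity matrix. The sketch below is not the repository's evaluation script; it assumes each query has exactly one relevant passage, stored at the same index (which fits a title→paragraph setup), and it resolves ties in favor of the gold passage. The function name retrieval_metrics and the toy scores are illustrative.

```python
import numpy as np


def retrieval_metrics(scores: np.ndarray, ks=(1, 5, 10), cutoff: int = 10) -> dict:
    """Recall@k, MRR@cutoff and nDCG@cutoff with one relevant passage per query.

    scores[i, j] is the similarity of query i to passage j, and passage i is
    assumed to be the only relevant passage for query i.
    """
    n = scores.shape[0]
    gold = scores[np.arange(n), np.arange(n)][:, None]
    # 1-based rank of the gold passage: one plus the count of strictly higher scores.
    ranks = 1 + (scores > gold).sum(axis=1)
    out = {f"recall@{k}": float((ranks <= k).mean()) for k in ks}
    hit = ranks <= cutoff
    out[f"mrr@{cutoff}"] = float(np.where(hit, 1.0 / ranks, 0.0).mean())
    # With a single relevant item the ideal DCG is 1, so nDCG = 1 / log2(rank + 1).
    out[f"ndcg@{cutoff}"] = float(np.where(hit, 1.0 / np.log2(ranks + 1), 0.0).mean())
    return out


# Toy 3x3 similarity matrix: query 1 ranks its gold passage second.
toy_scores = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.3],
    [0.1, 0.2, 0.7],
])
metrics = retrieval_metrics(toy_scores)
print(metrics)
```

Under the single-relevant-passage assumption, nDCG@10 is always at least MRR@10, which matches the ordering of the two values reported above.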

Cross-lingual uz↔en bitext — FLORES+ `devtest` (1,012 pairs)

Metric	Base	Fine-tuned	Δ
uz→en accuracy	0.4575	0.8745	+0.4170
en→uz accuracy	0.4872	0.8251	+0.3379
Mean accuracy	0.4723	0.8498	+0.3775
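
Bitext accuracy of this kind is usually computed by nearest-neighbor matching: for each sentence on one side, pick the most similar sentence on the other side and count it correct when it is the aligned translation. The sketch below assumes row i of each matrix holds the embedding of the i-th sentence pair; the function name bitext_accuracy and the 2-dimensional toy vectors are illustrative, and the actual evaluation script may differ in detail.

```python
import numpy as np


def bitext_accuracy(uz_emb: np.ndarray, en_emb: np.ndarray) -> dict:
    """Nearest-neighbor bitext accuracy in both directions.

    Row i of uz_emb and row i of en_emb are assumed to be translations.
    """
    uz = uz_emb / np.linalg.norm(uz_emb, axis=1, keepdims=True)
    en = en_emb / np.linalg.norm(en_emb, axis=1, keepdims=True)
    sim = uz @ en.T  # sim[i, j] = cosine(uz sentence i, en sentence j)
    gold = np.arange(sim.shape[0])
    uz_to_en = float((sim.argmax(axis=1) == gold).mean())  # best en for each uz
    en_to_uz = float((sim.argmax(axis=0) == gold).mean())  # best uz for each en
    return {"uz_to_en": uz_to_en, "en_to_uz": en_to_uz, "mean": (uz_to_en + en_to_uz) / 2}


# Toy vectors (hypothetical): the third English sentence is closer to the
# first Uzbek sentence than to its own translation.
toy_uz = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_en = np.array([[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]])
acc = bitext_accuracy(toy_uz, toy_en)
print(acc)
```

The two directions can differ, as they do in the table, because the row-wise and column-wise argmax of the same similarity matrix need not agree.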

Fine-tuning nearly closes the monolingual gap: the fine-tuned MiniLM (Recall@1 = 0.969) almost matches the e5 baseline (0.987). On cross-lingual FLORES+ it still trails the e5 family (0.85 vs 0.99), so prefer uzbek-e5-small when uz↔en accuracy matters most.

Training data

sukhrobnurali/uzbek-embedding-pairs — 356,278 (anchor, positive) pairs in the default/train split, used as-is:

Source	Share	Pair type
Uzbek Wikipedia	~56%	title ↔ paragraph
OPUS-100 uz↔en	~34%	parallel sentence ↔ translation
Latin↔Cyrillic	~10%	same sentence, two scripts

No prefixes are applied for this model family.

Limitations

Low-resource language: coverage is thinner than for high-resource languages, and rare domains (legal, medical, dialectal) are under-represented.
Part of the data is OPUS-100, which carries machine-translation noise.
The monolingual eval is a title→paragraph proxy for retrieval, not a curated query set.
Trained on Latin-script Uzbek; Cyrillic appears only via the script-pair source, so Cyrillic retrieval quality is less thoroughly evaluated.
Cross-lingual uz↔en accuracy trails the e5-based model; use uzbek-e5-small for that.

Reproducibility

Fixed seed (42); 1 epoch of MultipleNegativesRankingLoss with in-batch negatives, batch size 192, lr 2e-5, 10% warmup, max_seq_length=192, bf16 on Ampere. All hyperparameters live in config.py; training and evaluation scripts are in the training repository.
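
To make the objective concrete, here is a NumPy sketch of the idea behind MultipleNegativesRankingLoss with in-batch negatives: each anchor must pick its own positive out of all positives in the batch, scored by scaled cosine similarity and trained with cross-entropy. This is an illustration of the one-directional form, not the training code. The function name mnrl_loss is hypothetical, and scale=20.0 is an assumption, since the card does not state the scale used.

```python
import numpy as np


def mnrl_loss(anchor_emb: np.ndarray, positive_emb: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over in-batch candidates; positive i is the target for anchor i."""
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    p = positive_emb / np.linalg.norm(positive_emb, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability each anchor assigns to its own positive.
    return float(-np.mean(np.diag(log_probs)))


# Toy batch of 4 pairs: perfectly aligned vs. deliberately misaligned positives.
anchors = np.eye(4)
loss_aligned = mnrl_loss(anchors, np.eye(4))
loss_shuffled = mnrl_loss(anchors, np.roll(np.eye(4), 1, axis=0))
print(loss_aligned, loss_shuffled)
```

Every other positive in the batch acts as a negative, so a batch of 192 gives each anchor 191 negatives per step at no extra encoding cost.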

License

Apache 2.0, inherited from the base model paraphrase-multilingual-MiniLM-L12-v2.
