uzbek-minilm
A multilingual sentence-embedding model fine-tuned from
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
for Uzbek semantic search and retrieval, including cross-lingual uz↔en.
This model demonstrates the large gain achievable when the base model is weak at the
target language. The base MiniLM handles Uzbek poorly (Recall@1 = 0.26); after one epoch
on Uzbek pairs it reaches Recall@1 = 0.97. The companion flagship
sukhrobnurali/uzbek-e5-small
starts from a stronger base and is the recommended model for cross-lingual work.
- Base model: paraphrase-multilingual-MiniLM-L12-v2 (118M params, 384-dim, Apache 2.0)
- Language: Uzbek (Latin), with English for cross-lingual pairs
- Training data: sukhrobnurali/uzbek-embedding-pairs
- Objective: MultipleNegativesRankingLoss, 1 epoch
- Training code: https://github.com/sukhrobnurali/uz-sentance-embedding
Intended use
- Uzbek semantic search / passage retrieval (RAG over Uzbek documents).
- Cross-lingual uz↔en retrieval and bitext alignment.
- General-purpose Uzbek sentence similarity and clustering.
Unlike e5, this model uses no input prefixes — encode queries and documents as plain text. Embeddings are L2-normalized; compare with cosine similarity.
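Because the embeddings are unit-length, cosine similarity reduces to a plain dot product. The minimal sketch below illustrates this with hand-written toy vectors rather than real model outputs; `l2_normalize`, `dot`, and `cosine` are hypothetical helper names introduced only for this illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for embeddings (assumption: not real model outputs).
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

u_hat, v_hat = l2_normalize(u), l2_normalize(v)

# After normalization, the dot product and the cosine similarity coincide.
dot_score = dot(u_hat, v_hat)
cos_score = cosine(u, v)
print(abs(dot_score - cos_score) < 1e-9)
```

In practice this means a dot product over stored embeddings is enough for ranking, with no extra normalization step at query time.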
Usage
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sukhrobnurali/uzbek-minilm")

# No input prefixes: queries and passages are encoded as plain text.
# Query: "Which city is the capital of Uzbekistan?"
query = ["O'zbekistonning poytaxti qaysi shahar?"]
passages = [
    "Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri.",
    "Samarqand — O'zbekistondagi qadimiy shaharlardan biri.",
]

# Unit-length embeddings make the dot product equal to cosine similarity.
q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

scores = q_emb @ p_emb.T  # shape (1, 2): one query against two passages
print(scores)  # highest score on the Tashkent passage
Requires sentence-transformers>=5.5.1, the version the model was saved with. Older versions cannot load it (ModuleNotFoundError: No module named 'sentence_transformers.base').
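A quick way to check the installed version before loading the model is sketched below. It uses only the standard library; `parse_version` and `meets_minimum` are hypothetical helpers, and the comparison is a simplification that looks only at the numeric parts of the version string (pre-release suffixes are ignored).

```python
from importlib import metadata

MINIMUM = "5.5.1"  # minimum version stated in this model card


def parse_version(text):
    """Turn '5.5.1' into (5, 5, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum=MINIMUM):
    """True if the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = metadata.version("sentence-transformers")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("sentence-transformers is not installed")
elif meets_minimum(installed):
    print(f"sentence-transformers {installed} is new enough")
else:
    print(f"sentence-transformers {installed} is too old; need >= {MINIMUM}")
```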
Evaluation
The same protocol is applied to the base and fine-tuned models, so the delta is a fair
comparison. Both held-out sets are disjoint from training: the retrieval set is the
dataset's wiki_retrieval_eval/test split, and for FLORES+ training only ever sees dev
(via the dataset's validation split), so devtest stays clean.
Monolingual Uzbek retrieval — wiki_retrieval_eval/test (5,000 title→paragraph)
| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| Recall@1 | 0.2564 | 0.9692 | +0.7128 |
| Recall@5 | 0.3742 | 0.9854 | +0.6112 |
| Recall@10 | 0.4314 | 0.9892 | +0.5578 |
| MRR@10 | 0.3072 | 0.9765 | +0.6693 |
| nDCG@10 | 0.3367 | 0.9796 | +0.6429 |
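For readers unfamiliar with these metrics, the sketch below shows how they can be computed from the rank of the gold paragraph for each title. It assumes exactly one relevant paragraph per query (so the ideal DCG is 1) and uses made-up toy ranks; `retrieval_metrics` is a hypothetical helper, and the evaluation scripts in the training repository may be implemented differently.

```python
import math


def retrieval_metrics(ranks, ks=(1, 5, 10), cutoff=10):
    """Compute Recall@k, MRR@cutoff and nDCG@cutoff from 1-based gold ranks.

    Assumes a single relevant document per query.
    """
    n = len(ranks)
    out = {}
    for k in ks:
        # Recall@k: share of queries whose gold item is in the top k.
        out[f"recall@{k}"] = sum(r <= k for r in ranks) / n
    # MRR: reciprocal rank, counted as 0 when the gold item is below the cutoff.
    out[f"mrr@{cutoff}"] = sum(1.0 / r for r in ranks if r <= cutoff) / n
    # nDCG with one relevant item: gain 1 discounted by log2(rank + 1).
    out[f"ndcg@{cutoff}"] = (
        sum(1.0 / math.log2(r + 1) for r in ranks if r <= cutoff) / n
    )
    return out


# Toy ranks for five queries (assumption: illustrative values, not real results).
toy_ranks = [1, 1, 2, 7, 15]
metrics = retrieval_metrics(toy_ranks)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```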
Cross-lingual uz↔en bitext — FLORES+ devtest (1,012 pairs)
| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| uz→en accuracy | 0.4575 | 0.8745 | +0.4170 |
| en→uz accuracy | 0.4872 | 0.8251 | +0.3379 |
| Mean accuracy | 0.4723 | 0.8498 | +0.3775 |
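Bitext accuracy of this kind is commonly computed as nearest-neighbour matching: sentence i in one language counts as correct when its highest-scoring sentence in the other language is also sentence i. The sketch below assumes that definition and uses toy two-dimensional vectors in place of real embeddings; `bitext_accuracy` is a hypothetical helper, not the author's evaluation code.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bitext_accuracy(src, tgt):
    """Fraction of source rows whose best-scoring target row has the same index."""
    hits = 0
    for i, s in enumerate(src):
        scores = [dot(s, t) for t in tgt]
        best = max(range(len(tgt)), key=scores.__getitem__)
        if best == i:
            hits += 1
    return hits / len(src)


# Toy "embeddings" for three aligned sentence pairs (assumption: illustrative only).
uz = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
en = [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

uz_to_en = bitext_accuracy(uz, en)
en_to_uz = bitext_accuracy(en, uz)
mean_acc = (uz_to_en + en_to_uz) / 2
print(uz_to_en, en_to_uz, mean_acc)
```

The two directions are scored separately because the nearest neighbour relation is not symmetric, which is why the table reports both alongside their mean.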
Fine-tuning nearly closes the monolingual gap — the fine-tuned MiniLM (R@1 = 0.969) almost
matches the e5 baseline (0.987). On cross-lingual FLORES it still trails the e5 family
(0.85 vs 0.99), so prefer uzbek-e5-small when uz↔en accuracy matters most.
Training data
sukhrobnurali/uzbek-embedding-pairs contains 356,278 (anchor, positive) pairs in the default/train split, used as-is:
| Source | Share | Pair type |
|---|---|---|
| Uzbek Wikipedia | ~56% | title ↔ paragraph |
| OPUS-100 uz↔en | ~34% | parallel sentence ↔ translation |
| Latin↔Cyrillic | ~10% | same sentence, two scripts |
No prefixes are applied for this model family.
Limitations
- Low-resource language: coverage is thinner than for high-resource languages, and rare domains (legal, medical, dialectal) are under-represented.
- Part of the data is OPUS-100, which carries machine-translation noise.
- The monolingual eval is a title→paragraph proxy for retrieval, not a curated query set.
- Trained on Latin-script Uzbek; Cyrillic appears only via the script-pair source, so Cyrillic retrieval quality is less thoroughly evaluated.
- Cross-lingual uz↔en accuracy trails the e5-based model; use uzbek-e5-small for that.
Reproducibility
Fixed seed (42); 1 epoch of MultipleNegativesRankingLoss with in-batch negatives,
batch size 192, lr 2e-5, 10% warmup, max_seq_length=192, bf16 on Ampere. All
hyperparameters live in config.py; training and evaluation scripts are in the
training repository.
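To make the objective concrete, here is a minimal pure-Python sketch of a multiple-negatives ranking loss with in-batch negatives: each anchor is scored against every positive in the batch, and cross-entropy pushes the matching pair to the top. It is an illustration rather than the library's implementation; `mnr_loss` is a hypothetical name, the vectors are toy values, and the scale of 20 is an assumption about the setting used in this training run.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch serves as a negative."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, p) for p in positives]
        # Numerically stable log-sum-exp over the row of logits.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy batch of two pairs (assumption: illustrative vectors, not real embeddings).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]
swapped = [aligned[1], aligned[0]]  # deliberately mismatched positives

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

A larger batch supplies more in-batch negatives per anchor, which is one reason the batch size is worth reporting alongside the loss.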
License
Apache 2.0, inherited from the base model paraphrase-multilingual-MiniLM-L12-v2.
Evaluation results
- Recall@1 on uzbek-embedding-pairs (wiki_retrieval_eval/test, 5k held-out), self-reported: 0.969
- Recall@5 on uzbek-embedding-pairs (wiki_retrieval_eval/test, 5k held-out), self-reported: 0.985
- Recall@10 on uzbek-embedding-pairs (wiki_retrieval_eval/test, 5k held-out), self-reported: 0.989
- MRR@10 on uzbek-embedding-pairs (wiki_retrieval_eval/test, 5k held-out), self-reported: 0.977
- nDCG@10 on uzbek-embedding-pairs (wiki_retrieval_eval/test, 5k held-out), self-reported: 0.980
- Mean accuracy (both directions) on FLORES+ devtest (uzn_Latn / eng_Latn, 1012 pairs), self-reported: 0.850