---
language:
- uz
- en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
datasets:
- sukhrobnurali/uzbek-embedding-pairs
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embeddings
- uzbek
- retrieval
- minilm
model-index:
- name: uzbek-minilm
  results:
  - task:
      type: information-retrieval
      name: Monolingual Uzbek retrieval (Wikipedia title to paragraph)
    dataset:
      type: sukhrobnurali/uzbek-embedding-pairs
      name: uzbek-embedding-pairs (wiki_retrieval_eval/test, 5k held-out)
    metrics:
    - type: recall_at_1
      value: 0.9692
      name: Recall@1
    - type: recall_at_5
      value: 0.9854
      name: Recall@5
    - type: recall_at_10
      value: 0.9892
      name: Recall@10
    - type: mrr_at_10
      value: 0.9765
      name: MRR@10
    - type: ndcg_at_10
      value: 0.9796
      name: nDCG@10
  - task:
      type: bitext-mining
      name: Cross-lingual uz-en bitext retrieval (FLORES+ devtest)
    dataset:
      type: openlanguagedata/flores_plus
      name: FLORES+ devtest (uzn_Latn / eng_Latn, 1012 pairs)
    metrics:
    - type: accuracy
      value: 0.8498
      name: Mean accuracy (both directions)
---

# uzbek-minilm

A multilingual sentence-embedding model fine-tuned from [`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) for **Uzbek semantic search and retrieval**, including cross-lingual uz↔en.

This model demonstrates the large gain achievable when the base model is weak at the target language. The base MiniLM handles Uzbek poorly (Recall@1 = 0.26); after one epoch on Uzbek pairs it reaches Recall@1 = 0.97. The companion flagship [`sukhrobnurali/uzbek-e5-small`](https://huggingface.co/sukhrobnurali/uzbek-e5-small) starts from a stronger base and is the recommended model for cross-lingual work.

- **Base model:** `paraphrase-multilingual-MiniLM-L12-v2` (118M params, 384-dim, Apache 2.0)
- **Language:** Uzbek (Latin), with English for cross-lingual pairs
- **Training data:** [`sukhrobnurali/uzbek-embedding-pairs`](https://huggingface.co/datasets/sukhrobnurali/uzbek-embedding-pairs)
- **Objective:** `MultipleNegativesRankingLoss`, 1 epoch
- **Training code:** https://github.com/sukhrobnurali/uz-sentance-embedding

## Intended use

- Uzbek semantic search / passage retrieval (RAG over Uzbek documents).
- Cross-lingual uz↔en retrieval and bitext alignment.
- General-purpose Uzbek sentence similarity and clustering.

Unlike e5, this model uses **no input prefixes**: encode queries and documents as plain text. Embeddings are L2-normalized; compare with cosine similarity.

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sukhrobnurali/uzbek-minilm")

# Plain text in, no "query:" / "passage:" prefixes.
query = ["O'zbekistonning poytaxti qaysi shahar?"]
passages = [
    "Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri.",
    "Samarqand — O'zbekistondagi qadimiy shaharlardan biri.",
]

# normalize_embeddings=True returns unit-length vectors,
# so the dot product below is the cosine similarity.
q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

scores = q_emb @ p_emb.T  # shape: (1 query, 2 passages)
print(scores)  # highest score on the Tashkent passage
```

> Requires `sentence-transformers>=5.5.1`, the version the model was saved with.
> Older versions cannot load it (`ModuleNotFoundError: No module named 'sentence_transformers.base'`).

## Evaluation

The same protocol is applied to the base and fine-tuned models, so the delta is a fair comparison. Both held-out sets are disjoint from training: the retrieval set is the dataset's `wiki_retrieval_eval/test` split, and for FLORES+, training only ever touches `dev` (through the dataset's validation split), so `devtest` stays clean.

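For readers who want to see how the rank-based metrics in this section are defined, here is a minimal NumPy sketch. It is not the evaluation script from the training repository. The function name `retrieval_metrics` and the toy vectors are hypothetical, and the sketch assumes exactly one relevant passage per query, which matches the title→paragraph setup. Under that assumption nDCG@k reduces to `1 / log2(1 + rank)` for a gold passage ranked within the top k.

```python
import numpy as np


def retrieval_metrics(q_emb, p_emb, gold, ks=(1, 5, 10), cutoff=10):
    """Recall@k, MRR@cutoff and nDCG@cutoff with one relevant passage per query.

    q_emb: (n_queries, dim) L2-normalized query embeddings
    p_emb: (n_passages, dim) L2-normalized passage embeddings
    gold:  (n_queries,) index of the relevant passage for each query
    """
    gold = np.asarray(gold)
    scores = q_emb @ p_emb.T  # cosine similarity for unit-length rows
    gold_scores = scores[np.arange(len(gold)), gold]
    # 1-based rank of the gold passage: one plus the number of passages that
    # score strictly higher (ties are resolved in the gold passage's favour).
    ranks = 1 + (scores > gold_scores[:, None]).sum(axis=1)

    in_top = ranks <= cutoff
    metrics = {f"recall@{k}": float((ranks <= k).mean()) for k in ks}
    metrics[f"mrr@{cutoff}"] = float(np.where(in_top, 1.0 / ranks, 0.0).mean())
    metrics[f"ndcg@{cutoff}"] = float(
        np.where(in_top, 1.0 / np.log2(1.0 + ranks), 0.0).mean()
    )
    return metrics


def _unit(rows):
    """L2-normalize each row."""
    rows = np.asarray(rows, dtype=float)
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)


# Toy data (hypothetical): 4 passages, 3 queries whose gold passages are 0, 1, 2.
toy_passages = np.eye(4)
toy_queries = _unit([[1.0, 0.1, 0.0, 0.0],   # gold passage 0 ranks 1st
                     [0.9, 1.0, 0.0, 0.0],   # gold passage 1 ranks 1st
                     [1.0, 0.5, 0.2, 0.0]])  # gold passage 2 ranks 3rd
toy = retrieval_metrics(toy_queries, toy_passages, gold=[0, 1, 2])
print(toy)
```

Bitext accuracy on aligned sentence pairs is the same computation with `gold = np.arange(n)` and k = 1, run once per direction (uz→en and en→uz) and averaged for the mean.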
### Monolingual Uzbek retrieval: `wiki_retrieval_eval/test` (5,000 title→paragraph)

| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| Recall@1 | 0.2564 | **0.9692** | +0.7128 |
| Recall@5 | 0.3742 | **0.9854** | +0.6112 |
| Recall@10 | 0.4314 | **0.9892** | +0.5578 |
| MRR@10 | 0.3072 | **0.9765** | +0.6693 |
| nDCG@10 | 0.3367 | **0.9796** | +0.6429 |

### Cross-lingual uz↔en bitext: FLORES+ `devtest` (1,012 pairs)

| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| uz→en accuracy | 0.4575 | **0.8745** | +0.4170 |
| en→uz accuracy | 0.4872 | **0.8251** | +0.3379 |
| Mean accuracy | 0.4723 | **0.8498** | +0.3775 |

Fine-tuning nearly closes the monolingual gap: the fine-tuned MiniLM (Recall@1 = 0.969) almost matches the e5 *baseline* (0.987). On cross-lingual FLORES+ it still trails the e5 family (mean accuracy 0.85 vs 0.99), so prefer `uzbek-e5-small` when uz↔en accuracy matters most.

## Training data

The model is trained on [`sukhrobnurali/uzbek-embedding-pairs`](https://huggingface.co/datasets/sukhrobnurali/uzbek-embedding-pairs), using the 356,278 `(anchor, positive)` pairs of the `default/train` split as-is:

| Source | Share | Pair type |
|---|---|---|
| Uzbek Wikipedia | ~56% | title ↔ paragraph |
| OPUS-100 uz↔en | ~34% | parallel sentence ↔ translation |
| Latin↔Cyrillic | ~10% | same sentence, two scripts |

No prefixes are applied for this model family.

## Limitations

- Low-resource language: coverage is thinner than for high-resource languages, and rare domains (legal, medical, dialectal) are under-represented.
- Part of the data is OPUS-100, which carries machine-translation noise.
- The monolingual eval is a title→paragraph proxy for retrieval, not a curated query set.
- Trained on Latin-script Uzbek; Cyrillic appears only via the script-pair source, so Cyrillic retrieval quality is less thoroughly evaluated.
- Cross-lingual uz↔en accuracy trails the e5-based model; use `uzbek-e5-small` for that.

## Reproducibility

Fixed seed (42); 1 epoch of `MultipleNegativesRankingLoss` with in-batch negatives, batch size 192, lr 2e-5, 10% warmup, `max_seq_length=192`, bf16 on Ampere. All hyperparameters live in `config.py`; training and evaluation scripts are in the [training repository](https://github.com/sukhrobnurali/uz-sentance-embedding).

## License

Apache 2.0, inherited from the base model `paraphrase-multilingual-MiniLM-L12-v2`.