---
language:
- uz
- en
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: intfloat/multilingual-e5-small
datasets:
- sukhrobnurali/uzbek-embedding-pairs
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embeddings
- uzbek
- retrieval
- e5
model-index:
- name: uzbek-e5-small
  results:
  - task:
      type: information-retrieval
      name: Monolingual Uzbek retrieval (Wikipedia title to paragraph)
    dataset:
      type: sukhrobnurali/uzbek-embedding-pairs
      name: uzbek-embedding-pairs (wiki_retrieval_eval/test, 5k held-out)
    metrics:
    - type: recall_at_1
      value: 0.9914
      name: Recall@1
    - type: recall_at_5
      value: 0.9964
      name: Recall@5
    - type: recall_at_10
      value: 0.9972
      name: Recall@10
    - type: mrr_at_10
      value: 0.9936
      name: MRR@10
    - type: ndcg_at_10
      value: 0.9945
      name: nDCG@10
  - task:
      type: bitext-mining
      name: Cross-lingual uz-en bitext retrieval (FLORES+ devtest)
    dataset:
      type: openlanguagedata/flores_plus
      name: FLORES+ devtest (uzn_Latn / eng_Latn, 1012 pairs)
    metrics:
    - type: accuracy
      value: 0.9876
      name: Mean accuracy (both directions)
---
# uzbek-e5-small
A small multilingual sentence-embedding model fine-tuned from
[`intfloat/multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small)
for **Uzbek semantic search and retrieval**, including cross-lingual uz↔en.
This is the flagship model of a two-base study. Its base, `multilingual-e5-small`, is
already strong at Uzbek, so fine-tuning yields only a small gain, with no metric
regressing: the honest result for a near-ceiling starting point. The companion
[`sukhrobnurali/uzbek-minilm`](https://huggingface.co/sukhrobnurali/uzbek-minilm) starts
from a much weaker base and shows a much larger delta.
- **Base model:** `intfloat/multilingual-e5-small` (118M params, 384-dim, MIT)
- **Language:** Uzbek (Latin), with English for cross-lingual pairs
- **Training data:** [`sukhrobnurali/uzbek-embedding-pairs`](https://huggingface.co/datasets/sukhrobnurali/uzbek-embedding-pairs)
- **Objective:** `MultipleNegativesRankingLoss`, 1 epoch
- **Training code:** https://github.com/sukhrobnurali/uz-sentance-embedding
## Intended use
- Uzbek semantic search / passage retrieval (RAG over Uzbek documents).
- Cross-lingual uz↔en retrieval and bitext alignment.
- General-purpose Uzbek sentence similarity and clustering.
### Prefixes (important)
Like the base e5 model, this model expects task prefixes:
- Retrieval: prefix queries with `query: ` and documents with `passage: `.
- Symmetric tasks (similarity, bitext): use `query: ` on both sides.
Embeddings are L2-normalized; compare with cosine similarity (dot product on normalized
vectors).
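The prefix rules above can be wrapped in a small helper so that raw text never reaches the encoder unprefixed. This is a minimal sketch: the helper name `with_prefix` and the example sentences are illustrative assumptions, and only the `query: ` and `passage: ` strings come from this model card.

```python
from typing import Iterable, List

# Prefix strings as documented above; the constant names are hypothetical.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(texts: Iterable[str], prefix: str) -> List[str]:
    """Prepend an e5 task prefix, leaving already-prefixed texts untouched."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


# Asymmetric retrieval: queries and documents get different prefixes.
queries = with_prefix(["O'zbekistonning poytaxti qaysi shahar?"], QUERY_PREFIX)
passages = with_prefix(["Toshkent O'zbekistonning poytaxti hisoblanadi."], PASSAGE_PREFIX)

# Symmetric tasks (similarity, bitext): "query: " on both sides.
left = with_prefix(["Bugun havo juda yaxshi."], QUERY_PREFIX)
right = with_prefix(["The weather is very nice today."], QUERY_PREFIX)

print(queries, passages, left, right, sep="\n")
```

The resulting lists can be passed directly to `model.encode(...)`.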
## Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sukhrobnurali/uzbek-e5-small")

# e5-style prefixes: "query: " for queries, "passage: " for documents.
queries = ["query: O'zbekistonning poytaxti qaysi shahar?"]
passages = [
    "passage: Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri.",
    "passage: Samarqand — O'zbekistondagi qadimiy shaharlardan biri.",
]

# L2-normalize so the dot product below equals cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

scores = q_emb @ p_emb.T  # shape: (num_queries, num_passages)
print(scores)  # highest score on the Tashkent passage
```
> Requires `sentence-transformers>=5.5.1` — the version the model was saved with.
> Older versions cannot load it (`ModuleNotFoundError: No module named 'sentence_transformers.base'`).
## Evaluation
The same protocol is applied to the base and fine-tuned models so the delta is a fair
comparison. Both held-out sets are disjoint from training: the retrieval set is the
dataset's `wiki_retrieval_eval/test` split, and training only ever sees the FLORES+ `dev`
split (via the dataset's validation split), so `devtest` stays clean.
### Monolingual Uzbek retrieval — `wiki_retrieval_eval/test` (5,000 title→paragraph)
| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| Recall@1 | 0.9868 | **0.9914** | +0.0046 |
| Recall@5 | 0.9948 | **0.9964** | +0.0016 |
| Recall@10 | 0.9962 | **0.9972** | +0.0010 |
| MRR@10 | 0.9904 | **0.9936** | +0.0032 |
| nDCG@10 | 0.9918 | **0.9945** | +0.0027 |
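For readers who want to reproduce numbers of this kind, the sketch below shows one common way to compute Recall@k, MRR@10, and nDCG@10 when each query has exactly one relevant passage, as in a title→paragraph setup. It is an illustration under that assumption, not the repository's evaluation script; the function names and the toy score matrix are hypothetical.

```python
import math
from typing import List, Sequence


def gold_ranks(scores: Sequence[Sequence[float]], gold: Sequence[int]) -> List[int]:
    """1-based rank of each query's gold passage (ties count against the gold)."""
    ranks = []
    for row, g in zip(scores, gold):
        # Rank = 1 + number of other passages scoring at least as high as the gold one.
        ranks.append(1 + sum(1 for j, s in enumerate(row) if j != g and s >= row[g]))
    return ranks


def recall_at(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose gold passage appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mrr_at(ranks: Sequence[int], k: int = 10) -> float:
    """Mean reciprocal rank, counting 0 when the gold passage is below rank k."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)


def ndcg_at(ranks: Sequence[int], k: int = 10) -> float:
    """nDCG with a single relevant passage: the ideal DCG is 1, so nDCG = 1 / log2(1 + rank)."""
    return sum(1.0 / math.log2(1 + r) if r <= k else 0.0 for r in ranks) / len(ranks)


# Toy similarity matrix: row i = query i, column j = passage j; gold passage of query i is i.
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.5],  # gold passage (index 2) is outranked by passage 0
]
ranks = gold_ranks(scores, gold=[0, 1, 2])
print(ranks)  # [1, 1, 2]
print(recall_at(ranks, 1), mrr_at(ranks), ndcg_at(ranks))
```

With real embeddings, `scores` would be the query-by-passage similarity matrix produced as in the Usage section.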
### Cross-lingual uz↔en bitext — FLORES+ `devtest` (1,012 pairs)
| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| uz→en accuracy | 0.9713 | **0.9901** | +0.0188 |
| en→uz accuracy | 0.9852 | 0.9852 | +0.0000 |
| Mean accuracy | 0.9783 | **0.9876** | +0.0094 |
The base is already near the ceiling for Uzbek, so the gains are small: every metric
improves except en→uz accuracy, which is unchanged, and none regresses. The fine-tuned
model is therefore the one shipped.
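Bitext accuracy of the kind reported above is usually computed by nearest-neighbour matching in both directions over an aligned set of sentence pairs. The sketch below assumes that protocol (the model card does not spell it out); `bitext_accuracy` and the toy matrix are hypothetical.

```python
from typing import Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def bitext_accuracy(sim: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """sim[i][j] = similarity of source sentence i and target sentence j; the gold match is i == j.

    Returns (source→target accuracy, target→source accuracy, mean of the two).
    """
    n = len(sim)
    forward = sum(argmax(sim[i]) == i for i in range(n)) / n
    columns = list(zip(*sim))  # transpose: one tuple per target sentence
    backward = sum(argmax(columns[j]) == j for j in range(n)) / n
    return forward, backward, (forward + backward) / 2


# Toy 3x3 similarity matrix: source sentence 2 is wrongly matched to target 0.
sim = [
    [0.90, 0.20, 0.10],
    [0.30, 0.80, 0.40],
    [0.85, 0.10, 0.70],
]
forward, backward, mean = bitext_accuracy(sim)
print(forward, backward, mean)
```

The two directions can differ, as in the table above, because a row-wise maximum and a column-wise maximum are taken over different candidate sets.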
## Training data
[`sukhrobnurali/uzbek-embedding-pairs`](https://huggingface.co/datasets/sukhrobnurali/uzbek-embedding-pairs)
— 356,278 `(anchor, positive)` pairs in the `default/train` split, used as-is:
| Source | Share | Pair type |
|---|---|---|
| Uzbek Wikipedia | ~56% | title ↔ paragraph |
| OPUS-100 uz↔en | ~34% | parallel sentence ↔ translation |
| Latin↔Cyrillic | ~10% | same sentence, two scripts |
Anchors are prefixed `query: ` and positives `passage: ` at training time, matching the
retrieval framing used at inference.
## Limitations
- Low-resource language: coverage is thinner than for high-resource languages, and rare
domains (legal, medical, dialectal) are under-represented.
- Part of the data is OPUS-100, which carries machine-translation noise.
- The monolingual eval is a title→paragraph proxy for retrieval, not a curated query set.
- Trained mostly on Latin-script Uzbek; Cyrillic appears only via the script-pair source, so
Cyrillic retrieval quality is less thoroughly evaluated.
## Reproducibility
Fixed seed (42); 1 epoch of `MultipleNegativesRankingLoss` with in-batch negatives,
batch size 192, lr 2e-5, 10% warmup, `max_seq_length=192`, bf16 on Ampere. All
hyperparameters live in `config.py`; training and evaluation scripts are in the
[training repository](https://github.com/sukhrobnurali/uz-sentance-embedding).
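To make the objective concrete, the sketch below computes a multiple-negatives ranking loss with in-batch negatives in plain Python: each anchor must pick its own positive out of all positives in the batch. It is an illustration of the idea, not the training code; the cosine scoring and the `scale=20.0` value are assumptions not stated in this model card, and the 2-dimensional vectors are toy data.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors: Sequence[Vector], positives: Sequence[Vector], scale: float = 20.0) -> float:
    """Mean cross-entropy where positives[i] is the only correct match for anchors[i].

    Every other positive in the batch acts as a negative for that anchor.
    `scale` is an assumed temperature-like multiplier on the cosine scores.
    """
    losses: List[float] = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])   # each anchor matches its own positive
shuffled = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # positives swapped
print(aligned, shuffled)  # aligned is near zero, shuffled is large
```

With batch size 192, each anchor is contrasted against 191 in-batch negatives, assuming the whole batch is scored together on one device.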
## License
MIT, inherited from the base model `intfloat/multilingual-e5-small`.