---
language:
- uz
- en
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: intfloat/multilingual-e5-small
datasets:
- sukhrobnurali/uzbek-embedding-pairs
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embeddings
- uzbek
- retrieval
- e5
model-index:
- name: uzbek-e5-small
  results:
  - task:
      type: information-retrieval
      name: Monolingual Uzbek retrieval (Wikipedia title to paragraph)
    dataset:
      type: sukhrobnurali/uzbek-embedding-pairs
      name: uzbek-embedding-pairs (wiki_retrieval_eval/test, 5k held-out)
    metrics:
    - type: recall_at_1
      value: 0.9914
      name: Recall@1
    - type: recall_at_5
      value: 0.9964
      name: Recall@5
    - type: recall_at_10
      value: 0.9972
      name: Recall@10
    - type: mrr_at_10
      value: 0.9936
      name: MRR@10
    - type: ndcg_at_10
      value: 0.9945
      name: nDCG@10
  - task:
      type: bitext-mining
      name: Cross-lingual uz-en bitext retrieval (FLORES+ devtest)
    dataset:
      type: openlanguagedata/flores_plus
      name: FLORES+ devtest (uzn_Latn / eng_Latn, 1012 pairs)
    metrics:
    - type: accuracy
      value: 0.9876
      name: Mean accuracy (both directions)
---

# uzbek-e5-small

A small multilingual sentence-embedding model fine-tuned from
[`intfloat/multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small)
for **Uzbek semantic search and retrieval**, including cross-lingual uz↔en.

This is the flagship model of a two-base study. Its base, `multilingual-e5-small`, is
already strong at Uzbek, so fine-tuning yields a small but consistently positive gain —
the honest result for a near-ceiling starting point. The companion
[`sukhrobnurali/uzbek-minilm`](https://huggingface.co/sukhrobnurali/uzbek-minilm) starts
from a much weaker base and shows the large delta.

- **Base model:** `intfloat/multilingual-e5-small` (118M params, 384-dim, MIT)
- **Language:** Uzbek (Latin), with English for cross-lingual pairs
- **Training data:** [`sukhrobnurali/uzbek-embedding-pairs`](https://huggingface.co/datasets/sukhrobnurali/uzbek-embedding-pairs)
- **Objective:** `MultipleNegativesRankingLoss`, 1 epoch
- **Training code:** https://github.com/sukhrobnurali/uz-sentance-embedding

## Intended use

- Uzbek semantic search / passage retrieval (RAG over Uzbek documents).
- Cross-lingual uz↔en retrieval and bitext alignment.
- General-purpose Uzbek sentence similarity and clustering.

### Prefixes (important)

Like the base e5 model, this model expects task prefixes:

- Retrieval: prefix queries with `query: ` and documents with `passage: `.
- Symmetric tasks (similarity, bitext): use `query: ` on both sides.

Embeddings are L2-normalized; compare with cosine similarity (dot product on normalized
vectors).

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sukhrobnurali/uzbek-e5-small")

queries = ["query: O'zbekistonning poytaxti qaysi shahar?"]
passages = [
    "passage: Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri.",
    "passage: Samarqand — O'zbekistondagi qadimiy shaharlardan biri.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T
print(scores)  # highest score on the Tashkent passage
```

> Requires `sentence-transformers>=5.5.1` — the version the model was saved with.
> Older versions cannot load it (`ModuleNotFoundError: No module named 'sentence_transformers.base'`).

## Evaluation

The same protocol is applied to the base and fine-tuned models so the delta is a fair
comparison. Both held-out sets are disjoint from training: the retrieval split is the
dataset's `wiki_retrieval_eval/test`; FLORES+ training only ever sees `dev` (via the
dataset's validation split), so `devtest` stays clean.

### Monolingual Uzbek retrieval — `wiki_retrieval_eval/test` (5,000 title→paragraph)

| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| Recall@1 | 0.9868 | **0.9914** | +0.0046 |
| Recall@5 | 0.9948 | **0.9964** | +0.0016 |
| Recall@10 | 0.9962 | **0.9972** | +0.0010 |
| MRR@10 | 0.9904 | **0.9936** | +0.0032 |
| nDCG@10 | 0.9918 | **0.9945** | +0.0027 |

### Cross-lingual uz↔en bitext — FLORES+ `devtest` (1,012 pairs)

| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| uz→en accuracy | 0.9713 | **0.9901** | +0.0188 |
| en→uz accuracy | 0.9852 | 0.9852 | +0.0000 |
| Mean accuracy | 0.9783 | **0.9876** | +0.0094 |

The base is already near the ceiling for Uzbek, so the gains are small but every metric
improves — the fine-tuned model dominates the base and is the one shipped.

## Training data

[`sukhrobnurali/uzbek-embedding-pairs`](https://huggingface.co/datasets/sukhrobnurali/uzbek-embedding-pairs)
— 356,278 `(anchor, positive)` pairs in the `default/train` split, used as-is:

| Source | Share | Pair type |
|---|---|---|
| Uzbek Wikipedia | ~56% | title ↔ paragraph |
| OPUS-100 uz↔en | ~34% | parallel sentence ↔ translation |
| Latin↔Cyrillic | ~10% | same sentence, two scripts |

Anchors are prefixed `query: ` and positives `passage: ` at training time, matching the
retrieval framing used at inference.

## Limitations

- Low-resource language: coverage is thinner than for high-resource languages, and rare
  domains (legal, medical, dialectal) are under-represented.
- Part of the data is OPUS-100, which carries machine-translation noise.
- The monolingual eval is a title→paragraph proxy for retrieval, not a curated query set.
- Trained on Latin-script Uzbek; Cyrillic appears only via the script-pair source, so
  Cyrillic retrieval quality is less thoroughly evaluated.

## Reproducibility

Fixed seed (42); 1 epoch of `MultipleNegativesRankingLoss` with in-batch negatives,
batch size 192, lr 2e-5, 10% warmup, `max_seq_length=192`, bf16 on Ampere. All
hyperparameters live in `config.py`; training and evaluation scripts are in the
[training repository](https://github.com/sukhrobnurali/uz-sentance-embedding).

## License

MIT, inherited from the base model `intfloat/multilingual-e5-small`.