File size: 6,192 Bytes

611e9ee
 
 
 
c77a14b
 
 
 
 
 
611e9ee
 
 
 
c77a14b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
611e9ee
 
c77a14b
611e9ee
c77a14b
 
 
611e9ee
c77a14b
 
 
 
 
611e9ee
c77a14b
 
 
 
 
611e9ee
c77a14b
611e9ee
c77a14b
 
 
611e9ee
c77a14b
611e9ee
c77a14b
611e9ee
c77a14b
 
611e9ee
c77a14b
 
611e9ee
c77a14b
611e9ee
 
 
 
 
c77a14b
 
 
 
 
611e9ee
 
c77a14b
 
 
 
611e9ee
 
f3a557b
 
 
c77a14b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
611e9ee
c77a14b
 
 
 
 
 
611e9ee
c77a14b
611e9ee
c77a14b
 
 
 
611e9ee
c77a14b
611e9ee
c77a14b

---
language:
- uz
- en
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: intfloat/multilingual-e5-small
datasets:
- sukhrobnurali/uzbek-embedding-pairs
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embeddings
- uzbek
- retrieval
- e5
model-index:
- name: uzbek-e5-small
  results:
  - task:
      type: information-retrieval
      name: Monolingual Uzbek retrieval (Wikipedia title to paragraph)
    dataset:
      type: sukhrobnurali/uzbek-embedding-pairs
      name: uzbek-embedding-pairs (wiki_retrieval_eval/test, 5k held-out)
    metrics:
    - type: recall_at_1
      value: 0.9914
      name: Recall@1
    - type: recall_at_5
      value: 0.9964
      name: Recall@5
    - type: recall_at_10
      value: 0.9972
      name: Recall@10
    - type: mrr_at_10
      value: 0.9936
      name: MRR@10
    - type: ndcg_at_10
      value: 0.9945
      name: nDCG@10
  - task:
      type: bitext-mining
      name: Cross-lingual uz-en bitext retrieval (FLORES+ devtest)
    dataset:
      type: openlanguagedata/flores_plus
      name: FLORES+ devtest (uzn_Latn / eng_Latn, 1012 pairs)
    metrics:
    - type: accuracy
      value: 0.9876
      name: Mean accuracy (both directions)
---

# uzbek-e5-small

A small multilingual sentence-embedding model fine-tuned from
[`intfloat/multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small)
for **Uzbek semantic search and retrieval**, including cross-lingual uz↔en.

This is the flagship model of a two-base study. Its base, `multilingual-e5-small`, is
already strong at Uzbek, so fine-tuning yields a small but consistently positive gain —
the honest result for a near-ceiling starting point. The companion
[`sukhrobnurali/uzbek-minilm`](https://huggingface.co/sukhrobnurali/uzbek-minilm) starts
from a much weaker base and shows the large delta.

- **Base model:** `intfloat/multilingual-e5-small` (118M params, 384-dim, MIT)
- **Language:** Uzbek (Latin), with English for cross-lingual pairs
- **Training data:** [`sukhrobnurali/uzbek-embedding-pairs`](https://huggingface.co/datasets/sukhrobnurali/uzbek-embedding-pairs)
- **Objective:** `MultipleNegativesRankingLoss`, 1 epoch
- **Training code:** https://github.com/sukhrobnurali/uz-sentance-embedding

## Intended use

- Uzbek semantic search / passage retrieval (RAG over Uzbek documents).
- Cross-lingual uz↔en retrieval and bitext alignment.
- General-purpose Uzbek sentence similarity and clustering.

### Prefixes (important)

Like the base e5 model, this model expects task prefixes:

- Retrieval: prefix queries with `query: ` and documents with `passage: `.
- Symmetric tasks (similarity, bitext): use `query: ` on both sides.

Embeddings are L2-normalized; compare with cosine similarity (dot product on normalized
vectors).

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sukhrobnurali/uzbek-e5-small")

queries = ["query: O'zbekistonning poytaxti qaysi shahar?"]
passages = [
    "passage: Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri.",
    "passage: Samarqand — O'zbekistondagi qadimiy shaharlardan biri.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T
print(scores)  # highest score on the Tashkent passage
```

> Requires `sentence-transformers>=5.5.1` — the version the model was saved with.
> Older versions cannot load it (`ModuleNotFoundError: No module named 'sentence_transformers.base'`).

## Evaluation

The same protocol is applied to the base and fine-tuned models so the delta is a fair
comparison. Both held-out sets are disjoint from training: the retrieval split is the
dataset's `wiki_retrieval_eval/test`; FLORES+ training only ever sees `dev` (via the
dataset's validation split), so `devtest` stays clean.

### Monolingual Uzbek retrieval — `wiki_retrieval_eval/test` (5,000 title→paragraph)

| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| Recall@1 | 0.9868 | **0.9914** | +0.0046 |
| Recall@5 | 0.9948 | **0.9964** | +0.0016 |
| Recall@10 | 0.9962 | **0.9972** | +0.0010 |
| MRR@10 | 0.9904 | **0.9936** | +0.0032 |
| nDCG@10 | 0.9918 | **0.9945** | +0.0027 |

### Cross-lingual uz↔en bitext — FLORES+ `devtest` (1,012 pairs)

| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| uz→en accuracy | 0.9713 | **0.9901** | +0.0188 |
| en→uz accuracy | 0.9852 | 0.9852 | +0.0000 |
| Mean accuracy | 0.9783 | **0.9876** | +0.0094 |

The base is already near the ceiling for Uzbek, so the gains are small but every metric
improves — the fine-tuned model dominates the base and is the one shipped.

## Training data

[`sukhrobnurali/uzbek-embedding-pairs`](https://huggingface.co/datasets/sukhrobnurali/uzbek-embedding-pairs)
— 356,278 `(anchor, positive)` pairs in the `default/train` split, used as-is:

| Source | Share | Pair type |
|---|---|---|
| Uzbek Wikipedia | ~56% | title ↔ paragraph |
| OPUS-100 uz↔en | ~34% | parallel sentence ↔ translation |
| Latin↔Cyrillic | ~10% | same sentence, two scripts |

Anchors are prefixed `query: ` and positives `passage: ` at training time, matching the
retrieval framing used at inference.

## Limitations

- Low-resource language: coverage is thinner than for high-resource languages, and rare
  domains (legal, medical, dialectal) are under-represented.
- Part of the data is OPUS-100, which carries machine-translation noise.
- The monolingual eval is a title→paragraph proxy for retrieval, not a curated query set.
- Trained on Latin-script Uzbek; Cyrillic appears only via the script-pair source, so
  Cyrillic retrieval quality is less thoroughly evaluated.

## Reproducibility

Fixed seed (42); 1 epoch of `MultipleNegativesRankingLoss` with in-batch negatives,
batch size 192, lr 2e-5, 10% warmup, `max_seq_length=192`, bf16 on Ampere. All
hyperparameters live in `config.py`; training and evaluation scripts are in the
[training repository](https://github.com/sukhrobnurali/uz-sentance-embedding).

## License

MIT, inherited from the base model `intfloat/multilingual-e5-small`.