---
language:
- it
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- matryoshka
- information-retrieval
- generated_from_trainer
dataset_size: 50749
loss:
- MatryoshkaLoss
- CachedMultipleNegativesRankingLoss
- CoSENTLoss
base_model: nickprock/multi-sentence-BERTino
widget:
- source_sentence: >-
    Ci stiamo muovendo "... rispetto al commovente telaio cosmico di riposo ...
    a circa 371 km/s verso la costellazione del Leone".
  sentences:
  - Una donna sta tagliando le cipolle verdi.
  - Non c'è un 'fermo' che non sia relativo a qualche altro oggetto.
  - Un gruppo di anziani si mette in posa attorno a un tavolo da pranzo.
- source_sentence: L'uomo ha parlato con una ragazza attraverso la telecamera di internet.
  sentences:
  - La ragazza è in piedi davanti alla porta aperta dell'autobus.
  - Il giocatore di basket sta per segnare punti per la sua squadra.
  - Un adolescente parla con una ragazza tramite una webcam.
- source_sentence: Qual è stato il risultato della finale del Cincinnati Open 1971?
  sentences:
  - >-
    Partecipò ai Giochi della II Olimpiade di Parigi del 1900 conquistando una
    medaglia d'argento nel rugby a 15 con il SC 1880 Frankfurt, squadra
    rappresentante la Germania.
  - |-
    Biografia
    Ha un gemello, chiamato Thomas, anch'egli calciatore professionista.
  - >-
    Il singolare maschile del torneo di tennis Cincinnati Open 1971, facente
    parte della categoria Grand Prix, ha avuto come vincitore Stan Smith che ha
    battuto in finale Juan Gisbert 7-6, 6-3.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
- pearson_cosine
- spearman_cosine
license: apache-2.0
datasets:
- nickprock/it-wiki-retrieval-synthetic-hn
- PhilipMay/stsb_multi_mt
---

# multi-sentence-BERTino (V5 - Matryoshka Enhanced)

This is a [sentence-transformers](https://www.SBERT.net) model for the Italian language. It maps sentences and paragraphs to a dense vector space of up to 768 dimensions and is optimized for semantic search, retrieval-augmented generation (RAG), and semantic textual similarity.

## What's New in V5

V5 improves upon V4 with a focus on **better embedding compression**: the 128-dimension truncated vectors score higher than V4 on every retrieval metric reported below, making this version more practical for production deployments with storage or latency constraints.

Key changes from V4:
- **`CoSENTLoss`** replaces `CosineSimilarityLoss` for the STS task, yielding better ranking-aware similarity training.
- **Asymmetric Matryoshka weights revised** from `[1.0, 0.3, 0.15, 0.1]` to `[1.0, 0.4, 0.2, 0.2]`, placing more training pressure on the 128d subspace.
- **Cosine LR scheduler** with `weight_decay=0.01` for more stable optimization.
- **STS dataset balanced** to match retrieval dataset size, preventing disproportionate STS gradient updates.
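
To make the weight change concrete, the short calculation below compares the share of the total Matryoshka weight assigned to each dimension in V4 and V5. It assumes the per-dimension losses are combined as a plain weighted sum and that the V4 weights map to the same dimension order (768, 512, 256, 128) as the V5 weights; the helper name `weight_shares` is illustrative only.

```python
# Matryoshka dimensions and per-dimension loss weights quoted in this card.
DIMS = [768, 512, 256, 128]
V4_WEIGHTS = [1.0, 0.3, 0.15, 0.1]
V5_WEIGHTS = [1.0, 0.4, 0.2, 0.2]


def weight_shares(weights):
    """Return each weight as a fraction of the total weight."""
    total = sum(weights)
    return [w / total for w in weights]


v4_shares = weight_shares(V4_WEIGHTS)
v5_shares = weight_shares(V5_WEIGHTS)

for dim, old, new in zip(DIMS, v4_shares, v5_shares):
    print(f"{dim:>4}d  V4 share: {old:.1%}  V5 share: {new:.1%}")
```

Under these assumptions the 128d term grows from roughly 6.5% to roughly 11.1% of the total weight, while the 768d term drops from roughly 64.5% to roughly 55.6%.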

## Model Highlights: Matryoshka Representation Learning

This model was fine-tuned using **Matryoshka Representation Learning (MRL)**. The model has learned to hierarchically compress its semantic knowledge into the earliest dimensions of the vector. You can safely truncate the output embeddings to **512, 256, or 128 dimensions** with minimal degradation in retrieval metrics.
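
If full 768d embeddings are already stored, they can be shortened after the fact by keeping the leading dimensions and re-normalizing. The sketch below is dependency-free and uses a toy vector in place of a real embedding; `truncate_and_normalize` is a hypothetical helper, not part of the Sentence Transformers API.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = [float(x) for x in vector[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


# Toy 8-dimensional unit vector standing in for a 768d model output.
full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
short = truncate_and_normalize(full, 2)
print(short)
```

Re-normalizing matters when the index scores by dot product rather than cosine similarity, because a prefix of a unit vector is generally shorter than unit length.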

Truncating to 128 dimensions shrinks each vector to one sixth of its full size, which **saves up to 83% of storage** in vector databases (such as Pinecone, Qdrant, or Milvus) and speeds up similarity search, while still outperforming standard 128d baselines.
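
The 83% figure follows directly from the dimension ratio. The arithmetic below assumes uncompressed float32 vectors (4 bytes per value) and ignores index overhead and metadata, which vary by database; the corpus size of one million vectors is an arbitrary example.

```python
BYTES_PER_FLOAT32 = 4


def storage_bytes(num_vectors, dim):
    """Raw size of `num_vectors` float32 embeddings with `dim` values each."""
    return num_vectors * dim * BYTES_PER_FLOAT32


num_vectors = 1_000_000  # arbitrary example corpus size
full = storage_bytes(num_vectors, 768)
small = storage_bytes(num_vectors, 128)
saving = 1 - small / full

print(f"768d: {full / 1e9:.3f} GB")
print(f"128d: {small / 1e9:.3f} GB")
print(f"saving: {saving:.1%}")
```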

For the retrieval task, all hard negatives are **Semantic Hard Negatives** (mined via dense bi-encoder self-retrieval) rather than BM25 lexical matches, which avoids the false negatives that lexical mining commonly introduces.

## Usage

### Direct Usage (Sentence Transformers)

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

**Standard Usage (Full 768 Dimensions):**
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nickprock/multi-sentence-BERTino")

sentences = [
    'Chi ha dipinto la Gioconda?',
    'Leonardo da Vinci è l\'autore della Gioconda, opera conservata al Louvre.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 768)
```

**Optimized Usage (Truncated to 128 Dimensions):**
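```python
from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the first 128 dimensions of each embedding
model = SentenceTransformer("nickprock/multi-sentence-BERTino", truncate_dim=128)

sentences = [
    'Chi ha dipinto la Gioconda?',
    'Leonardo da Vinci è l\'autore della Gioconda, opera conservata al Louvre.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 128) -> 83% less memory!
```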
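
For retrieval, embeddings are typically compared with cosine similarity. The following is a minimal, dependency-free sketch of ranking documents against a query; `cosine` and `rank_documents` are hypothetical helpers, and `model` is assumed to be any object whose `encode` method returns one vector per input text, such as the `SentenceTransformer` instances loaded above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for zero vectors; treat as no match
    return dot / (norm_a * norm_b)


def rank_documents(model, query, documents, top_k=3):
    """Return the `top_k` (score, document) pairs, best match first."""
    query_emb = model.encode([query])[0]
    doc_embs = model.encode(documents)
    scored = [(cosine(query_emb, emb), doc) for emb, doc in zip(doc_embs, documents)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Example call, assuming `model` was loaded as shown earlier:
# rank_documents(model, "Chi ha dipinto la Gioconda?", sentences[1:], top_k=1)
```

For large corpora a vector database or a batched similarity routine would replace the Python loop; the sketch only illustrates the scoring logic.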

## Evaluation Metrics

Evaluated on a 5% hold-out split of the Italian retrieval dataset (with semantic hard negatives) and the Italian STS-B dev set, using a standalone evaluation after training.

### Information Retrieval

| Metric | 768d (Full) | 128d (Truncated) |
|:-------|:------------|:-----------------|
| **MAP@100** | 0.8398 | **0.8065** |
| **NDCG@10** | 0.8680 | **0.8372** |
| **Accuracy@1** | 0.7617 | **0.7233** |
| **Accuracy@10** | 0.9584 | 0.9384 |
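
One way to read this table is as the fraction of full-dimension quality retained after truncation. The snippet below computes that ratio from the values above; the term "retention" and the variable names are introduced here for illustration.

```python
# (768d score, 128d score) pairs copied from the table above.
scores = {
    "MAP@100": (0.8398, 0.8065),
    "NDCG@10": (0.8680, 0.8372),
    "Accuracy@1": (0.7617, 0.7233),
    "Accuracy@10": (0.9584, 0.9384),
}

retention = {name: small / full for name, (full, small) in scores.items()}

for name, ratio in retention.items():
    print(f"{name:<12} 128d retains {ratio:.1%} of the 768d score")
```

Each metric retains between roughly 95% and 98% of its 768d value while using one sixth of the dimensions.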

### Comparison with V4

| Metric | V4 (768d) | V5 (768d) | V4 (128d) | V5 (128d) |
|:-------|:----------|:----------|:----------|:----------|
| **MAP@100** | 0.8397 | **0.8398** | 0.8002 | **0.8065** (+0.79%) |
| **NDCG@10** | **0.8688** | 0.8680 | 0.8332 | **0.8372** (+0.48%) |
| **Accuracy@1** | 0.7593 | **0.7617** | 0.7145 | **0.7233** (+1.23%) |

> V5 is essentially unchanged at 768d and improves every listed metric at 128d, making compressed embeddings more reliable.

### Semantic Textual Similarity (STS-B Italian Dev)

| Metric | V4 | V5 |
|:-------|:---|:---|
| **Spearman Cosine** | 0.8540 | **0.8549** |
| **Pearson Cosine** | 0.8574 | 0.8574 |

## Training Details

### Loss Functions

The model was trained in a multi-task setup. Gradient caching keeps the large logical batch within memory limits, and each task loss is wrapped in a Matryoshka loss:

1. **Information Retrieval Task:** `CachedMultipleNegativesRankingLoss` with `mini_batch_size=16` and a logical `batch_size=128`.
2. **Semantic Similarity Task:** `CoSENTLoss` (upgraded from `CosineSimilarityLoss` in V4).

Both base losses were wrapped in `MatryoshkaLoss` targeting dimensions `[768, 512, 256, 128]` with weights `[1.0, 0.4, 0.2, 0.2]`.
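
The loss setup described above could be assembled roughly as follows. This is a sketch based on the public Sentence Transformers loss classes, not the author's actual training script: the function name `build_losses` and the dictionary keys are assumptions (the keys mirror the task names used in this card), and all loss parameters not stated in this card are left at library defaults.

```python
MATRYOSHKA_DIMS = [768, 512, 256, 128]
MATRYOSHKA_WEIGHTS = [1.0, 0.4, 0.2, 0.2]


def build_losses(model):
    """Wrap both task losses in MatryoshkaLoss (hypothetical helper)."""
    # Imported inside the function so the constants above can be reused
    # without the library installed.
    from sentence_transformers.losses import (
        CachedMultipleNegativesRankingLoss,
        CoSENTLoss,
        MatryoshkaLoss,
    )

    retrieval_loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
    sts_loss = CoSENTLoss(model)

    def wrap(base_loss):
        return MatryoshkaLoss(
            model,
            base_loss,
            matryoshka_dims=MATRYOSHKA_DIMS,
            matryoshka_weights=MATRYOSHKA_WEIGHTS,
        )

    # Keys are assumed to match the names of the training datasets.
    return {"task_retrieval": wrap(retrieval_loss), "task_sts": wrap(sts_loss)}
```

A dictionary of this shape can be passed as the `loss` argument of `SentenceTransformerTrainer` when the training data is a dictionary of datasets with the same keys, so that each dataset is trained with its own loss.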

### Training Datasets

- **task_retrieval**: ~45,000 synthetic Italian search queries generated by an LLM (Qwen-2.5-7B) from Italian Wikipedia paragraphs. Each query is paired with 1 positive document and 2 dense hard negatives.
- **task_sts**: The Italian split of `stsb_multi_mt`, balanced to match the retrieval dataset size.
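
To give a sense of how many negatives each retrieval query is contrasted against, the arithmetic below combines the logical batch size with the "1 positive + 2 hard negatives" layout. It assumes the usual multiple-negatives-ranking behaviour, in which every positive and hard negative in the batch serves as a candidate for every query, and that every batch is full; the variable names are illustrative.

```python
import math

batch_size = 128              # logical batch size stated above
mini_batch_size = 16          # chunk size used by gradient caching
hard_negatives_per_query = 2  # from the retrieval dataset description

# Each row contributes its positive plus its hard negatives to the candidate pool.
candidates_per_query = batch_size * (1 + hard_negatives_per_query)
# Everything in the pool except the query's own positive acts as a negative.
negatives_per_query = candidates_per_query - 1

# Gradient caching embeds the batch in chunks to bound memory use.
forward_chunks = math.ceil(batch_size / mini_batch_size)

print(candidates_per_query, negatives_per_query, forward_chunks)
```

Under these assumptions each query is scored against 384 candidates (383 of them negatives), and the batch is embedded in 8 chunks of 16.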

### Hyperparameters

| Parameter | Value |
|:----------|:------|
| `per_device_train_batch_size` | 128 |
| `num_train_epochs` | 4 |
| `learning_rate` | 1e-05 |
| `lr_scheduler_type` | cosine |
| `warmup_steps` | 10% |
| `weight_decay` | 0.01 |
| `fp16` | True |
| `batch_sampler` | no_duplicates |
| `best_checkpoint` | step 1250 (epoch ~1.76) |

## Citation

### BibTeX

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### CachedMultipleNegativesRankingLoss
```bibtex
@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### CoSENTLoss
```bibtex
@misc{su2022cosent,
    title={CoSENT: A More Efficient Sentence Vector Training Method Than Sentence-BERT},
    author={Jianlin Su},
    year={2022},
    howpublished={\url{https://kexue.fm/archives/8847}}
}
```