---
language:
- it
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- matryoshka
- information-retrieval
- generated_from_trainer
dataset_size: 50749
loss:
- MatryoshkaLoss
- CachedMultipleNegativesRankingLoss
- CoSENTLoss
base_model: nickprock/multi-sentence-BERTino
widget:
- source_sentence: >-
    Ci stiamo muovendo "... rispetto al commovente telaio cosmico di riposo ...
    a circa 371 km/s verso la costellazione del Leone".
  sentences:
  - Una donna sta tagliando le cipolle verdi.
  - Non c'è un 'fermo' che non sia relativo a qualche altro oggetto.
  - Un gruppo di anziani si mette in posa attorno a un tavolo da pranzo.
- source_sentence: L'uomo ha parlato con una ragazza attraverso la telecamera di internet.
  sentences:
  - La ragazza è in piedi davanti alla porta aperta dell'autobus.
  - Il giocatore di basket sta per segnare punti per la sua squadra.
  - Un adolescente parla con una ragazza tramite una webcam.
- source_sentence: Qual è stato il risultato della finale del Cincinnati Open 1971?
  sentences:
  - >-
    Partecipò ai Giochi della II Olimpiade di Parigi del 1900 conquistando una
    medaglia d'argento nel rugby a 15 con il SC 1880 Frankfurt, squadra
    rappresentante la Germania.
  - |-
    Biografia
    Ha un gemello, chiamato Thomas, anch'egli calciatore professionista.
  - >-
    Il singolare maschile del torneo di tennis Cincinnati Open 1971, facente
    parte della categoria Grand Prix, ha avuto come vincitore Stan Smith che
    ha battuto in finale Juan Gisbert 7-6, 6-3.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
- pearson_cosine
- spearman_cosine
license: apache-2.0
datasets:
- nickprock/it-wiki-retrieval-synthetic-hn
- PhilipMay/stsb_multi_mt
---

# multi-sentence-BERTino (V5 - Matryoshka Enhanced)

This is a state-of-the-art [sentence-transformers](https://www.SBERT.net) model for the Italian language.
It maps sentences and paragraphs to a flexible dense vector space (up to 768 dimensions) and is highly optimized for semantic search, retrieval-augmented generation (RAG), and semantic textual similarity.

## What's New in V5

V5 improves upon V4 with a focus on **better embedding compression**: the 128-dimension truncated vectors are significantly stronger, making this version more practical for production deployments with storage or latency constraints.

Key changes from V4:

- **`CoSENTLoss`** replaces `CosineSimilarityLoss` for the STS task, yielding better ranking-aware similarity training.
- **Asymmetric Matryoshka weights revised** from `[1.0, 0.3, 0.15, 0.1]` to `[1.0, 0.4, 0.2, 0.2]`, placing more training pressure on the 128d subspace.
- **Cosine LR scheduler** with `weight_decay=0.01` for more stable optimization.
- **STS dataset balanced** to match retrieval dataset size, preventing disproportionate STS gradient updates.

## Model Highlights: Matryoshka Representation Learning

This model was fine-tuned using **Matryoshka Representation Learning (MRL)**. The model has learned to hierarchically compress its semantic knowledge into the earliest dimensions of the vector.

You can safely truncate the output embeddings to **512, 256, or 128 dimensions** with minimal degradation in retrieval metrics. Truncating to 128 dimensions allows you to **save up to 83% of storage costs** in vector databases (like Pinecone, Qdrant, or Milvus) and drastically speed up similarity searches, while still outperforming standard 128d baselines.

The model was trained exclusively on **Semantic Hard Negatives** (mined via dense bi-encoder self-retrieval) to prevent the "false-negative" traps commonly caused by traditional BM25 lexical mining.
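To put the revised Matryoshka weights in perspective, the short sketch below computes each dimension's share of the total loss weight for V4 and V5. It assumes the total loss is a plain weighted sum of the per-dimension losses, which is how `MatryoshkaLoss` in Sentence Transformers is understood to combine them; the helper names `weight_shares` and `matryoshka_total` are illustrative and not part of any library.

```python
# Matryoshka dimensions and weights as listed in this card.
dims = [768, 512, 256, 128]
v4_weights = [1.0, 0.3, 0.15, 0.1]
v5_weights = [1.0, 0.4, 0.2, 0.2]


def weight_shares(weights):
    """Return each weight as a fraction of the total weight."""
    total = sum(weights)
    return [w / total for w in weights]


def matryoshka_total(per_dim_losses, weights):
    """Weighted sum of per-dimension losses (assumed combination rule)."""
    return sum(w * loss for w, loss in zip(weights, per_dim_losses))


v4_shares = weight_shares(v4_weights)
v5_shares = weight_shares(v5_weights)

for d, s4, s5 in zip(dims, v4_shares, v5_shares):
    print(f"{d:>4}d  V4 share: {s4:.1%}  V5 share: {s5:.1%}")
```

Under that assumption, the 128d term goes from about 6.5% of the total weight in V4 to about 11.1% in V5, which is one way to read "more training pressure on the 128d subspace".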
## Usage

### Direct Usage (Sentence Transformers)

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

**Standard Usage (Full 768 Dimensions):**

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nickprock/multi-sentence-BERTino")

# A query and a passage that answers it (Italian inputs for an Italian model).
sentences = [
    "Chi ha dipinto la Gioconda?",
    "Leonardo da Vinci è l'autore della Gioconda, opera conservata al Louvre.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 768)
```

**Optimized Usage (Truncated to 128 Dimensions):**

```python
from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the first 128 dimensions of every embedding.
model = SentenceTransformer("nickprock/multi-sentence-BERTino", truncate_dim=128)

sentences = [
    "Chi ha dipinto la Gioconda?",
    "Leonardo da Vinci è l'autore della Gioconda, opera conservata al Louvre.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 128) -> 83% less memory!
```

## Evaluation Metrics

Evaluated on a 5% hold-out split of the Italian retrieval dataset (with semantic hard negatives) and the Italian STS-B dev set, using a standalone evaluation after training.

### Information Retrieval

| Metric | 768d (Full) | 128d (Truncated) |
|:-------|:------------|:-----------------|
| **MAP@100** | 0.8398 | **0.8065** |
| **NDCG@10** | 0.8680 | **0.8372** |
| **Accuracy@1** | 0.7617 | **0.7233** |
| **Accuracy@10** | 0.9584 | 0.9384 |

### Comparison with V4

Percentages in the last column are relative changes from V4 (128d) to V5 (128d).

| Metric | V4 (768d) | V5 (768d) | V4 (128d) | V5 (128d) |
|:-------|:----------|:----------|:----------|:----------|
| **MAP@100** | 0.8397 | **0.8398** | 0.8002 | **0.8065** (+0.79%) |
| **NDCG@10** | **0.8688** | 0.8680 | 0.8332 | **0.8372** (+0.48%) |
| **Accuracy@1** | 0.7593 | **0.7617** | 0.7145 | **0.7233** (+1.23%) |

> V5 is essentially unchanged at 768d and improves every 128d metric by roughly 0.5% to 1.2% (relative), making compressed embeddings more reliable.
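The relative changes for the 128d columns can be recomputed directly from the table values. A minimal sketch, using only the numbers in the comparison table; the variable and function names are illustrative.

```python
# 128d scores copied from the comparison table: metric -> (V4, V5)
scores_128d = {
    "MAP@100": (0.8002, 0.8065),
    "NDCG@10": (0.8332, 0.8372),
    "Accuracy@1": (0.7145, 0.7233),
}


def relative_change_pct(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


changes = {
    metric: relative_change_pct(v4, v5)
    for metric, (v4, v5) in scores_128d.items()
}

for metric, pct in changes.items():
    print(f"{metric}: {pct:+.2f}%")
```

These are relative changes; the absolute differences are smaller (for example, 0.0063 for MAP@100).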
### Semantic Textual Similarity (STS-B Italian Dev)

| Metric | V4 | V5 |
|:-------|:---|:---|
| **Spearman Cosine** | 0.8540 | **0.8549** |
| **Pearson Cosine** | 0.8574 | **0.8574** |

## Training Details

### Loss Functions

The model was trained in a multi-task setup utilizing Gradient Caching for massive logical batch sizes, wrapped inside a Matryoshka Loss:

1. **Information Retrieval Task:** `CachedMultipleNegativesRankingLoss` with `mini_batch_size=16` and a logical `batch_size=128`.
2. **Semantic Similarity Task:** `CoSENTLoss` (upgraded from `CosineSimilarityLoss` in V4).

Both base losses were wrapped in `MatryoshkaLoss` targeting dimensions `[768, 512, 256, 128]` with weights `[1.0, 0.4, 0.2, 0.2]`.

### Training Datasets

- **task_retrieval**: ~45,000 synthetic Italian search queries generated via LLM (Qwen-2.5-7B) from Italian Wikipedia paragraphs. Each query is paired with 1 positive document and 2 Dense Hard Negatives.
- **task_sts**: The Italian split of `stsb_multi_mt`, balanced to match the retrieval dataset size.
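This card does not spell out how the dense hard negatives were mined, so the following is only a generic sketch of the idea: embed queries and documents with a bi-encoder, then keep the most similar documents that are not the labelled positive. The toy 3-dimensional vectors and the helper `pick_hard_negatives` are hypothetical stand-ins for real embeddings and a real mining pipeline.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_hard_negatives(query_vec, doc_vecs, positive_idx, k=2):
    """Return indices of the k non-positive documents most similar to the query."""
    candidates = [
        (cosine(query_vec, vec), idx)
        for idx, vec in enumerate(doc_vecs)
        if idx != positive_idx
    ]
    ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in ranked[:k]]


# Toy embeddings (hypothetical): index 0 is the labelled positive.
query = [1.0, 0.0, 0.0]
docs = [
    [0.9, 0.1, 0.0],  # 0: positive passage
    [0.8, 0.6, 0.0],  # 1: close to the query, but not the answer
    [0.0, 1.0, 0.0],  # 2: unrelated
    [0.7, 0.0, 0.7],  # 3: partially related
    [0.0, 0.0, 1.0],  # 4: unrelated
]

hard_negatives = pick_hard_negatives(query, docs, positive_idx=0, k=2)
print(hard_negatives)
```

Here `k=2` mirrors the two hard negatives per query mentioned above. Real pipelines usually filter the candidates further (for example, by a similarity margin relative to the positive) to avoid keeping unlabelled true positives; that step is omitted from this sketch.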
### Hyperparameters

| Parameter | Value |
|:----------|:------|
| `per_device_train_batch_size` | 128 |
| `num_train_epochs` | 4 |
| `learning_rate` | 1e-05 |
| `lr_scheduler_type` | cosine |
| `warmup_steps` | 10% |
| `weight_decay` | 0.01 |
| `fp16` | True |
| `batch_sampler` | no_duplicates |
| `best_checkpoint` | step 1250 (epoch ~1.76) |

## Citation

### BibTeX

#### MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### CachedMultipleNegativesRankingLoss

```bibtex
@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### CoSENTLoss

```bibtex
@misc{su2022cosent,
    title={CoSENT: A More Efficient Sentence Vector Training Method Than Sentence-BERT},
    author={Jianlin Su},
    year={2022},
    howpublished={\url{https://kexue.fm/archives/8847}}
}
```