How to use FareedKhan/mixedbread-ai_deepset-mxbai-embed-de-large-v1_FareedKhan_prime_synthetic_data_2k_3_8 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("FareedKhan/mixedbread-ai_deepset-mxbai-embed-de-large-v1_FareedKhan_prime_synthetic_data_2k_3_8")
sentences = [
"\nThe document you provided seems to be a list of compounds, some of which are well-known drugs or drugs used in experimental contexts, and others that don't appear to have recognized applications in medicine or science. The list includes substances like cortisol, a hormone, and filopram, which is related to anesthetics or possibly a misprint or misclassification. The side effects listed are also a mix, with some being plausible reactions to certain medication (like Edema, dysphagia) and others being quite unusual for commonly recognized drug interactions (like retinal vein occlusion, which is not a typical side effect of most medications).\n\nIt would be useful to have labels or references indicating which of these compounds are actually drugs and which are other chemical substances. For instance, cortisol, if given its correct context, would typically have side effects associated with cortisol therapy such as fluid retention or electrolyte imbalances.\n\nIf you need detailed information on how these substances work or what their possible side effects might be, you'll likely need to refer to a medical compendium or a pharmacology resource for accurate data. It's also important to clarify the intended use for this list, whether for educational purposes, research, or another context; the provided list appears to be a jumbled amalgamation, which might not have clear clinical relevance without additional detail.",
"Can you suggest medications targeting the GC gene/protein with a proven synergy with AVE9633?",
"Could you help identify the gene or protein that facilitates sodium-dependent transportation and elimination of organic anions, with a particular emphasis on those implicated in the cellular efflux of potentially hazardous organic anions? Additionally, I'm interested in understanding if this gene or protein also mediates the transport of drugs known to exhibit synergistic pharmacological interactions with Ractopamine.",
"Can you list the medications suitable for benign prostatic hyperplasia and tell me if any are linked to dysphagia as a side effect?"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from mixedbread-ai/deepset-mxbai-embed-de-large-v1 on the json dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
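The Pooling and Normalize modules listed above can be sketched in a few lines of NumPy. This is only an illustration of mean pooling over non-padding tokens followed by L2 normalization, not the library's implementation; the function name `mean_pool_and_normalize` and the random toy inputs are assumptions made for the example.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions, then L2-normalize."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)

# Toy stand-in for the transformer output: 2 sentences, 4 token slots, 1024 dims.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 4, 1024))
attention_mask = np.array([[1, 1, 1, 0],   # last slot is padding
                           [1, 1, 0, 0]])  # last two slots are padding
sentence_embeddings = mean_pool_and_normalize(token_embeddings, attention_mask)
print(sentence_embeddings.shape)  # (2, 1024)
```

Because the output rows have unit length, the dot product of two sentence embeddings equals their cosine similarity.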
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("FareedKhan/mixedbread-ai_deepset-mxbai-embed-de-large-v1_FareedKhan_prime_synthetic_data_2k_3_8")
# Run inference
sentences = [
'\nThe list you provided seems to be a mix of various chemical substances, some of which appear to be medications, others are chemical compounds, and a few could be substances from other fields (e.g., water treatment, food additives). To be more precise, it would be helpful to categorize them properly based on their common usage:\n\n### Medications and Drugs:\n- **Antibiotics**: Cefoxitin, Tobramycin, Amikacin\n- ** pain and inflammation relievers**: Benoxaprofen, Daptomycin, Ceftolozane, Salicylates (Benzydamine, Dexamethasone sodium phosphate)\n- **Intravenous fluids**: Magnesium trisilicate\n- **Antivirals**: Ribavirin, Inotersen\n- **Antibacterial agents**: Epirizole, Floctafenine, Flunixin\n- **Vaccines**: Vaborbactam, Brincidofovir, Adefovir\n- **Neuromodulators**: Cefatrizine, Bumadizone, Alminoprofen\n- **Cancer treatments**: Colistin, Nitrofurantoin, Sisomicin\n\n### Chemical Compounds:\n- **Salts and salts of acidity**: Fosfomycin, Azosemide, Mofebutazone\n- **Amino acids**: Phenylalanine, Nitrosalicylic',
'Which drugs interact with the SERPINA1 gene/protein as carriers?',
'Is there a regulatory function associated with the epidermal growth factor receptor or its interacting proteins in the control of genes or proteins that participate in the inactivation of fast sodium channels during Phase 1 of cardiac action potential propagation?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
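The evaluation in this card is reported at 768 dimensions, so it may be useful to see what truncating a 1024-dimensional embedding looks like. The sketch below uses random unit vectors in place of real `model.encode(...)` output, and `truncate_and_renormalize` is a hypothetical helper written for this example. Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which is likely the more convenient route in practice.

```python
import numpy as np

def truncate_and_renormalize(embeddings, dim=768):
    """Keep the first `dim` components and rescale each row to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Random unit vectors stand in for the (3, 1024) output of model.encode(...).
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, dim=768)
similarities = small @ small.T  # cosine similarity, since rows are unit length
print(small.shape, similarities.shape)  # (3, 768) (3, 3)
```

Renormalizing after truncation matters because the first 768 components of a unit vector generally no longer have unit length.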
Dataset dim_768, evaluated with InformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.3911 |
| cosine_accuracy@3 | 0.4752 |
| cosine_accuracy@5 | 0.495 |
| cosine_accuracy@10 | 0.5545 |
| cosine_precision@1 | 0.3911 |
| cosine_precision@3 | 0.1584 |
| cosine_precision@5 | 0.099 |
| cosine_precision@10 | 0.0554 |
| cosine_recall@1 | 0.3911 |
| cosine_recall@3 | 0.4752 |
| cosine_recall@5 | 0.495 |
| cosine_recall@10 | 0.5545 |
| cosine_ndcg@10 | 0.467 |
| cosine_mrr@10 | 0.4398 |
| cosine_map@100 | 0.4462 |
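In the table above, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k (for example 0.4752 / 3 = 0.1584), which suggests that each query has exactly one relevant passage. Under that assumption, the per-query metrics can be sketched as follows. The function names and the toy rankings are hypothetical and only illustrate the definitions; they are not the evaluator's code.

```python
def ir_metrics_at_k(ranked_ids, relevant_id, k):
    """Metrics for one query that has a single relevant document."""
    top_k = ranked_ids[:k]
    hit = relevant_id in top_k
    rank = top_k.index(relevant_id) + 1 if hit else None
    return {
        "accuracy": 1.0 if hit else 0.0,
        "precision": (1.0 / k) if hit else 0.0,  # one relevant item among k retrieved
        "recall": 1.0 if hit else 0.0,           # the single relevant item was found
        "mrr": (1.0 / rank) if hit else 0.0,     # reciprocal rank of the relevant item
    }

def average(metric_dicts):
    """Average each metric over all queries."""
    keys = metric_dicts[0].keys()
    return {key: sum(m[key] for m in metric_dicts) / len(metric_dicts) for key in keys}

# Hypothetical rankings: (document ids in ranked order, id of the relevant document)
queries = [
    (["d1", "d7", "d3"], "d1"),  # relevant document at rank 1
    (["d4", "d2", "d9"], "d9"),  # relevant document at rank 3
    (["d5", "d6", "d8"], "d0"),  # relevant document not retrieved
]
scores = average([ir_metrics_at_k(ranked, rel, k=3) for ranked, rel in queries])
print(scores)
```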
Columns: positive and anchor

| | positive | anchor |
|---|---|---|
| type | string | string |

| positive | anchor |
|---|---|
| | Identify common genetic targets that interact with both N-(3,5-dibromo-4-hydroxyphenyl)benzamide and 1-Naphthylamine-5-sulfonic acid. |
| | Which anatomical structures lack expression of genes or proteins involved in the homogentisate degradation pathway? |
| | Identify genes or proteins that interact with angiotensin-converting enzyme 2 (ACE2) and are linked to a common phenotype or effect. |
MatryoshkaLoss with these parameters:
{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768
],
"matryoshka_weights": [
1
],
"n_dims_per_step": -1
}
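To make the configuration above concrete, here is a minimal NumPy sketch of the idea: an in-batch ranking loss (each anchor should score its own positive above the other positives in the batch) applied to the first 768 components of the embeddings with weight 1. It is a sketch under stated assumptions rather than the library's implementation. The similarity scale of 20 and the helper names are assumptions, and the toy anchors and positives are random data.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length."""
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities; positives[i] is the target for anchors[i]."""
    scores = scale * normalize(anchors) @ normalize(positives).T  # (batch, batch)
    scores = scores - scores.max(axis=1, keepdims=True)           # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_loss(anchors, positives, dims=(768,), weights=(1.0,)):
    """Weighted sum of the ranking loss on the first d components, for each d in dims."""
    return sum(w * in_batch_ranking_loss(anchors[:, :d], positives[:, :d])
               for d, w in zip(dims, weights))

# Toy batch of 8 pairs: each positive is a noisy copy of its anchor.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 1024))
positives = anchors + 0.1 * rng.normal(size=(8, 1024))

loss_matched = matryoshka_loss(anchors, positives)
loss_shuffled = matryoshka_loss(anchors, np.roll(positives, 1, axis=0))  # misaligned pairs
print(loss_matched < loss_shuffled)  # True
```

With a single entry in `matryoshka_dims`, the loss reduces to the ranking loss on the 768-dimensional prefix alone.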
Non-default hyperparameters:

- eval_strategy: epoch
- learning_rate: 1e-05
- warmup_ratio: 0.1
- bf16: True
- tf32: False
- load_best_model_at_end: True

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: epoch
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 1e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: False
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional

| Epoch | Step | Training Loss | dim_768_cosine_map@100 |
|---|---|---|---|
| 0 | 0 | - | 0.3930 |
| 0.0441 | 10 | 1.18 | - |
| 0.0881 | 20 | 1.0507 | - |
| 0.1322 | 30 | 0.9049 | - |
| 0.1762 | 40 | 0.8999 | - |
| 0.2203 | 50 | 0.6519 | - |
| 0.2643 | 60 | 0.5479 | - |
| 0.3084 | 70 | 0.6493 | - |
| 0.3524 | 80 | 0.4706 | - |
| 0.3965 | 90 | 0.5459 | - |
| 0.4405 | 100 | 0.5692 | - |
| 0.4846 | 110 | 0.7834 | - |
| 0.5286 | 120 | 0.5341 | - |
| 0.5727 | 130 | 0.5343 | - |
| 0.6167 | 140 | 0.4865 | - |
| 0.6608 | 150 | 0.3942 | - |
| 0.7048 | 160 | 0.3578 | - |
| 0.7489 | 170 | 0.5158 | - |
| 0.7930 | 180 | 0.3426 | - |
| 0.8370 | 190 | 0.5789 | - |
| 0.8811 | 200 | 0.5271 | - |
| 0.9251 | 210 | 0.577 | - |
| 0.9692 | 220 | 0.5193 | - |
| 1.0 | 227 | - | 0.4354 |
| 1.0132 | 230 | 0.4598 | - |
| 1.0573 | 240 | 0.2735 | - |
| 1.1013 | 250 | 0.2919 | - |
| 1.1454 | 260 | 0.3206 | - |
| 1.1894 | 270 | 0.2851 | - |
| 1.2335 | 280 | 0.3899 | - |
| 1.2775 | 290 | 0.3279 | - |
| 1.3216 | 300 | 0.2155 | - |
| 1.3656 | 310 | 0.3471 | - |
| 1.4097 | 320 | 0.327 | - |
| 1.4537 | 330 | 0.229 | - |
| 1.4978 | 340 | 0.2902 | - |
| 1.5419 | 350 | 0.3216 | - |
| 1.5859 | 360 | 0.2902 | - |
| 1.6300 | 370 | 0.4527 | - |
| 1.6740 | 380 | 0.1583 | - |
| 1.7181 | 390 | 0.3144 | - |
| 1.7621 | 400 | 0.2573 | - |
| 1.8062 | 410 | 0.2309 | - |
| 1.8502 | 420 | 0.3475 | - |
| 1.8943 | 430 | 0.3082 | - |
| 1.9383 | 440 | 0.3176 | - |
| 1.9824 | 450 | 0.2104 | - |
| 2.0 | 454 | - | 0.4453 |
| 2.0264 | 460 | 0.2615 | - |
| 2.0705 | 470 | 0.1599 | - |
| 2.1145 | 480 | 0.1015 | - |
| 2.1586 | 490 | 0.2154 | - |
| 2.2026 | 500 | 0.1161 | - |
| 2.2467 | 510 | 0.2208 | - |
| 2.2907 | 520 | 0.2035 | - |
| 2.3348 | 530 | 0.1622 | - |
| 2.3789 | 540 | 0.1758 | - |
| 2.4229 | 550 | 0.2782 | - |
| 2.4670 | 560 | 0.303 | - |
| 2.5110 | 570 | 0.1787 | - |
| 2.5551 | 580 | 0.2221 | - |
| 2.5991 | 590 | 0.1686 | - |
| 2.6432 | 600 | 0.2522 | - |
| 2.6872 | 610 | 0.1334 | - |
| 2.7313 | 620 | 0.1102 | - |
| 2.7753 | 630 | 0.2499 | - |
| 2.8194 | 640 | 0.2648 | - |
| 2.8634 | 650 | 0.1859 | - |
| 2.9075 | 660 | 0.2385 | - |
| 2.9515 | 670 | 0.2283 | - |
| 2.9956 | 680 | 0.1126 | - |
| 3.0 | 681 | - | 0.4462 |
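The log above covers 681 optimizer steps (227 per epoch over 3 epochs). With `learning_rate: 1e-05`, `warmup_ratio: 0.1` and `lr_scheduler_type: linear`, the learning rate at each step can be approximated as below. This is a sketch of a linear warmup followed by linear decay. Rounding the warmup length up with `math.ceil` is an assumption about how the trainer derives it from the ratio, and `linear_warmup_lr` is a hypothetical helper.

```python
import math

def linear_warmup_lr(step, total_steps=681, base_lr=1e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding; 69 here
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

for step in (0, 10, 69, 227, 454, 681):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the peak rate of 1e-05 is reached at step 69, roughly a third of the way through the first epoch, and the rate decays to zero at step 681.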
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}