---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:9998000
- loss:XTRPrimeQA
base_model: BAAI/bge-small-en-v1.5
datasets:
- bclavie/msmarco-10m-triplets
pipeline_tag: sentence-similarity
library_name: PyLate
metrics:
- accuracy
model-index:
- name: PyLate model based on BAAI/bge-small-en-v1.5
  results:
  - task:
      type: col-berttriplet
      name: Col BERTTriplet
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: accuracy
      value: 0.9925000667572021
      name: Accuracy
---

# PyLate model based on BAAI/bge-small-en-v1.5

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) on the [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets) dataset. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

## Model Details

### Model Description

- **Model Type:** PyLate model
- **Base model:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- **Document Length:** 300 tokens
- **Query Length:** 32 tokens
- **Output Dimensionality:** 128 dimensions
- **Similarity Function:** MaxSim
- **Training Dataset:**
    - [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets)

### Model Sources

- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)

### Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 300, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
```

## Usage

First install the
PyLate library:

```bash
pip install -U pylate
```

### Retrieval

Use this model with PyLate to index and retrieve documents. The index uses [FastPLAID](https://github.com/lightonai/fast-plaid) for efficient similarity search.

#### Indexing documents

Load the ColBERT model and initialize the PLAID index, then encode and index your documents:

```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="pylate_model_id",
)

# Step 2: Initialize the PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)
```

#### Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the IDs and relevance scores of the top matches:

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```

### Reranking

If you only want to use the ColBERT model to rerank the results of your first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

# One list of candidate documents per query
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

# IDs aligned with the candidate documents above
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="pylate_model_id",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

## Evaluation

### Metrics

#### Col BERTTriplet

* Evaluated with pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator

| Metric       | Value      |
|:-------------|:-----------|
| **accuracy** | **0.9925** |

## Training Details

### Training Dataset

#### msmarco-10m-triplets

* Dataset: [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets) at [8c5139a](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets/tree/8c5139a245a5997992605792faa49ec12a6eb5f2)
* Size: 9,998,000 training samples
* Columns: query,
positive, and negative
* Approximate statistics based on the first 1000 samples:
  |         | query  | positive | negative |
  |:--------|:-------|:---------|:---------|
  | type    | string | string   | string   |
  | details |        |          |          |
* Samples:
  | query | positive | negative |
  |:------|:---------|:---------|
  | what can black mold do | Two of the better-known toxic molds include Stachybotrys chartarum (black mold), which can cause everything from headaches to cancer, and Aspergillus, which can cause severe lung infections, or progress to whole-body infections. Mold is particularly dangerous for infants and children. | Learn more about mold and health effects in a A Brief Guide to Mold, Moisture and Your Home The entire booklet: A Brief Guide to Mold, Moisture and Your Home (Web Version) A Brief Guide to Mold, Moisture and Your Home (Print Version) Top of Page |
  | what factors increase a population growth | Quick Answer. Factors that cause population growth include increased food production, improved health care services, immigration and high birth rate. These factors have led to overpopulation, which has more negative effects than positive impacts. | The increase in crawfish size during molting, and the length of time between molts, can vary greatly and are affected by factors such as water temperature, water quality, food quality and quantity, population density, oxygen levels and to a lesser extent by genetic influences. |
  | is herpes spread through saliva | This type of herpes is transmittable through contact with the saliva or the herpes blisters (cold sores) of an infected person. This said – yes, it is entirely possible to get herpes from kissing.It is also possible, though less common, that herpes type 1 might spread to genital regions through oral sex.enital herpes can spread to the mouth through oral sex. Once you have contracted either type of herpes virus you will be a carrier for life. However, both types tend to become less severe with the passing of time and though they may still be contagious to others, many times people stop having breakouts at all. | Introduction. Herpes simplex virus (HSV) infections are very common worldwide. HSV-1 is the main cause of herpes infections on the mouth and lips, including cold sores and fever blisters. It is transmitted through kissing or sharing drinking glasses and utensils.HSV-1 can also cause genital herpes, although HSV-2 is the main cause of genital herpes.HSV-2 is spread through sexual contact.You may be infected with HSV-1 or HSV-2 but not show any symptoms. Often symptoms are triggered by exposure to the sun, fever, menstruation, emotional stress, a weakened immune system, or an illness. There is no cure for herpes, and once you have it, it is likely to come back.ntroduction. Herpes simplex virus (HSV) infections are very common worldwide. HSV-1 is the main cause of herpes infections on the mouth and lips, including cold sores and fever blisters. It is transmitted through kissing or sharing drinking glasses and utensils. |
* Loss: pylate.losses.xtr_primeqa.XTRPrimeQA

### Evaluation Dataset

#### msmarco-10m-triplets

* Dataset: [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets) at [8c5139a](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets/tree/8c5139a245a5997992605792faa49ec12a6eb5f2)
* Size: 2,000 evaluation samples
* Columns: query, positive, and negative
* Approximate statistics based on the first 1000 samples:
  |         | query  | positive | negative |
  |:--------|:-------|:---------|:---------|
  | type    | string | string   | string   |
  | details |        |          |          |
* Samples:
  | query | positive | negative |
  |:------|:---------|:---------|
  | what school district is siefert elementary | Siefert Elementary is a public elementary school located in Milwaukee, WI in the Milwaukee School District. It enrolls 307 students in grades 1st through 12th. Siefert Elementary is the 756th largest public school in Wisconsin and the 40,598th largest nationally. It has 17.5 students to every teacher. | Due to the hazardous road conditions, there will be a two hour delay today, Wednesday, March 2nd, 2016 for Black River HS/MS, Cavendish Town Elementary School, Chester-Andover Elementary School, Green Mountain Union HS, Ludlow Elementary School and Mount Holly School. |
  | what is dst | Daylight Saving Time (DST) is the practice of turning the clock ahead as warmer weather approaches and back as it becomes colder again so that people will have one more hour of daylight in the afternoon and evening during the warmer season of the year. | Franklin Park time zone: UTC-05:00 or CDT. Daylight saving time is in effect in Franklin Park. See recent and expected DST changes in Franklin Park in the table below. |
  | who played sister sister mom | The main cast of Sister, Sister (from left to right), Tia Mowry with Jackée Harry as Tia and Lisa Landry and Tim Reid with Tamera Mowry as Ray and Tamera Campbell. Sister, Sister is an American television sitcom starring fraternal twins Tia and Tamera Mowry. It aired from 1994 to 1999. | The boy, 16, was killed by his sister, 15, on January 5 after she escaped the room her brother had locked her in and shot him in the neck. The older sister asked her younger sister to keep watch as she went outside and cut out the air conditioner of her parents' locked bedroom window to retrieve a pistol. |
* Loss: pylate.losses.xtr_primeqa.XTRPrimeQA

### Training Hyperparameters

#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 196
- `per_device_eval_batch_size`: 196
- `learning_rate`: 3e-05
- `max_grad_norm`: 10.0
- `num_train_epochs`: 0
- `max_steps`: 50000
- `warmup_ratio`: 0.01
- `bf16`: True
- `torch_compile`: True
- `torch_compile_backend`: inductor
- `eval_on_start`: True

#### All Hyperparameters

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 196
- `per_device_eval_batch_size`: 196
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 3e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 10.0
- `num_train_epochs`: 0
- `max_steps`: 50000
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.01
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: True
- `torch_compile_backend`: inductor
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: True
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {}
- `learning_rate_mapping`: {}
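
#### Reading the training budget from the hyperparameters

The hyperparameters listed above imply a training budget that is easy to misread, since `num_train_epochs` is 0 while `max_steps` is 50000 (in the Hugging Face `Trainer`, a positive `max_steps` takes precedence over the epoch count). The sketch below works through that arithmetic in plain Python. It assumes a single training device, which this card does not state, so `NUM_DEVICES` and the helper names are illustrative only. The learning-rate function mirrors the usual linear warmup followed by linear decay and is not taken from the training code.

```python
import math

# Values copied from the hyperparameter list in this card.
MAX_STEPS = 50_000
PER_DEVICE_TRAIN_BATCH_SIZE = 196
GRADIENT_ACCUMULATION_STEPS = 1
WARMUP_RATIO = 0.01
LEARNING_RATE = 3e-5
TRAIN_SAMPLES = 9_998_000

# Assumption (not stated in this card): training ran on one device.
NUM_DEVICES = 1


def samples_seen() -> int:
    """Total training triplets consumed over the whole run."""
    return (
        MAX_STEPS
        * PER_DEVICE_TRAIN_BATCH_SIZE
        * GRADIENT_ACCUMULATION_STEPS
        * NUM_DEVICES
    )


def warmup_steps() -> int:
    """Number of warmup steps implied by warmup_ratio."""
    return math.ceil(MAX_STEPS * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    warmup = warmup_steps()
    if step < warmup:
        return LEARNING_RATE * step / warmup
    return LEARNING_RATE * max(0.0, (MAX_STEPS - step) / (MAX_STEPS - warmup))


if __name__ == "__main__":
    seen = samples_seen()
    print(f"samples seen: {seen:,}")
    print(f"fraction of one pass over the data: {seen / TRAIN_SAMPLES:.4f}")
    print(f"warmup steps: {warmup_steps()}")
    print(f"peak learning rate: {linear_lr(warmup_steps()):.1e}")
```

Under the single-device assumption the run covers slightly less than one pass over the 9,998,000 training triplets. With more devices the number of samples seen scales linearly with `NUM_DEVICES`.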