---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:9998000
- loss:XTRPrimeQA
base_model: BAAI/bge-small-en-v1.5
datasets:
- bclavie/msmarco-10m-triplets
pipeline_tag: sentence-similarity
library_name: PyLate
metrics:
- accuracy
model-index:
- name: PyLate model based on BAAI/bge-small-en-v1.5
  results:
  - task:
      type: col-berttriplet
      name: Col BERTTriplet
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: accuracy
      value: 0.9925000667572021
      name: Accuracy
---

# PyLate model based on BAAI/bge-small-en-v1.5

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) on the [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets) dataset. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors (one vector per token) and can be used for semantic textual similarity and retrieval using the MaxSim operator.
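
For intuition, MaxSim scores a query against a document by taking, for each query token vector, its best dot product with any document token vector, and summing those maxima. The snippet below is a minimal pure-Python sketch of that idea; the `maxsim` helper and the toy 2-dimensional vectors are illustrative assumptions, not PyLate code, which operates on batched tensors.

```python
def maxsim(query_vectors, document_vectors):
    """Sum, over query tokens, of the best dot product with any document token."""
    score = 0.0
    for q in query_vectors:
        # Best-matching document token for this query token.
        best = max(sum(qi * di for qi, di in zip(q, d)) for d in document_vectors)
        score += best
    return score


# Toy token embeddings (hypothetical values, 2 dimensions instead of 128).
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.6, 0.8], [0.0, -1.0]]

print(maxsim(query, document))
```

Here the first query token matches the first document token (dot product 1.0) and the second matches the second (0.8), so the score is their sum.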

## Model Details

### Model Description
- **Model Type:** PyLate model
- **Base model:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) <!-- at revision 5c38ec7c405ec4b44b94cc5a9bb96e735b38267a -->
- **Document Length:** 300 tokens
- **Query Length:** 32 tokens
- **Output Dimensionality:** 128 dimensions
- **Similarity Function:** MaxSim
- **Training Dataset:**
    - [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets)
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->
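
Because the model emits one 128-dimensional vector per token, the size of an embedding grows with the sequence length. The following back-of-the-envelope sketch estimates the raw size of a maximum-length document and query embedding. It assumes uncompressed float32 values (4 bytes each), which is an assumption for illustration only; a PLAID index compresses vectors, so on-disk sizes will differ.

```python
EMBEDDING_DIM = 128      # output dimensionality listed above
DOCUMENT_LENGTH = 300    # maximum document length in tokens
QUERY_LENGTH = 32        # query length in tokens


def embedding_bytes(num_tokens, dim=EMBEDDING_DIM, bytes_per_value=4):
    """Raw size of a (num_tokens x dim) embedding matrix, assuming float32."""
    return num_tokens * dim * bytes_per_value


print(embedding_bytes(DOCUMENT_LENGTH))  # 153600 bytes for a full-length document
print(embedding_bytes(QUERY_LENGTH))     # 16384 bytes for a query
```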

### Model Sources

- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)

### Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 300, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
```

## Usage
First install the PyLate library:

```bash
pip install -U pylate
```

### Retrieval

Use this model with PyLate to index and retrieve documents. The index uses [FastPLAID](https://github.com/lightonai/fast-plaid) for efficient similarity search.

#### Indexing documents

Load the ColBERT model and initialize the PLAID index, then encode and index your documents:

```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="pylate_model_id",
)

# Step 2: Initialize the PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)
```

#### Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the best matches:

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```
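
The returned results can then be post-processed like any Python list. The sketch below assumes the result holds one list per query, each containing dictionaries with `id` and `score` keys, as shown in the PyLate documentation. The `example_scores` values and the `top_ids` helper are hypothetical and only stand in for real output.

```python
# Hypothetical output for two queries; real values come from retriever.retrieve.
example_scores = [
    [{"id": "3", "score": 11.3}, {"id": "1", "score": 4.2}],
    [{"id": "1", "score": 10.8}, {"id": "2", "score": 3.9}],
]


def top_ids(results, min_score=0.0):
    """Return, per query, the ids scoring at least min_score, best first."""
    return [
        [
            hit["id"]
            for hit in sorted(hits, key=lambda h: h["score"], reverse=True)
            if hit["score"] >= min_score
        ]
        for hits in results
    ]


print(top_ids(example_scores, min_score=4.0))
```

A score threshold like `min_score` is a design choice for this example; MaxSim scores are not normalized, so a sensible cut-off depends on the query length and the data.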

### Reranking
If you only want to use the ColBERT model to rerank the output of your first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="pylate_model_id",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### ColBERT Triplet

* Evaluated with <code>pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator</code>

| Metric       | Value      |
|:-------------|:-----------|
| **accuracy** | **0.9925** |
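
As a rough illustration of what this metric measures, the sketch below computes triplet accuracy as the fraction of triplets in which the positive document scores higher than the negative one for the same query. This mirrors the usual definition of triplet accuracy and is an assumption about the evaluator's behaviour, not its actual code; the `triplet_accuracy` helper and the scores are made up.

```python
def triplet_accuracy(positive_scores, negative_scores):
    """Fraction of triplets whose positive document outscores the negative one."""
    pairs = list(zip(positive_scores, negative_scores))
    correct = sum(1 for pos, neg in pairs if pos > neg)
    return correct / len(pairs)


# Hypothetical MaxSim scores for four (query, positive, negative) triplets.
positive_scores = [12.1, 9.4, 7.8, 10.2]
negative_scores = [8.3, 9.9, 5.1, 6.0]

print(triplet_accuracy(positive_scores, negative_scores))  # 0.75
```

Under that definition, and if the reported value was computed on a 2,000-sample evaluation set, an accuracy of 0.9925 would correspond to 1,985 correctly ordered triplets.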

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### msmarco-10m-triplets

* Dataset: [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets) at [8c5139a](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets/tree/8c5139a245a5997992605792faa49ec12a6eb5f2)
* Size: 9,998,000 training samples
* Columns: <code>query</code>, <code>positive</code>, and <code>negative</code>
* Approximate statistics based on the first 1000 samples:
  |         | query                                                                             | positive                                                                          | negative                                                                          |
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
  | type    | string                                                                            | string                                                                            | string                                                                            |
  | details | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> |
* Samples:
  | query                                                  | positive                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | negative                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
  |:-------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  | <code>what can black mold do</code>                    | <code>Two of the better-known toxic molds include Stachybotrys chartarum (black mold), which can cause everything from headaches to cancer, and Aspergillus, which can cause severe lung infections, or progress to whole-body infections. Mold is particularly dangerous for infants and children.</code>                                                                                                                                                                                                                                                                                                                                              | <code>Learn more about mold and health effects in a A Brief Guide to Mold, Moisture and Your Home The entire booklet: A Brief Guide to Mold, Moisture and Your Home (Web Version) A Brief Guide to Mold, Moisture and Your Home (Print Version) Top of Page</code>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
  | <code>what factors increase a population growth</code> | <code>Quick Answer. Factors that cause population growth include increased food production, improved health care services, immigration and high birth rate. These factors have led to overpopulation, which has more negative effects than positive impacts.</code>                                                                                                                                                                                                                                                                                                                                                                                     | <code>The increase in crawfish size during molting, and the length of time between molts, can vary greatly and are affected by factors such as water temperature, water quality, food quality and quantity, population density, oxygen levels and to a lesser extent by genetic influences.</code>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
  | <code>is herpes spread through saliva</code>           | <code>This type of herpes is transmittable through contact with the saliva or the herpes blisters (cold sores) of an infected person. This said - yes, it is entirely possible to get herpes from kissing.It is also possible, though less common, that herpes type 1 might spread to genital regions through oral sex.enital herpes can spread to the mouth through oral sex. Once you have contracted either type of herpes virus you will be a carrier for life. However, both types tend to become less severe with the passing of time and though they may still be contagious to others, many times people stop having breakouts at all.</code> | <code>Introduction. Herpes simplex virus (HSV) infections are very common worldwide. HSV-1 is the main cause of herpes infections on the mouth and lips, including cold sores and fever blisters. It is transmitted through kissing or sharing drinking glasses and utensils.HSV-1 can also cause genital herpes, although HSV-2 is the main cause of genital herpes.HSV-2 is spread through sexual contact.You may be infected with HSV-1 or HSV-2 but not show any symptoms. Often symptoms are triggered by exposure to the sun, fever, menstruation, emotional stress, a weakened immune system, or an illness. There is no cure for herpes, and once you have it, it is likely to come back.ntroduction. Herpes simplex virus (HSV) infections are very common worldwide. HSV-1 is the main cause of herpes infections on the mouth and lips, including cold sores and fever blisters. It is transmitted through kissing or sharing drinking glasses and utensils.</code> |
* Loss: <code>pylate.losses.xtr_primeqa.XTRPrimeQA</code>

### Evaluation Dataset

#### msmarco-10m-triplets

* Dataset: [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets) at [8c5139a](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets/tree/8c5139a245a5997992605792faa49ec12a6eb5f2)
* Size: 2,000 evaluation samples
* Columns: <code>query</code>, <code>positive</code>, and <code>negative</code>
* Approximate statistics based on the first 1000 samples:
  |         | query                                                                             | positive                                                                          | negative                                                                          |
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
  | type    | string                                                                            | string                                                                            | string                                                                            |
  | details | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> |
* Samples:
  | query                                                   | positive                                                                                                                                                                                                                                                                                                                    | negative                                                                                                                                                                                                                                                                                                                        |
  |:--------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  | <code>what school district is siefert elementary</code> | <code>Siefert Elementary is a public elementary school located in Milwaukee, WI in the Milwaukee School District. It enrolls 307 students in grades 1st through 12th. Siefert Elementary is the 756th largest public school in Wisconsin and the 40,598th largest nationally. It has 17.5 students to every teacher.</code> | <code>Due to the hazardous road conditions, there will be a two hour delay today, Wednesday, March 2nd, 2016 for Black River HS/MS, Cavendish Town Elementary School, Chester-Andover Elementary School, Green Mountain Union HS, Ludlow Elementary School and Mount Holly School.</code>                                       |
  | <code>what is dst</code>                                | <code>Daylight Saving Time (DST) is the practice of turning the clock ahead as warmer weather approaches and back as it becomes colder again so that people will have one more hour of daylight in the afternoon and evening during the warmer season of the year.</code>                                                   | <code>Franklin Park time zone: UTC-05:00 or CDT. Daylight saving time is in effect in Franklin Park. See recent and expected DST changes in Franklin Park in the table below.</code>                                                                                                                                            |
  | <code>who played sister sister mom</code>               | <code>The main cast of Sister, Sister (from left to right), Tia Mowry with Jackée Harry as Tia and Lisa Landry and Tim Reid with Tamera Mowry as Ray and Tamera Campbell. Sister, Sister is an American television sitcom starring fraternal twins Tia and Tamera Mowry. It aired from 1994 to 1999.</code>                 | <code>The boy, 16, was killed by his sister, 15, on January 5 after she escaped the room her brother had locked her in and shot him in the neck. The older sister asked her younger sister to keep watch as she went outside and cut out the air conditioner of her parents' locked bedroom window to retrieve a pistol.</code> |
* Loss: <code>pylate.losses.xtr_primeqa.XTRPrimeQA</code>

### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 196
- `per_device_eval_batch_size`: 196
- `learning_rate`: 3e-05
- `max_grad_norm`: 10.0
- `num_train_epochs`: 0
- `max_steps`: 50000
- `warmup_ratio`: 0.01
- `bf16`: True
- `torch_compile`: True
- `torch_compile_backend`: inductor
- `eval_on_start`: True
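
To make these values concrete, the sketch below derives the number of warmup steps and the learning rate at a given step. It assumes the `linear` scheduler listed under All Hyperparameters behaves as a linear warmup followed by a linear decay to zero, with warmup steps rounded up from `max_steps * warmup_ratio`. The `linear_schedule_lr` helper is an illustrative re-implementation, not the trainer's code.

```python
import math

LEARNING_RATE = 3e-05
MAX_STEPS = 50_000
WARMUP_RATIO = 0.01
PER_DEVICE_BATCH_SIZE = 196

# Warmup length implied by the ratio (assumed to be rounded up).
warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)


def linear_schedule_lr(step):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, MAX_STEPS - step)
    return LEARNING_RATE * remaining / (MAX_STEPS - warmup_steps)


print(warmup_steps)                       # 500
print(MAX_STEPS * PER_DEVICE_BATCH_SIZE)  # 9800000 samples seen per device
print(linear_schedule_lr(25_000))         # learning rate around mid-training
```

Since `max_steps` is set, it determines the training length rather than `num_train_epochs`. On a single device, 50,000 steps at batch size 196 cover 9,800,000 samples, slightly fewer than the 9,998,000 training samples; the number of devices used is not stated here.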

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 196
- `per_device_eval_batch_size`: 196
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 3e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 10.0
- `num_train_epochs`: 0
- `max_steps`: 50000
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.01
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: True
- `torch_compile_backend`: inductor
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: True
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {}
- `learning_rate_mapping`: {}

</details>