---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:9959
- loss:CachedContrastive
pipeline_tag: sentence-similarity
library_name: PyLate
---

# PyLate

This is a trained [PyLate](https://github.com/lightonai/pylate) model. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

## Model Details

### Model Description
- **Model Type:** PyLate model
- **Document Length:** 512 tokens
- **Query Length:** 128 tokens
- **Output Dimensionality:** 128 dimensions
- **Similarity Function:** MaxSim

### Model Sources

- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)

### Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 127, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```

## Usage

First install the PyLate library:

```bash
pip install -U pylate
```

### Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index uses the Voyager HNSW index to handle document embeddings efficiently and enable fast retrieval.

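As background for the retrieval and reranking steps, the MaxSim operator named in the model description can be sketched in a few lines of NumPy. This is a minimal illustration, not PyLate's implementation: it assumes the per-token embeddings are L2-normalized so that a dot product equals cosine similarity, and it ignores batching and padding masks. The helper names `maxsim` and `l2_normalize` are hypothetical.

```python
import numpy as np


def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    """Scale each token vector to unit length (assumption: scoring uses cosine similarity)."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)


def maxsim(query_embeddings: np.ndarray, document_embeddings: np.ndarray) -> float:
    """Late-interaction score between one query and one document.

    For every query token, take the similarity of its best-matching
    document token, then sum those maxima over the query tokens.
    """
    # Shape: (num_query_tokens, num_document_tokens)
    similarities = query_embeddings @ document_embeddings.T
    return float(similarities.max(axis=1).sum())


# Random stand-ins for model outputs: 4 query tokens and 10 document tokens,
# each a 128-dimensional vector as produced by this model's Dense layer.
rng = np.random.default_rng(0)
query = l2_normalize(rng.normal(size=(4, 128)))
document = l2_normalize(rng.normal(size=(10, 128)))

score = maxsim(query, document)
# Scoring a query against itself gives one perfect match per token.
self_score = maxsim(query, query)
```

With unit-length vectors the score is bounded above by the number of query tokens, which is why `self_score` equals the query length here.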
#### Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
# `pylate_model_id` is a placeholder: set it to this model's Hugging Face ID or local path
model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can reuse it later by loading it:

```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```

#### Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the best matches:

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```

### Reranking

If you only want to use the ColBERT model to rerank the results of your first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# `pylate_model_id` is a placeholder: set it to this model's Hugging Face ID or local path
model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 9,959 training samples
* Columns: query, positive, and negative
* Approximate statistics based on the first 1000 samples:

|         | query  | positive | negative |
|:--------|:-------|:---------|:---------|
| type    | string | string   | string   |
| details |        |          |          |

* Samples:

| query | positive | negative |
|:------|:---------|:---------|
| Here is the step-by-step reasoning to identify the correct code solution for reading an OVF descriptor file with robust error handling.

### 1. Identify the Kind of Code
The code required is a **Python utility function** (or a small script) that performs **file I/O operations**. Specifically, it needs to:
* Accept a file path as an input argument.
* Attempt to open and read the contents of a file (likely a text-based XML or text file, as OVF descriptors are XML).
* Implement **exception handling** to gracefully manage scenarios where the file does not exist or cannot be read due to permissions or corruption.
* Return the file content (string) or a parsed object (if XML parsing is included), or raise a specific, user-friendly error.

### 2. Relevant Programming Concepts & Patterns
* **File I/O and Context Managers**: The code must use the `with open(...)` statement. This ensures the file handle is properly closed even if an error occurs during reading, preventing resource leak...
| def get_ovf_descriptor(ovf_path):
if path.exists(ovf_path):
with open(ovf_path, 'r') as f:
try:
ovfd = f.read()
f.close()
return ovfd
except:
print "Could not read file: %s" % ovf_path
exit(1)
| def read_vnf_descriptor(vnfd_id, vnf_vendor, vnf_version):
if _catalog_backend is not None:
return _catalog_backend.read_vnf_descriptor(vnfd_id, vnf_vendor,
vnf_version)
return None
| | Here is the step-by-step reasoning to identify the correct code solution for adding a custom 'Settings' link to the WordPress plugin action links.

### 1. What kind of code would answer this query?
The solution requires **PHP code** specifically designed for **WordPress plugin development**. It will not be a JavaScript snippet or a CSS style. The code must be a function that hooks into the WordPress plugin management system, likely using the `plugin_action_links_{plugin_basename}` filter.

### 2. Relevant Programming Concepts, Patterns, and Algorithms
* **WordPress Hooks (Filters):** The core mechanism is the `apply_filters()` system. Specifically, the dynamic filter `plugin_action_links_{plugin_basename}` allows developers to modify the array of action links (Activate, Deactivate, Edit, Delete, Settings) for a specific plugin.
* **Array Manipulation:** The action links are stored as an associative array where the key is the link text (or ID) and the value is the URL. The code must...
| public
function plugin_add_settings_link(
$links
) {
$settings_link_html = '' . __( 'Settings', 'link-linkid' ) . '';
array_unshift( $links, $settings_link_html );

return $links;
}
| function plugin_settings_link( $links){
$settings_link = 'Settings';
array_unshift($links, $settings_link);
return $links;
}
| | ### Reasoning Chain

1. **Identify the Goal**: The user wants to parse a JSON Web Token (JWT) in Go specifically to read the payload (claims) *without* performing the cryptographic signature verification. This is often needed for debugging, logging, or when the token is trusted from a different source (e.g., a trusted internal service) and signature validation is handled elsewhere.

2. **Analyze the JWT Structure**: A JWT consists of three parts: `header.payload.signature`. The `payload` is a JSON object containing the claims. To extract claims without verification, we need to:
* Decode the Base64URL-encoded payload.
* Unmarshal the JSON into a Go struct or `map[string]interface{}`.
* **Crucially**, skip the step where the library checks the signature against the provided key.

3. **Select the Library**: The standard library for JWT in Go is `github.com/golang-jwt/jwt/v5` (or the older `v4`). The older `jwt-go` library is deprecated.

4. **Determine the Implementa...
| func ParseInsecure(token string, audience []string) (*SVID, error) {
return parse(token, audience, func(tok *jwt.JSONWebToken, td spiffeid.TrustDomain) (map[string]interface{}, error) {
// Obtain the token claims insecurely, i.e. without signature verification
claimsMap := make(map[string]interface{})
if err := tok.UnsafeClaimsWithoutVerification(&claimsMap); err != nil {
return nil, jwtsvidErr.New("unable to get claims from token: %v", err)
}

return claimsMap, nil
})
}
| func ParseAndValidate(token string, bundles jwtbundle.Source, audience []string) (*SVID, error) {
return parse(token, audience, func(tok *jwt.JSONWebToken, trustDomain spiffeid.TrustDomain) (map[string]interface{}, error) {
// Obtain the key ID from the header
keyID := tok.Headers[0].KeyID
if keyID == "" {
return nil, jwtsvidErr.New("token header missing key id")
}

// Get JWT Bundle
bundle, err := bundles.GetJWTBundleForTrustDomain(trustDomain)
if err != nil {
return nil, jwtsvidErr.New("no bundle found for trust domain %q", trustDomain)
}

// Find JWT authority using the key ID from the token header
authority, ok := bundle.FindJWTAuthority(keyID)
if !ok {
return nil, jwtsvidErr.New("no JWT authority %q found for trust domain %q", keyID, trustDomain)
}

// Obtain and verify the token claims using the obtained JWT authority
claimsMap := make(map[string]interface{})
if err := tok.Claims(authority, &claimsMap); err != nil {
return nil, jwtsvidEr...
|

* Loss: `pylate.losses.cached_contrastive.CachedContrastive`

### Training Hyperparameters

#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 256
- `per_device_eval_batch_size`: 256
- `learning_rate`: 5e-06
- `warmup_ratio`: 0.05
- `bf16`: True
- `tf32`: True
- `dataloader_num_workers`: 8
- `dataloader_prefetch_factor`: 4
- `dataloader_persistent_workers`: True

#### All Hyperparameters

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 256
- `per_device_eval_batch_size`: 256
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-06
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 3
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.05
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: True
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 8
- `dataloader_prefetch_factor`: 4
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: True
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional

### Training Logs

| Epoch  | Step | Training Loss |
|:------:|:----:|:-------------:|
| 0.0256 | 1    | 2.3632        |
| 0.0513 | 2    | 2.3367        |
| 0.0769 | 3    | 2.448         |
| 0.1026 | 4    | 2.4189        |
| 0.1282 | 5    | 2.1217        |
| 0.1538 | 6    | 2.1491        |
| 0.1795 | 7    | 1.9582        |
| 0.2051 | 8    | 1.9204        |
| 0.2308 | 9    | 1.6757        |
| 0.2564 | 10   | 1.4951        |
| 0.2821 | 11   | 1.3773        |
| 0.3077 | 12   | 1.1778        |
| 0.3333 | 13   | 1.088         |
| 0.3590 | 14   | 1.0256        |
| 0.3846 | 15   | 1.0174        |
| 0.4103 | 16   | 0.8424        |
| 0.4359 | 17   | 0.9435        |
| 0.4615 | 18   | 0.854         |
| 0.4872 | 19   | 0.8846        |
| 0.5128 | 20   | 0.9211        |
| 0.5385 | 21   | 0.7185        |
| 0.5641 | 22   | 0.8183        |
| 0.5897 | 23   | 0.7488        |
| 0.6154 | 24   | 0.696         |
| 0.6410 | 25   | 0.6371        |
| 0.6667 | 26   | 0.6456        |
| 0.6923 | 27   | 0.6259        |
| 0.7179 | 28   | 0.5277        |
| 0.7436 | 29   | 0.7078        |
| 0.7692 | 30   | 0.7901        |
| 0.7949 | 31   | 0.6332        |
| 0.8205 | 32   | 0.4658        |
| 0.8462 | 33   | 0.6804        |
| 0.8718 | 34   | 0.6232        |
| 0.8974 | 35   | 0.611         |
| 0.9231 | 36   | 0.6147        |
| 0.9487 | 37   | 0.5991        |
| 0.9744 | 38   | 0.6732        |
| 1.0    | 39   | 0.5281        |
| 1.0256 | 40   | 0.5556        |
| 1.0513 | 41   | 0.4985        |
| 1.0769 | 42   | 0.5527        |
| 1.1026 | 43   | 0.4919        |
| 1.1282 | 44   | 0.5443        |
| 1.1538 | 45   | 0.6086        |
| 1.1795 | 46   | 0.5949        |
| 1.2051 | 47   | 0.5734        |
| 1.2308 | 48   | 0.6677        |
| 1.2564 | 49   | 0.5189        |
| 1.2821 | 50   | 0.666         |
| 1.3077 | 51   | 0.4927        |
| 1.3333 | 52   | 0.5356        |
| 1.3590 | 53   | 0.5792        |
| 1.3846 | 54   | 0.4162        |
| 1.4103 | 55   | 0.5923        |
| 1.4359 | 56   | 0.4905        |
| 1.4615 | 57   | 0.4645        |
| 1.4872 | 58   | 0.7121        |
| 1.5128 | 59   | 0.5809        |
| 1.5385 | 60   | 0.4401        |
| 1.5641 | 61   | 0.458         |
| 1.5897 | 62   | 0.4659        |
| 1.6154 | 63   | 0.5638        |
| 1.6410 | 64   | 0.4875        |
| 1.6667 | 65   | 0.4903        |
| 1.6923 | 66   | 0.5373        |
| 1.7179 | 67   | 0.3934        |
| 1.7436 | 68   | 0.5693        |
| 1.7692 | 69   | 0.4524        |
| 1.7949 | 70   | 0.4949        |
| 1.8205 | 71   | 0.466         |
| 1.8462 | 72   | 0.4837        |
| 1.8718 | 73   | 0.5391        |
| 1.8974 | 74   | 0.5266        |
| 1.9231 | 75   | 0.4747        |
| 1.9487 | 76   | 0.4502        |
| 1.9744 | 77   | 0.5449        |
| 2.0    | 78   | 0.4349        |
| 2.0256 | 79   | 0.4566        |
| 2.0513 | 80   | 0.482         |
| 2.0769 | 81   | 0.5553        |
| 2.1026 | 82   | 0.4606        |
| 2.1282 | 83   | 0.4938        |
| 2.1538 | 84   | 0.4303        |
| 2.1795 | 85   | 0.4068        |
| 2.2051 | 86   | 0.4398        |
| 2.2308 | 87   | 0.4359        |
| 2.2564 | 88   | 0.4599        |
| 2.2821 | 89   | 0.4835        |
| 2.3077 | 90   | 0.404         |
| 2.3333 | 91   | 0.5046        |
| 2.3590 | 92   | 0.4678        |
| 2.3846 | 93   | 0.3891        |
| 2.4103 | 94   | 0.435         |
| 2.4359 | 95   | 0.5688        |
| 2.4615 | 96   | 0.4319        |
| 2.4872 | 97   | 0.4667        |
| 2.5128 | 98   | 0.5857        |
| 2.5385 | 99   | 0.5194        |
| 2.5641 | 100  | 0.4741        |
| 2.5897 | 101  | 0.5226        |
| 2.6154 | 102  | 0.4168        |
| 2.6410 | 103  | 0.4488        |
| 2.6667 | 104  | 0.4922        |
| 2.6923 | 105  | 0.4309        |
| 2.7179 | 106  | 0.4832        |
| 2.7436 | 107  | 0.4496        |
| 2.7692 | 108  | 0.5548        |
| 2.7949 | 109  | 0.4355        |
| 2.8205 | 110  | 0.4305        |
| 2.8462 | 111  | 0.3955        |
| 2.8718 | 112  | 0.2876        |
| 2.8974 | 113  | 0.4263        |
| 2.9231 | 114  | 0.4874        |
| 2.9487 | 115  | 0.4602        |
| 2.9744 | 116  | 0.4725        |
| 3.0    | 117  | 0.5401        |

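The step counts in the training log follow from the dataset size and batch size listed in this card. The short calculation below reproduces them as a sanity check. It assumes a single training device (`num_devices = 1` is not stated in the card, but it is consistent with the logged epoch fractions) and assumes that warmup steps are obtained by rounding `total_steps * warmup_ratio` up to the next integer.

```python
import math

# Values taken from this model card
dataset_size = 9959
per_device_train_batch_size = 256
gradient_accumulation_steps = 1
num_train_epochs = 3
warmup_ratio = 0.05

# Assumption: one device; the card does not state the device count
num_devices = 1

effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

# dataloader_drop_last is False, so the final partial batch still counts as a step
steps_per_epoch = math.ceil(dataset_size / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Assumption: warmup length is rounded up from the ratio
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Epoch fraction reported for the first logged step
first_step_epoch = round(1 / steps_per_epoch, 4)

print(steps_per_epoch, total_steps, warmup_steps, first_step_epoch)
# → 39 117 6 0.0256
```

Under these assumptions the learning rate ramps up over roughly the first 6 of 117 steps, which matches the log ending at step 117 and epoch 3.0.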
### Framework Versions

- Python: 3.12.3
- Sentence Transformers: 4.0.2
- PyLate: 1.2.0
- Transformers: 4.48.2
- PyTorch: 2.10.0a0+a36e1d39eb.nv26.01.42222806
- Accelerate: 1.13.0
- Datasets: 4.4.2
- Tokenizers: 0.21.4

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}
```

#### PyLate

```bibtex
@misc{PyLate,
    title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author={Chaffin, Antoine and Sourty, Raphaël},
    url={https://github.com/lightonai/pylate},
    year={2024}
}
```

#### CachedContrastive

```bibtex
@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```