---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:9959
- loss:CachedContrastive
pipeline_tag: sentence-similarity
library_name: PyLate
---
# PyLate
This is a [PyLate](https://github.com/lightonai/pylate) model. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors (one vector per token) and can be used for semantic textual similarity using the MaxSim operator.
## Model Details
### Model Description
- **Model Type:** PyLate model
- **Document Length:** 512 tokens
- **Query Length:** 128 tokens
- **Output Dimensionality:** 128 dimensions
- **Similarity Function:** MaxSim
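
For intuition, MaxSim scores a query against a document by taking, for each query token vector, its highest dot product with any document token vector, and then summing those maxima. The snippet below is a minimal pure-Python sketch of that operator on toy 2-dimensional vectors. The `maxsim` helper and the toy values are illustrative assumptions, not part of the PyLate API, which computes the same quantity on batched tensors.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vectors, document_vectors):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(
        max(dot(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Toy token embeddings (hypothetical values; real embeddings are 128-dimensional)
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_document = [[1.0, 0.0], [0.5, 0.5]]

toy_score = maxsim(toy_query, toy_document)
print(toy_score)
```

Each query token contributes independently, so a document scores highly only when it contains a good match for most query tokens.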
### Model Sources
- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
### Full Model Architecture
```
ColBERT(
  (0): Transformer({'max_seq_length': 127, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```
## Usage
First install the PyLate library:
```bash
pip install -U pylate
```
### Retrieval
PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
#### Indexing documents
First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
# `pylate_model_id` is a placeholder: set it to this model's Hugging Face ID or a local path
model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```
Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```
#### Retrieving top-k documents for queries
Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the best matches:
```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```
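
As far as the PyLate documentation describes it, `retriever.retrieve` returns one list per query, each holding dictionaries with an `id` and a `score`. Assuming that shape, the sketch below shows one way to pair each query with its best hits. The `top_hits` helper and the `example_scores` values are hypothetical and only stand in for the real output.

```python
def top_hits(results, queries, k=3):
    """Map each query to its k best (id, score) pairs, highest score first."""
    summary = {}
    for query, hits in zip(queries, results):
        ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)
        summary[query] = [(hit["id"], hit["score"]) for hit in ranked[:k]]
    return summary


# Made-up results in the assumed shape: one list of {"id", "score"} dicts per query
example_queries = ["query for document 3", "query for document 1"]
example_scores = [
    [{"id": "3", "score": 11.2}, {"id": "1", "score": 7.5}, {"id": "2", "score": 6.9}],
    [{"id": "2", "score": 5.1}, {"id": "1", "score": 10.8}],
]

best = top_hits(example_scores, example_queries, k=2)
for query, hits in best.items():
    print(query, hits)
```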
### Reranking
If you only want to use the ColBERT model to rerank the results of your first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:
```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

# One list of candidate documents per query
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

# Ids aligned with the candidate documents above
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# `pylate_model_id` is a placeholder: set it to this model's Hugging Face ID or a local path
model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
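
Conceptually, reranking scores every candidate document against its query with MaxSim and sorts the candidates by that score. The pure-Python sketch below illustrates the idea on toy vectors. It is not PyLate's implementation: `rerank_sketch`, `maxsim`, and all the values are hypothetical, and the `(id, score)` tuple output is chosen for this example only.

```python
def maxsim(query_vectors, document_vectors):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in document_vectors)
        for q in query_vectors
    )


def rerank_sketch(documents_ids, queries_embeddings, documents_embeddings):
    """For each query, return its candidates as (id, score) pairs, best first."""
    reranked = []
    for ids, query_vectors, candidates in zip(
        documents_ids, queries_embeddings, documents_embeddings
    ):
        scored = [
            (doc_id, maxsim(query_vectors, doc_vectors))
            for doc_id, doc_vectors in zip(ids, candidates)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        reranked.append(scored)
    return reranked


# One query with two token vectors, and two candidate documents (toy 2-d values)
toy_ids = [[1, 2]]
toy_queries = [[[1.0, 0.0], [0.0, 1.0]]]
toy_documents = [[
    [[1.0, 0.0]],              # document 1 matches only the first query token
    [[1.0, 0.0], [0.0, 1.0]],  # document 2 matches both query tokens
]]

toy_ranking = rerank_sketch(toy_ids, toy_queries, toy_documents)
print(toy_ranking)
```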
## Training Details
### Training Dataset
#### Unnamed Dataset
* Size: 9,959 training samples
* Columns: query, positive, and negative
* Approximate statistics based on the first 1000 samples:
| | query | positive | negative |
|:--------|:-------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
| type | string | string | string |
* Samples:
| query | positive | negative |
|:------|:---------|:---------|
| Here is the step-by-step reasoning to identify the correct code solution for reading an OVF descriptor file with robust error handling.
### 1. Identify the Kind of Code
The code required is a **Python utility function** (or a small script) that performs **file I/O operations**. Specifically, it needs to:
* Accept a file path as an input argument.
* Attempt to open and read the contents of a file (likely a text-based XML or text file, as OVF descriptors are XML).
* Implement **exception handling** to gracefully manage scenarios where the file does not exist or cannot be read due to permissions or corruption.
* Return the file content (string) or a parsed object (if XML parsing is included), or raise a specific, user-friendly error.
### 2. Relevant Programming Concepts & Patterns
* **File I/O and Context Managers**: The code must use the `with open(...)` statement. This ensures the file handle is properly closed even if an error occurs during reading, preventing resource leak... | def get_ovf_descriptor(ovf_path):
if path.exists(ovf_path):
with open(ovf_path, 'r') as f:
try:
ovfd = f.read()
f.close()
return ovfd
except:
print "Could not read file: %s" % ovf_path
exit(1) | def read_vnf_descriptor(vnfd_id, vnf_vendor, vnf_version):
if _catalog_backend is not None:
return _catalog_backend.read_vnf_descriptor(vnfd_id, vnf_vendor,
vnf_version)
return None |
| Here is the step-by-step reasoning to identify the correct code solution for adding a custom 'Settings' link to the WordPress plugin action links.
### 1. What kind of code would answer this query?
The solution requires **PHP code** specifically designed for **WordPress plugin development**. It will not be a JavaScript snippet or a CSS style. The code must be a function that hooks into the WordPress plugin management system, likely using the `plugin_action_links_{plugin_basename}` filter.
### 2. Relevant Programming Concepts, Patterns, and Algorithms
* **WordPress Hooks (Filters):** The core mechanism is the `apply_filters()` system. Specifically, the dynamic filter `plugin_action_links_{plugin_basename}` allows developers to modify the array of action links (Activate, Deactivate, Edit, Delete, Settings) for a specific plugin.
* **Array Manipulation:** The action links are stored as an associative array where the key is the link text (or ID) and the value is the URL. The code must... | public
function plugin_add_settings_link(
$links
) {
$settings_link_html = '' . __( 'Settings', 'link-linkid' ) . '';
array_unshift( $links, $settings_link_html );
return $links;
} | function plugin_settings_link( $links){
$settings_link = 'Settings';
array_unshift($links, $settings_link);
return $links;
} |
| ### Reasoning Chain
1. **Identify the Goal**: The user wants to parse a JSON Web Token (JWT) in Go specifically to read the payload (claims) *without* performing the cryptographic signature verification. This is often needed for debugging, logging, or when the token is trusted from a different source (e.g., a trusted internal service) and signature validation is handled elsewhere.
2. **Analyze the JWT Structure**: A JWT consists of three parts: `header.payload.signature`. The `payload` is a JSON object containing the claims. To extract claims without verification, we need to:
* Decode the Base64URL-encoded payload.
* Unmarshal the JSON into a Go struct or `map[string]interface{}`.
* **Crucially**, skip the step where the library checks the signature against the provided key.
3. **Select the Library**: The standard library for JWT in Go is `github.com/golang-jwt/jwt/v5` (or the older `v4`). The older `jwt-go` library is deprecated.
4. **Determine the Implementa... | func ParseInsecure(token string, audience []string) (*SVID, error) {
return parse(token, audience, func(tok *jwt.JSONWebToken, td spiffeid.TrustDomain) (map[string]interface{}, error) {
// Obtain the token claims insecurely, i.e. without signature verification
claimsMap := make(map[string]interface{})
if err := tok.UnsafeClaimsWithoutVerification(&claimsMap); err != nil {
return nil, jwtsvidErr.New("unable to get claims from token: %v", err)
}
return claimsMap, nil
})
} | func ParseAndValidate(token string, bundles jwtbundle.Source, audience []string) (*SVID, error) {
return parse(token, audience, func(tok *jwt.JSONWebToken, trustDomain spiffeid.TrustDomain) (map[string]interface{}, error) {
// Obtain the key ID from the header
keyID := tok.Headers[0].KeyID
if keyID == "" {
return nil, jwtsvidErr.New("token header missing key id")
}
// Get JWT Bundle
bundle, err := bundles.GetJWTBundleForTrustDomain(trustDomain)
if err != nil {
return nil, jwtsvidErr.New("no bundle found for trust domain %q", trustDomain)
}
// Find JWT authority using the key ID from the token header
authority, ok := bundle.FindJWTAuthority(keyID)
if !ok {
return nil, jwtsvidErr.New("no JWT authority %q found for trust domain %q", keyID, trustDomain)
}
// Obtain and verify the token claims using the obtained JWT authority
claimsMap := make(map[string]interface{})
if err := tok.Claims(authority, &claimsMap); err != nil {
return nil, jwtsvidEr... |
* Loss: pylate.losses.cached_contrastive.CachedContrastive
### Training Hyperparameters
#### Non-Default Hyperparameters
- `per_device_train_batch_size`: 256
- `per_device_eval_batch_size`: 256
- `learning_rate`: 5e-06
- `warmup_ratio`: 0.05
- `bf16`: True
- `tf32`: True
- `dataloader_num_workers`: 8
- `dataloader_prefetch_factor`: 4
- `dataloader_persistent_workers`: True
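
To put the batch size and warmup ratio above in context, the following sketch estimates the number of optimizer steps for the 9,959 training samples. It rests on assumptions that are not stated in this card: a single device, no gradient accumulation, the last partial batch being kept, and warmup steps being rounded up. The number of epochs is not among the values listed above, so it is left as a parameter, and the helper names are hypothetical.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, num_devices=1):
    """Optimizer steps per epoch, keeping the last partial batch (assumption)."""
    return math.ceil(num_samples / (per_device_batch_size * num_devices))


def warmup_steps(total_steps, warmup_ratio):
    """Warmup steps for a given ratio, rounded up (assumption)."""
    return math.ceil(total_steps * warmup_ratio)


num_samples = 9959   # dataset size from this card
batch_size = 256     # per_device_train_batch_size from this card
warmup_ratio = 0.05  # warmup_ratio from this card

per_epoch = steps_per_epoch(num_samples, batch_size)
for num_epochs in (1, 3):  # hypothetical epoch counts, for illustration only
    total = per_epoch * num_epochs
    print(num_epochs, total, warmup_steps(total, warmup_ratio))
```

With a batch size this large relative to the dataset, an epoch is only a few dozen steps, so the warmup phase covers just a handful of updates.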
#### All Hyperparameters