---
tags:
- ColBERT
- PyLate
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:909188
- loss:Contrastive
base_model: EuroBERT/EuroBERT-210m
datasets:
- baconnier/rag-comprehensive-triplets
pipeline_tag: sentence-similarity
library_name: PyLate
metrics:
- accuracy
model-index:
- name: PyLate model based on EuroBERT/EuroBERT-210m
  results:
  - task:
      type: col-berttriplet
      name: Col BERTTriplet
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: accuracy
      value: 0.9848384857177734
      name: Accuracy
license: apache-2.0
language:
- es
- en
---
[<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/67b2f4e49edebc815a3a4739/R1g957j1aBbx8lhZbWmxw.jpeg" width="200"/>](https://huggingface.co/fjmgAI)

## Fine-Tuned Model

**`fjmgAI/col1-210M-EuroBERT`**

## Base Model
**`EuroBERT/EuroBERT-210m`**

## Fine-Tuning Method
Fine-tuning was performed using **[PyLate](https://github.com/lightonai/pylate)**, with contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. The resulting model maps sentences and paragraphs to sequences of 128-dimensional dense vectors, one per token, and can be used for semantic textual similarity with the MaxSim operator.
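
The MaxSim operator can be written compactly. In the notation below (introduced here for illustration, not taken from the PyLate documentation), $q_i$ and $d_j$ are the 128-dimensional token embeddings of a query $Q$ and a document $D$:

$$
s(Q, D) = \sum_{i=1}^{|Q|} \max_{1 \le j \le |D|} q_i^{\top} d_j
$$

Each query token is matched to its most similar document token, and the per-token maxima are added up. The Usage example in this card divides this sum by $|Q|$, which does not change the ranking of documents for a given query.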

## Dataset
**[`baconnier/rag-comprehensive-triplets`](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets)**

### Description
Only the Spanish-language portion of this dataset was used for fine-tuning. After filtering, it contains **303,000 examples** in triplet form, intended for retrieval-augmented generation (RAG) use cases.

## Fine-Tuning Details
- The model was trained with PyLate's **contrastive loss**.
- It was evaluated with `pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator`:

| Metric       | Value      |
|:-------------|:-----------|
| **accuracy** | **0.9848** |
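
To make the accuracy figure above concrete: as we understand the triplet evaluator, a triplet counts as correct when the anchor scores higher against its positive document than against its negative one, and accuracy is the fraction of correct triplets. The following is a minimal, dependency-free sketch of that idea, not PyLate's implementation. The helper names `maxsim` and `triplet_accuracy` and the 2-dimensional toy embeddings are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
TokenEmbeddings = List[Vector]  # one vector per token


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query: TokenEmbeddings, doc: TokenEmbeddings) -> float:
    """Sum, over query tokens, of the best-matching document token score."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def triplet_accuracy(
    triplets: List[Tuple[TokenEmbeddings, TokenEmbeddings, TokenEmbeddings]],
) -> float:
    """Fraction of triplets where the positive outscores the negative."""
    correct = sum(
        1
        for anchor, pos, neg in triplets
        if maxsim(anchor, pos) > maxsim(anchor, neg)
    )
    return correct / len(triplets)


# Toy 2-dimensional token embeddings (made up for illustration only).
anchor = [[1.0, 0.0], [0.0, 1.0]]
positive = [[0.9, 0.1], [0.1, 0.9]]
negative = [[-1.0, 0.0], [0.0, -1.0]]

# The first triplet is ordered correctly; the second swaps positive and negative.
toy_triplets = [(anchor, positive, negative), (anchor, negative, positive)]
accuracy = triplet_accuracy(toy_triplets)
print(f"Triplet accuracy: {accuracy:.2f}")
```

With real data, the token embeddings would come from the model instead of hand-written lists, and the reported 0.9848 would correspond to roughly 98.5% of evaluation triplets being ordered correctly.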

## Usage
First, install the PyLate library:

```bash
pip install -U pylate
```

### Calculate Similarity

```python
import torch
from pylate import models

# Load the ColBERT model
model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

# Move the model to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Example data for similarity comparison
query = "¿Cuál es la capital de España?"  # Query sentence
positive_doc = "La capital de España es Madrid."  # Relevant document
negative_doc = "Florida es un estado en los Estados Unidos."  # Irrelevant document
sentences = [query, positive_doc, negative_doc]  # Combine all texts

# Tokenize the input sentences using ColBERT's tokenizer
inputs = model.tokenize(sentences)

# Move all input tensors to the same device as the model (GPU/CPU)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate token embeddings (no gradients needed for inference)
with torch.no_grad():
    embeddings_dict = model(inputs)
    embeddings = embeddings_dict["token_embeddings"]


# Define ColBERT's MaxSim similarity function
def colbert_similarity(query_emb, doc_emb):
    """
    Computes ColBERT-style similarity between query and document embeddings.
    Uses maximum similarity (MaxSim) between individual tokens.

    Args:
        query_emb: [query_tokens, embedding_dim]
        doc_emb: [doc_tokens, embedding_dim]

    Returns:
        Scalar tensor: the MaxSim score averaged over the query tokens
    """
    # Compute dot product between all token pairs
    similarity_matrix = torch.matmul(query_emb, doc_emb.T)

    # Get maximum similarity for each query token (MaxSim)
    max_similarities = similarity_matrix.max(dim=1).values

    # Return average of maximum similarities (normalized by query length)
    return max_similarities.sum() / query_emb.shape[0]


# Extract embeddings for each text
query_emb = embeddings[0]
positive_emb = embeddings[1]
negative_emb = embeddings[2]

# Compute similarity scores
positive_score = colbert_similarity(query_emb, positive_emb)
negative_score = colbert_similarity(query_emb, negative_emb)

print(f"Similarity with positive document: {positive_score.item():.4f}")
print(f"Similarity with negative document: {negative_score.item():.4f}")
```

## Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.4.1
- PyLate: 1.1.7
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.2.1
- Datasets: 3.3.1
- Tokenizers: 0.21.0

## Purpose
This fine-tuned model is designed for **Spanish-language applications** that need **efficient semantic search**. It compares query and document embeddings at the token level with the MaxSim operator, which makes it well suited to **question answering and document retrieval**.

- **Developed by:** fjmgAI
- **License:** apache-2.0

[<img src="https://github.com/lightonai/pylate/blob/main/docs/img/logo.png?raw=true" width="200"/>](https://github.com/lightonai/pylate)