raialvaro committed on
Commit
fa90928
·
verified ·
1 Parent(s): 47a1ead

Update README.md

  1. README.md +88 -467
README.md CHANGED
@@ -2,13 +2,12 @@
2
  tags:
3
  - ColBERT
4
  - PyLate
5
- - sentence-transformers
6
  - sentence-similarity
7
  - feature-extraction
8
  - generated_from_trainer
9
  - dataset_size:909188
10
  - loss:Contrastive
11
- base_model: colbert-ir/colbertv2.0
12
  datasets:
13
  - baconnier/rag-comprehensive-triplets
14
  pipeline_tag: sentence-similarity
@@ -16,7 +15,7 @@ library_name: PyLate
16
  metrics:
17
  - accuracy
18
  model-index:
19
- - name: PyLate model based on colbert-ir/colbertv2.0
20
  results:
21
  - task:
22
  type: col-berttriplet
@@ -28,40 +27,35 @@ model-index:
28
  - type: accuracy
29
  value: 0.9841766953468323
30
  name: Accuracy
 
 
 
 
31
  ---
32
 
33
- # PyLate model based on colbert-ir/colbertv2.0
34
 
35
- This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0) on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
36
 
37
- ## Model Details
 
38
 
39
- ### Model Description
40
- - **Model Type:** PyLate model
41
- - **Base model:** [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0) <!-- at revision c1e84128e85ef755c096a95bdb06b47793b13acf -->
42
- - **Document Length:** 180 tokens
43
- - **Query Length:** 32 tokens
44
- - **Output Dimensionality:** 128 dimensions
45
- - **Similarity Function:** MaxSim
46
- - **Training Dataset:**
47
- - [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets)
48
- <!-- - **Language:** Unknown -->
49
- <!-- - **License:** Unknown -->
50
 
51
- ### Model Sources
 
52
 
53
- - **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
54
- - **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
55
- - **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
56
 
57
- ### Full Model Architecture
 
 
58
 
59
- ```
60
- ColBERT(
61
- (0): Transformer({'max_seq_length': 179, 'do_lower_case': False}) with Transformer model: BertModel
62
- (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
63
- )
64
- ```
65
 
66
  ## Usage
67
  First install the PyLate library:
@@ -70,406 +64,72 @@ First install the PyLate library:
70
  pip install -U pylate
71
  ```
72
 
73
- ### Retrieval
74
-
75
- PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
76
-
77
- #### Indexing documents
78
-
79
- First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
80
-
81
- ```python
82
- from pylate import indexes, models, retrieve
83
-
84
- # Step 1: Load the ColBERT model
85
- model = models.ColBERT(
86
- model_name_or_path=pylate_model_id,
87
- )
88
-
89
- # Step 2: Initialize the Voyager index
90
- index = indexes.Voyager(
91
- index_folder="pylate-index",
92
- index_name="index",
93
- override=True, # This overwrites the existing index if any
94
- )
95
-
96
- # Step 3: Encode the documents
97
- documents_ids = ["1", "2", "3"]
98
- documents = ["document 1 text", "document 2 text", "document 3 text"]
99
-
100
- documents_embeddings = model.encode(
101
- documents,
102
- batch_size=32,
103
- is_query=False, # Ensure that it is set to False to indicate that these are documents, not queries
104
- show_progress_bar=True,
105
- )
106
-
107
- # Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
108
- index.add_documents(
109
- documents_ids=documents_ids,
110
- documents_embeddings=documents_embeddings,
111
- )
112
- ```
113
-
114
- Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
115
 
116
  ```python
117
- # To load an index, simply instantiate it with the correct folder/name and without overriding it
118
- index = indexes.Voyager(
119
- index_folder="pylate-index",
120
- index_name="index",
121
- )
122
  ```
123
 
124
- #### Retrieving top-k documents for queries
125
-
126
- Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
127
- To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the best matches:
128
-
129
- ```python
130
- # Step 1: Initialize the ColBERT retriever
131
- retriever = retrieve.ColBERT(index=index)
132
-
133
- # Step 2: Encode the queries
134
- queries_embeddings = model.encode(
135
- ["query for document 3", "query for document 1"],
136
- batch_size=32,
137
- is_query=True, # Ensure that it is set to True to indicate that these are queries
138
- show_progress_bar=True,
139
- )
140
-
141
- # Step 3: Retrieve top-k documents
142
- scores = retriever.retrieve(
143
- queries_embeddings=queries_embeddings,
144
- k=10, # Retrieve the top 10 matches for each query
145
- )
146
- ```
147
-
148
- ### Reranking
149
- If you only want to use the ColBERT model to rerank the output of a first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:
150
-
151
- ```python
152
- from pylate import rank, models
153
-
154
- queries = [
155
- "query A",
156
- "query B",
157
- ]
158
-
159
- documents = [
160
- ["document A", "document B"],
161
- ["document 1", "document C", "document B"],
162
- ]
163
-
164
- documents_ids = [
165
- [1, 2],
166
- [1, 3, 2],
167
- ]
168
-
169
- model = models.ColBERT(
170
- model_name_or_path=pylate_model_id,
171
- )
172
-
173
- queries_embeddings = model.encode(
174
- queries,
175
- is_query=True,
176
- )
177
-
178
- documents_embeddings = model.encode(
179
- documents,
180
- is_query=False,
181
- )
182
-
183
- reranked_documents = rank.rerank(
184
- documents_ids=documents_ids,
185
- queries_embeddings=queries_embeddings,
186
- documents_embeddings=documents_embeddings,
187
- )
188
- ```
189
-
190
- <!--
191
- ### Direct Usage (Transformers)
192
-
193
- <details><summary>Click to see the direct usage in Transformers</summary>
194
-
195
- </details>
196
- -->
197
-
198
- <!--
199
- ### Downstream Usage (Sentence Transformers)
200
-
201
- You can finetune this model on your own dataset.
202
-
203
- <details><summary>Click to expand</summary>
204
-
205
- </details>
206
- -->
207
-
208
- <!--
209
- ### Out-of-Scope Use
210
-
211
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
212
- -->
213
-
214
- ## Evaluation
215
-
216
- ### Metrics
217
-
218
- #### Col BERTTriplet
219
-
220
- * Evaluated with <code>pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator</code>
221
-
222
- | Metric | Value |
223
- |:-------------|:-----------|
224
- | **accuracy** | **0.9842** |
225
-
226
- <!--
227
- ## Bias, Risks and Limitations
228
-
229
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
230
- -->
231
-
232
- <!--
233
- ### Recommendations
234
-
235
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
236
- -->
237
-
238
- ## Training Details
239
-
240
- ### Training Dataset
241
-
242
- #### rag-comprehensive-triplets
243
-
244
- * Dataset: [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) at [678e83e](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets/tree/678e83ed6a74d17c38b33344168abc7787e39754)
245
- * Size: 909,188 training samples
246
- * Columns: <code>query</code>, <code>positive</code>, <code>negative</code>, <code>original_id</code>, <code>dataset_source</code>, <code>category</code>, and <code>language</code>
247
- * Approximate statistics based on the first 1000 samples:
248
- | | query | positive | negative | original_id | dataset_source | category | language |
249
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------|:-------------------------------------------------------------------------------|
250
- | type | string | string | string | string | string | string | string |
251
- | details | <ul><li>min: 9 tokens</li><li>mean: 26.32 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 29.53 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 30.37 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 5.22 tokens</li><li>max: 6 tokens</li></ul> | <ul><li>min: 21 tokens</li><li>mean: 21.0 tokens</li><li>max: 21 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 6.25 tokens</li><li>max: 7 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 4.0 tokens</li><li>max: 4 tokens</li></ul> |
252
- * Samples:
253
- | query | positive | negative | original_id | dataset_source | category | language |
254
- |:---------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------|:---------------------------------------------------------------|:-----------------------|:----------------|
255
- | <code>Dime los nombres de cinco equipos deportivos profesionales que están ubicados en Nueva York</code> | <code>Los equipos deportivos profesionales del estado de Nueva York son los Yankees de Nueva York, los Mets de Nueva York, los Islanders de Nueva York, los Nets de Brooklyn y los Knicks de Nueva York</code> | <code>En el estado de Nueva York, los cinco equipos deportivos profesionales son los Yankees, los Mets, los Rangers, los Nets y los Knicks</code> | <code>12558</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>open_qa</code> | <code>es</code> |
256
- | <code>¿Qué significa eso de MBB y eso que tiene que ver con las consultoras?</code> | <code>McKinsey & Company es una empresa global de consultoría de gestión fundada en 1926 por el profesor de la Universidad de Chicago James O. McKinsey, que ofrece servicios profesionales a empresas, gobiernos y otras organizaciones. McKinsey es la mayor y más antigua de las "Tres Grandes" consultoras de gestión (MBB), las tres mayores consultoras de estrategia del mundo por ingresos. La empresa se centra principalmente en las finanzas y operaciones de sus clientes.<br><br>Bajo la dirección de Marvin Bower, McKinsey se expandió por Europa durante las décadas de 1940 y 1950. En los años 60, Fred Gluck, de McKinsey, junto con Bruce Henderson, de Boston Consulting Group, Bill Bain, de Bain & Company, y Michael Porter, de Harvard Business School, transformaron la cultura empresarial Una publicación de 1975 de John L. Neuman, de McKinsey, introdujo la práctica empresarial del "análisis del valor de los gastos generales", que contribuyó a una tendencia a la reducción que eliminó muchos pue...</code> | <code>McKinsey & Company es una empresa global de servicios financieros fundada en 1926 por James O. McKinsey, que ofrece servicios de consultoría a empresas y gobiernos.</code> | <code>11131</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>closed_qa</code> | <code>es</code> |
257
- | <code>Tour de france desde 1903 hasta ahora, ¿alguna vez no ha habido la carrera?</code> | <code>El tour de france se celebra desde 1903, y las únicas cancelaciones destacadas se produjeron con motivo de las dos guerras mundiales</code> | <code>El tour de france ha corrido todos los años desde 1903, excepto durante la Primera Guerra Mundial.</code> | <code>14640</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>closed_qa</code> | <code>es</code> |
258
- * Loss: <code>pylate.losses.contrastive.Contrastive</code>
259
-
260
- ### Evaluation Dataset
261
-
262
- #### rag-comprehensive-triplets
263
-
264
- * Dataset: [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) at [678e83e](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets/tree/678e83ed6a74d17c38b33344168abc7787e39754)
265
- * Size: 909,188 evaluation samples
266
- * Columns: <code>query</code>, <code>positive</code>, <code>negative</code>, <code>original_id</code>, <code>dataset_source</code>, <code>category</code>, and <code>language</code>
267
- * Approximate statistics based on the first 1000 samples:
268
- | | query | positive | negative | original_id | dataset_source | category | language |
269
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------|:-------------------------------------------------------------------------------|
270
- | type | string | string | string | string | string | string | string |
271
- | details | <ul><li>min: 9 tokens</li><li>mean: 26.28 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 28.98 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 29.95 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 5.18 tokens</li><li>max: 6 tokens</li></ul> | <ul><li>min: 21 tokens</li><li>mean: 21.0 tokens</li><li>max: 21 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 6.23 tokens</li><li>max: 7 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 4.0 tokens</li><li>max: 4 tokens</li></ul> |
272
- * Samples:
273
- | query | positive | negative | original_id | dataset_source | category | language |
274
- |:------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------|:---------------------------------------------------------------|:------------------------------------|:----------------|
275
- | <code>Dado el texto, ¿cuál es el tamaño de la población del país de Perú?</code> | <code>Perú tiene una población de más de 34 millones de habitantes.</code> | <code>La población del país de Perú es de más de 32 millones de habitantes.</code> | <code>7265</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>closed_qa</code> | <code>es</code> |
276
- | <code>¿Cuál es el significado de 'acoplamiento de mareas'?</code> | <code>El acoplamiento de marea o rotación sincrónica es lo que hace que la cara de un objeto astronómico esté fija apuntando a otro, igual que la cara visible de la Luna está siempre apuntando a la Tierra. Un objeto acoplado tarda el mismo tiempo en girar sobre su eje que en trasladarse alrededor de su compañero. Esta rotación sincrónica, también conocida como rotación capturada o rotación sincrónica, hace que un hemisferio apunte continuamente hacia el objeto compañero. Normalmente, sólo el satélite se acopla alrededor de un planeta mayor, pero si la diferencia de masa entre los dos cuerpos y la distancia entre ellos es pequeña, ambos objetos pueden tener un acoplamiento de marea recíproco, como Plutón y Caronte.</code> | <code>El acoplamiento de mareas es un fenómeno astronómico en el que un objeto gira alrededor de otro objeto, manteniendo siempre la misma distancia hacia él.</code> | <code>2892</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>closed_qa</code> | <code>es</code> |
277
- | <code>¿Trópicos a la nieve fue co-ganadora de la Medalla de Oro del Instituto Australiano de Cinematografía a la Mejor Película en 1965?</code> | <code>Del trópico a la nieve es un cortometraje documental australiano de 1964. Fue una de las películas australianas más conocidas de los años 60. Se produjo bajo los auspicios de la Unidad Cinematográfica de la Commonwealth (CFU), más tarde reincorporada como Film Australia. Fue codirigido por Jack Lee y Richard (Dick) Mason, y contó con Reg Livermore (en su primer papel cinematográfico) como uno de los "narradores".<br><br>Una de las tareas habituales de la UFC en aquella época era la producción de cortometrajes destinados a su distribución en el extranjero y cuyo objetivo era promocionar Australia como destino atractivo para emigrantes y turistas. Del trópico a la nieve supuso una ruptura significativa con el estilo tradicional de este tipo de largometrajes, y destaca por su enfoque subversivo y satírico del tema.<br><br>En lugar de utilizar la narración forzada y autoritaria a una sola voz, típica de los "documentales de promoción" de la época, Mason y Lee optaron por un enfoque so...</code> | <code>Del trópico a la nieve es un documental australiano de 1964 que ganó el premio al mejor director en 1965.</code> | <code>6611</code> | <code>argilla/databricks-dolly-15k-curated-multilingual</code> | <code>information_extraction</code> | <code>es</code> |
278
- * Loss: <code>pylate.losses.contrastive.Contrastive</code>
279
-
280
- ### Training Hyperparameters
281
- #### Non-Default Hyperparameters
282
-
283
- - `eval_strategy`: steps
284
- - `per_device_train_batch_size`: 16
285
- - `per_device_eval_batch_size`: 16
286
- - `learning_rate`: 2e-05
287
- - `num_train_epochs`: 1
288
- - `fp16`: True
289
- - `load_best_model_at_end`: True
290
-
291
- #### All Hyperparameters
292
- <details><summary>Click to expand</summary>
293
-
294
- - `overwrite_output_dir`: False
295
- - `do_predict`: False
296
- - `eval_strategy`: steps
297
- - `prediction_loss_only`: True
298
- - `per_device_train_batch_size`: 16
299
- - `per_device_eval_batch_size`: 16
300
- - `per_gpu_train_batch_size`: None
301
- - `per_gpu_eval_batch_size`: None
302
- - `gradient_accumulation_steps`: 1
303
- - `eval_accumulation_steps`: None
304
- - `torch_empty_cache_steps`: None
305
- - `learning_rate`: 2e-05
306
- - `weight_decay`: 0.0
307
- - `adam_beta1`: 0.9
308
- - `adam_beta2`: 0.999
309
- - `adam_epsilon`: 1e-08
310
- - `max_grad_norm`: 1.0
311
- - `num_train_epochs`: 1
312
- - `max_steps`: -1
313
- - `lr_scheduler_type`: linear
314
- - `lr_scheduler_kwargs`: {}
315
- - `warmup_ratio`: 0.0
316
- - `warmup_steps`: 0
317
- - `log_level`: passive
318
- - `log_level_replica`: warning
319
- - `log_on_each_node`: True
320
- - `logging_nan_inf_filter`: True
321
- - `save_safetensors`: True
322
- - `save_on_each_node`: False
323
- - `save_only_model`: False
324
- - `restore_callback_states_from_checkpoint`: False
325
- - `no_cuda`: False
326
- - `use_cpu`: False
327
- - `use_mps_device`: False
328
- - `seed`: 42
329
- - `data_seed`: None
330
- - `jit_mode_eval`: False
331
- - `use_ipex`: False
332
- - `bf16`: False
333
- - `fp16`: True
334
- - `fp16_opt_level`: O1
335
- - `half_precision_backend`: auto
336
- - `bf16_full_eval`: False
337
- - `fp16_full_eval`: False
338
- - `tf32`: None
339
- - `local_rank`: 0
340
- - `ddp_backend`: None
341
- - `tpu_num_cores`: None
342
- - `tpu_metrics_debug`: False
343
- - `debug`: []
344
- - `dataloader_drop_last`: False
345
- - `dataloader_num_workers`: 0
346
- - `dataloader_prefetch_factor`: None
347
- - `past_index`: -1
348
- - `disable_tqdm`: False
349
- - `remove_unused_columns`: True
350
- - `label_names`: None
351
- - `load_best_model_at_end`: True
352
- - `ignore_data_skip`: False
353
- - `fsdp`: []
354
- - `fsdp_min_num_params`: 0
355
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
356
- - `fsdp_transformer_layer_cls_to_wrap`: None
357
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
358
- - `deepspeed`: None
359
- - `label_smoothing_factor`: 0.0
360
- - `optim`: adamw_torch
361
- - `optim_args`: None
362
- - `adafactor`: False
363
- - `group_by_length`: False
364
- - `length_column_name`: length
365
- - `ddp_find_unused_parameters`: None
366
- - `ddp_bucket_cap_mb`: None
367
- - `ddp_broadcast_buffers`: False
368
- - `dataloader_pin_memory`: True
369
- - `dataloader_persistent_workers`: False
370
- - `skip_memory_metrics`: True
371
- - `use_legacy_prediction_loop`: False
372
- - `push_to_hub`: False
373
- - `resume_from_checkpoint`: None
374
- - `hub_model_id`: None
375
- - `hub_strategy`: every_save
376
- - `hub_private_repo`: None
377
- - `hub_always_push`: False
378
- - `gradient_checkpointing`: False
379
- - `gradient_checkpointing_kwargs`: None
380
- - `include_inputs_for_metrics`: False
381
- - `include_for_metrics`: []
382
- - `eval_do_concat_batches`: True
383
- - `fp16_backend`: auto
384
- - `push_to_hub_model_id`: None
385
- - `push_to_hub_organization`: None
386
- - `mp_parameters`:
387
- - `auto_find_batch_size`: False
388
- - `full_determinism`: False
389
- - `torchdynamo`: None
390
- - `ray_scope`: last
391
- - `ddp_timeout`: 1800
392
- - `torch_compile`: False
393
- - `torch_compile_backend`: None
394
- - `torch_compile_mode`: None
395
- - `dispatch_batches`: None
396
- - `split_batches`: None
397
- - `include_tokens_per_second`: False
398
- - `include_num_input_tokens_seen`: False
399
- - `neftune_noise_alpha`: None
400
- - `optim_target_modules`: None
401
- - `batch_eval_metrics`: False
402
- - `eval_on_start`: False
403
- - `use_liger_kernel`: False
404
- - `eval_use_gather_object`: False
405
- - `average_tokens_across_devices`: False
406
- - `prompts`: None
407
- - `batch_sampler`: batch_sampler
408
- - `multi_dataset_batch_sampler`: proportional
409
-
410
- </details>
411
-
412
- ### Training Logs
413
- | Epoch | Step | Training Loss | Validation Loss | accuracy |
414
- |:----------:|:--------:|:-------------:|:---------------:|:--------:|
415
- | 0.0538 | 500 | 1.1908 | - | - |
416
- | 0 | 0 | - | - | 0.7445 |
417
- | 0.0538 | 500 | - | 0.8940 | - |
418
- | 0.1076 | 1000 | 0.879 | - | - |
419
- | 0 | 0 | - | - | 0.7974 |
420
- | 0.1076 | 1000 | - | 0.7308 | - |
421
- | 0.1615 | 1500 | 0.7334 | - | - |
422
- | 0 | 0 | - | - | 0.8312 |
423
- | 0.1615 | 1500 | - | 0.6040 | - |
424
- | 0.2153 | 2000 | 0.6319 | - | - |
425
- | 0 | 0 | - | - | 0.8508 |
426
- | 0.2153 | 2000 | - | 0.5380 | - |
427
- | 0.2691 | 2500 | 0.5576 | - | - |
428
- | 0 | 0 | - | - | 0.8831 |
429
- | 0.2691 | 2500 | - | 0.4493 | - |
430
- | 0.3229 | 3000 | 0.5005 | - | - |
431
- | 0 | 0 | - | - | 0.9082 |
432
- | 0.3229 | 3000 | - | 0.3969 | - |
433
- | 0.3767 | 3500 | 0.4591 | - | - |
434
- | 0 | 0 | - | - | 0.9267 |
435
- | 0.3767 | 3500 | - | 0.3430 | - |
436
- | 0.4306 | 4000 | 0.3944 | - | - |
437
- | 0 | 0 | - | - | 0.9293 |
438
- | 0.4306 | 4000 | - | 0.2984 | - |
439
- | 0.4844 | 4500 | 0.3674 | - | - |
440
- | 0 | 0 | - | - | 0.9438 |
441
- | 0.4844 | 4500 | - | 0.2657 | - |
442
- | 0.5382 | 5000 | 0.3351 | - | - |
443
- | 0 | 0 | - | - | 0.9492 |
444
- | 0.5382 | 5000 | - | 0.2365 | - |
445
- | 0.5920 | 5500 | 0.3019 | - | - |
446
- | 0 | 0 | - | - | 0.9614 |
447
- | 0.5920 | 5500 | - | 0.2020 | - |
448
- | 0.6459 | 6000 | 0.2769 | - | - |
449
- | 0 | 0 | - | - | 0.9637 |
450
- | 0.6459 | 6000 | - | 0.1839 | - |
451
- | 0.6997 | 6500 | 0.2578 | - | - |
452
- | 0 | 0 | - | - | 0.9738 |
453
- | 0.6997 | 6500 | - | 0.1623 | - |
454
- | 0.7535 | 7000 | 0.2362 | - | - |
455
- | 0 | 0 | - | - | 0.9761 |
456
- | 0.7535 | 7000 | - | 0.1528 | - |
457
- | 0.8073 | 7500 | 0.2239 | - | - |
458
- | 0 | 0 | - | - | 0.9791 |
459
- | 0.8073 | 7500 | - | 0.1407 | - |
460
- | 0.8611 | 8000 | 0.2069 | - | - |
461
- | 0 | 0 | - | - | 0.9802 |
462
- | 0.8611 | 8000 | - | 0.1339 | - |
463
- | 0.9150 | 8500 | 0.2067 | - | - |
464
- | 0 | 0 | - | - | 0.9830 |
465
- | 0.9150 | 8500 | - | 0.1290 | - |
466
- | **0.9688** | **9000** | **0.1935** | **-** | **-** |
467
- | 0 | 0 | - | - | 0.9842 |
468
- | **0.9688** | **9000** | **-** | **0.1221** | **-** |
469
-
470
- * The bold row denotes the saved checkpoint.
471
-
472
- ### Framework Versions
473
  - Python: 3.10.12
474
  - Sentence Transformers: 3.4.1
475
  - PyLate: 1.1.7
@@ -479,48 +139,9 @@ You can finetune this model on your own dataset.
479
  - Datasets: 3.3.1
480
  - Tokenizers: 0.21.0
481
 
 
 
482
 
483
- ## Citation
484
-
485
- ### BibTeX
486
-
487
- #### Sentence Transformers
488
- ```bibtex
489
- @inproceedings{reimers-2019-sentence-bert,
490
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
491
- author = "Reimers, Nils and Gurevych, Iryna",
492
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
493
- month = "11",
494
- year = "2019",
495
- publisher = "Association for Computational Linguistics",
496
- url = "https://arxiv.org/abs/1908.10084"
497
- }
498
- ```
499
-
500
- #### PyLate
501
- ```bibtex
502
- @misc{PyLate,
503
- title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
504
- author={Chaffin, Antoine and Sourty, Raphaël},
505
- url={https://github.com/lightonai/pylate},
506
- year={2024}
507
- }
508
- ```
509
-
510
- <!--
511
- ## Glossary
512
-
513
- *Clearly define terms in order to be accessible across audiences.*
514
- -->
515
-
516
- <!--
517
- ## Model Card Authors
518
-
519
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
520
- -->
521
-
522
- <!--
523
- ## Model Card Contact
524
 
525
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
526
- -->
 
2
  tags:
3
  - ColBERT
4
  - PyLate
 
5
  - sentence-similarity
6
  - feature-extraction
7
  - generated_from_trainer
8
  - dataset_size:909188
9
  - loss:Contrastive
10
+ base_model: EuroBERT/EuroBERT-610m
11
  datasets:
12
  - baconnier/rag-comprehensive-triplets
13
  pipeline_tag: sentence-similarity
 
15
  metrics:
16
  - accuracy
17
  model-index:
18
+ - name: PyLate model based on EuroBERT/EuroBERT-610m
19
  results:
20
  - task:
21
  type: col-berttriplet
 
27
  - type: accuracy
28
  value: 0.9841766953468323
29
  name: Accuracy
30
+ license: apache-2.0
31
+ language:
32
+ - es
33
+ - en
34
  ---
35
 
36
+ ## Fine-Tuned Model
37
 
38
+ **`raialvaro/colbert-610M-EuroBERT`**
39
 
40
+ ## Base Model
41
+ **`EuroBERT/EuroBERT-610m`**
42
 
43
+ ## Fine-Tuning Method
44
+ Fine-tuning was performed using **[PyLate](https://github.com/lightonai/pylate)**, with contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. The resulting model maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
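To make the MaxSim operator concrete, here is a minimal pure-Python sketch. The 2-dimensional vectors are toy values chosen for illustration, not model output, and the helper names `dot` and `maxsim` are hypothetical rather than part of PyLate.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_emb: List[Vector], doc_emb: List[Vector]) -> float:
    """For each query token, keep its best dot product against any
    document token, then sum those maxima over the query tokens."""
    return sum(max(dot(q, d) for d in doc_emb) for q in query_emb)


# Toy "token embeddings" (illustrative values only)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
doc_b = [[0.5, 0.5]]                          # partial match only

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Dividing the result by `len(query_emb)` gives the length-normalized variant used in the usage snippet of this README; the ranking of documents for a single query is the same either way.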
45
 
46
+ ## Dataset
47
+ **[`baconnier/rag-comprehensive-triplets`](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets)**
48
 
49
+ ### Description
50
+ The dataset was filtered to Spanish-language examples, leaving **303,000 examples** in the **rag-comprehensive-triplets** format (`query`, `positive`, `negative`).
 
51
 
52
+ ## Fine-Tuning Details
53
+ - The model was trained with **contrastive training** (`pylate.losses.contrastive.Contrastive`).
54
+ - Evaluated with <code>pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator</code>
55
 
56
+ | Metric | Value |
57
+ |:-------------|:-----------|
58
+ | **accuracy** | **0.98417** |
 
 
 
59
 
60
  ## Usage
61
  First install the PyLate library:
 
64
  pip install -U pylate
65
  ```
66
 
67
+ ### Calculate Similarity
68
 
69
  ```python
70
+ import torch
71
+ from pylate import models
72
+
73
+ # Load the ColBERT model
74
+ model = models.ColBERT("raialvaro/colbert-610M-EuroBERT", trust_remote_code=True)
75
+
76
+ # Move the model to GPU if available, otherwise use CPU
77
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
78
+ model.to(device)
79
+
80
+ # Example data for similarity comparison
81
+ query = "¿Cuál es la capital de España?" # Query sentence
82
+ positive_doc = "La capital de España es Madrid." # Relevant document
83
+ negative_doc = "Florida es un estado en los Estados Unidos." # Irrelevant document
84
+ sentences = [query, positive_doc, negative_doc] # Combine all texts
85
+
86
+ # Tokenize the input sentences using ColBERT's tokenizer
87
+ inputs = model.tokenize(sentences)
88
+
89
+ # Move all input tensors to the same device as the model (GPU/CPU)
90
+ inputs = {key: value.to(device) for key, value in inputs.items()}
91
+
92
+ # Generate token embeddings (no gradients needed for inference)
93
+ with torch.no_grad():
94
+     embeddings_dict = model(inputs)
95
+     embeddings = embeddings_dict['token_embeddings']
96
+
97
+ # Define ColBERT's MaxSim similarity function
98
+ def colbert_similarity(query_emb, doc_emb):
99
+     """
100
+     Computes ColBERT-style similarity between query and document embeddings.
101
+     Uses maximum similarity (MaxSim) between individual tokens.
102
+
103
+     Args:
104
+         query_emb: [query_tokens, embedding_dim]
105
+         doc_emb: [doc_tokens, embedding_dim]
106
+
107
+     Returns:
108
+         Normalized similarity score
109
+     """
110
+     # Compute dot product between all token pairs
111
+     similarity_matrix = torch.matmul(query_emb, doc_emb.T)
112
+
113
+     # Get maximum similarity for each query token (MaxSim)
114
+     max_similarities = similarity_matrix.max(dim=1)[0]
115
+
116
+     # Return average of maximum similarities (normalized by query length)
117
+     return max_similarities.sum() / query_emb.shape[0]
118
+
119
+ # Extract embeddings for each text
120
+ query_emb = embeddings[0]
121
+ positive_emb = embeddings[1]
122
+ negative_emb = embeddings[2]
123
+
124
+ # Compute similarity scores
125
+ positive_score = colbert_similarity(query_emb, positive_emb)
126
+ negative_score = colbert_similarity(query_emb, negative_emb)
127
+
128
+ print(f"Similarity with positive document: {positive_score.item():.4f}")
129
+ print(f"Similarity with negative document: {negative_score.item():.4f}")
130
  ```
131
 
132
+ ## Framework Versions
133
  - Python: 3.10.12
134
  - Sentence Transformers: 3.4.1
135
  - PyLate: 1.1.7
 
139
  - Datasets: 3.3.1
140
  - Tokenizers: 0.21.0
141
 
142
+ ## Purpose
143
+ This fine-tuned model is designed for **Spanish-language applications** that need **efficient semantic search**. It compares query and document embeddings at the token level with the MaxSim operator, which makes it well suited to **question answering and document retrieval**.
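For document retrieval, the same token-level scoring can be used to rank a small candidate set. The following is a sketch under stated assumptions: the embeddings are toy values rather than model output, and `maxsim` and `top_k` are hypothetical helpers, not PyLate functions.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def maxsim(query_emb: List[Vector], doc_emb: List[Vector]) -> float:
    """Sum over query tokens of the best dot product with any document token."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_emb)
        for q in query_emb
    )


def top_k(
    query_emb: List[Vector], docs: Dict[str, List[Vector]], k: int
) -> List[Tuple[str, float]]:
    """Score every candidate document and return the k best (id, score) pairs."""
    scored = [(doc_id, maxsim(query_emb, emb)) for doc_id, emb in docs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy token embeddings keyed by a made-up document id
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "madrid": [[0.9, 0.1], [0.1, 0.9]],
    "florida": [[0.2, 0.1], [0.1, 0.3]],
    "paris": [[0.6, 0.2], [0.3, 0.5]],
}

ranking = top_k(query, docs, k=2)
print([doc_id for doc_id, _ in ranking])
```

Scoring every candidate exhaustively like this only suits short candidate lists, such as reranking the output of a first-stage retriever; larger collections would normally go through an index.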
144
 
145
 
146
+ - **Developed by:** raialvaro
147
+ - **License:** apache-2.0