---
license: other
license_name: fiqa-2018-non-commercial
base_model: mixedbread-ai/mxbai-edge-colbert-v0-32m
datasets:
- stefan-jo/fiqa-train-mined-reranker-scores
library_name: PyLate
pipeline_tag: sentence-similarity
tags:
- ColBERT
- PyLate
- sentence-transformers
- retrieval
- pooling
- kmeans
- distillation
---

# mxbai-edge-colbert-v0-32m FiQA KMeans PF1-6

This is a research checkpoint based on [`mixedbread-ai/mxbai-edge-colbert-v0-32m`](https://huggingface.co/mixedbread-ai/mxbai-edge-colbert-v0-32m), fine-tuned on FiQA training data with pooling-aware distillation. During training, document embeddings were pooled with **k-means**, and the pool factor was sampled uniformly from **1 to 6** for each batch. A pool factor of N reduces a document's token embeddings to roughly 1/N as many vectors, so a factor of 1 leaves the document unpooled.

The model is released as a companion artifact for the paper *Learn to Pool: Lightweight Fine-Tuning for Flexible Multi-Vector Compression*. It is intended primarily for research, reproduction, and further experimentation with pooling-aware late interaction retrieval.
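
The pooling step described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the training code from the paper: the function names `sample_pool_factor` and `kmeans_pool` are hypothetical, the cluster count is assumed to be `ceil(n_tokens / pool_factor)`, and the clustering shown is a simple cosine-similarity k-means on unit-length vectors with random initialisation.

```python
import math

import numpy as np


def sample_pool_factor(rng, low=1, high=6):
    """Draw one pool factor uniformly from {low, ..., high} (one draw per batch)."""
    return int(rng.integers(low, high + 1))


def kmeans_pool(token_embeddings, pool_factor, n_iter=10, seed=0):
    """Reduce (n_tokens, dim) unit vectors to about n_tokens / pool_factor centroids."""
    n_tokens = token_embeddings.shape[0]
    # Assumption: the target count is ceil(n_tokens / pool_factor).
    n_clusters = max(1, math.ceil(n_tokens / pool_factor))
    if n_clusters >= n_tokens:
        return token_embeddings.copy()  # pool factor 1 keeps every token vector

    rng = np.random.default_rng(seed)
    # Initialise centroids from distinct, randomly chosen token embeddings.
    centroids = token_embeddings[rng.choice(n_tokens, size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each token to its most similar centroid (cosine similarity,
        # since all vectors are unit length).
        assignments = np.argmax(token_embeddings @ centroids.T, axis=1)
        for cluster in range(n_clusters):
            members = token_embeddings[assignments == cluster]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                centroids[cluster] = members.mean(axis=0)
        # Re-normalise so pooled vectors stay comparable to unpooled ones.
        norms = np.linalg.norm(centroids, axis=1, keepdims=True)
        centroids = centroids / np.maximum(norms, 1e-12)
    return centroids


# Random unit vectors stand in for the token embeddings of one document.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(300, 64))
tokens /= np.linalg.norm(tokens, axis=1, keepdims=True)

pool_factor = sample_pool_factor(rng)
pooled = kmeans_pool(tokens, pool_factor)
print(pool_factor, pooled.shape)
```

Sampling a new factor for every batch exposes the model to several compression levels during fine-tuning, which is what the `PF1-6` suffix in the model name refers to.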

## Model Details

- Base model: `mixedbread-ai/mxbai-edge-colbert-v0-32m`
- Architecture: ColBERT / late interaction retrieval
- Query length: 32
- Document length: 300
- Embedding dimension: 64
- Training setup: k-means pooling, multi-factor `PF1-6` (pool factor sampled from 1 to 6)
- Training dataset: `stefan-jo/fiqa-train-mined-reranker-scores`
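
To make the effect of the pool factor concrete, the following back-of-the-envelope sketch estimates how many vectors a maximum-length document keeps at each factor. The document length and embedding dimension come from the list above; the `ceil` rounding and the 2-byte (float16) storage size are assumptions for illustration only, and `vectors_per_document` is a hypothetical helper.

```python
import math

DOC_LENGTH = 300     # maximum document length in tokens (from the list above)
EMBEDDING_DIM = 64   # embedding dimension (from the list above)
BYTES_PER_VALUE = 2  # assumption: vectors stored as float16


def vectors_per_document(n_tokens, pool_factor):
    """Number of vectors kept after pooling, assuming ceil rounding."""
    return max(1, math.ceil(n_tokens / pool_factor))


for pool_factor in (1, 2, 4, 6):
    n_vectors = vectors_per_document(DOC_LENGTH, pool_factor)
    n_bytes = n_vectors * EMBEDDING_DIM * BYTES_PER_VALUE
    print(f"PF{pool_factor}: {n_vectors} vectors, {n_bytes} bytes per document")
```

Real documents are usually shorter than the maximum length, so these figures are upper bounds under the stated assumptions.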

## Intended Use

This model is intended for:

- reproducing the paper's FiQA pooling-aware fine-tuning results
- experimenting with pooling-aware late interaction retrieval
- studying how multi-factor training affects retrieval under document compression

It is not presented as a general-purpose production retriever.

## Usage

```python
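from pylate import models

# Load the fine-tuned checkpoint from the Hugging Face Hub.
model = models.ColBERT(
    model_name_or_path="stefan-jo/mxbai-edge-colbert-v0-32m-fiqa-kmeans-pf1-6"
)

# Queries are encoded without pooling.
queries_embeddings = model.encode(
    ["What is the Sharpe ratio?"],
    is_query=True,
)

# Documents are pooled with k-means; pool_factor=4 lies inside the
# 1-6 range sampled during training.
documents_embeddings = model.encode(
    ["The Sharpe ratio measures return relative to risk."],
    is_query=False,
    pool_factor=4,
    pool_method="kmeans",
    use_sklearn=True,
)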
```
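
The snippet above stops at encoding. As a hedged illustration of how late interaction turns the two sets of embeddings into a relevance score, the sketch below implements the standard ColBERT MaxSim operator in NumPy. It does not use PyLate's own scoring utilities; `maxsim_score` is a hypothetical helper, and random unit vectors stand in for real model output.

```python
import numpy as np


def maxsim_score(query_embeddings, document_embeddings):
    """Sum, over query tokens, of the best dot product with any document vector."""
    similarities = query_embeddings @ document_embeddings.T  # (n_query, n_doc)
    return float(similarities.max(axis=1).sum())


rng = np.random.default_rng(0)
# Illustrative shapes: 32 query vectors and 75 pooled document vectors of dimension 64.
query = rng.normal(size=(32, 64))
query /= np.linalg.norm(query, axis=1, keepdims=True)
document = rng.normal(size=(75, 64))
document /= np.linalg.norm(document, axis=1, keepdims=True)

print(maxsim_score(query, document))
```

Because each query token only needs its single best match among the document vectors, pooling changes the score only to the extent that the pooled centroids drift away from the tokens that would otherwise have been the best matches.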

## Evaluation Snapshot

The table below is adapted from the paper's NanoBEIR cross-dataset effects table. All runs use **k-means pooling** at inference and report **NDCG@10**. PF is the pool factor used at inference, and bold marks the best score in each row.

| Dataset | PF | Baseline | FT SciFact KMeans PF1-6 | FT FiQA KMeans PF1-6 |
| --- | ---: | ---: | ---: | ---: |
| SciFact | 1 | 0.808 | **0.817** | 0.802 |
| SciFact | 2 | 0.765 | **0.813** | 0.774 |
| SciFact | 4 | 0.649 | **0.810** | 0.808 |
| SciFact | 6 | 0.609 | **0.795** | 0.758 |
| FiQA | 1 | 0.526 | 0.523 | **0.528** |
| FiQA | 2 | 0.488 | **0.519** | 0.505 |
| FiQA | 4 | 0.470 | 0.490 | **0.513** |
| FiQA | 6 | 0.431 | 0.459 | **0.467** |
| NFCorpus | 1 | **0.375** | 0.372 | 0.369 |
| NFCorpus | 2 | 0.370 | 0.370 | **0.378** |
| NFCorpus | 4 | 0.342 | 0.363 | **0.381** |
| NFCorpus | 6 | 0.307 | 0.363 | **0.369** |
| SCIDOCS | 1 | **0.396** | 0.393 | 0.382 |
| SCIDOCS | 2 | 0.371 | **0.385** | 0.375 |
| SCIDOCS | 4 | 0.347 | 0.374 | **0.387** |
| SCIDOCS | 6 | 0.332 | **0.380** | 0.371 |
| Touché2020 | 1 | **0.596** | 0.595 | 0.592 |
| Touché2020 | 2 | 0.597 | **0.602** | 0.601 |
| Touché2020 | 4 | 0.565 | **0.594** | 0.572 |
| Touché2020 | 6 | 0.545 | **0.573** | 0.571 |

For full experiments and additional tables, see the accompanying paper and repository:

- Paper: [stefan-jo.github.io/learn-to-pool/downloads/paper.pdf](https://stefan-jo.github.io/learn-to-pool/downloads/paper.pdf)
- Code and aggregate metrics: [stefan-jo/pylate](https://github.com/stefan-jo/pylate)
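
For readers unfamiliar with the metric in the table above, here is a minimal sketch of NDCG@10 for a single query. It uses the linear-gain formulation and hypothetical helper names (`dcg_at_k`, `ndcg_at_k`); it is not the evaluation code behind the reported numbers, which average the per-query value over a whole dataset.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (linear gain)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query; qrels maps document id to graded relevance."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places all judged documents in descending relevance order.
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical query with two relevant documents, one of them ranked second.
qrels = {"doc_a": 1, "doc_b": 1}
print(ndcg_at_k(["doc_x", "doc_a", "doc_y"], qrels))
```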

## Training Data Provenance

The training dataset was built from FiQA train data using a mining-and-reranking pipeline:

- hard negative mining with `BAAI/bge-small-en-v1.5`
- teacher scores from `BAAI/bge-reranker-v2-gemma`
- distillation training on mined candidate sets with reranker scores

The released training dataset is available separately as `stefan-jo/fiqa-train-mined-reranker-scores`.
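
The distillation step listed above can be illustrated with a small sketch. It assumes a KL-divergence objective between the teacher's and the student's score distributions over one query's candidate set, which is a common choice for reranker distillation; the exact loss used for this checkpoint is not stated in this card, and the helper names and scores below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one query's candidate scores."""
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the candidate documents of one query."""
    teacher_probs = softmax(teacher_scores)
    student_probs = softmax(student_scores)
    return sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


# Hypothetical scores for one query with three mined candidates.
teacher = [4.0, 1.0, -2.0]   # reranker (teacher) scores
student = [10.0, 9.0, 8.5]   # late-interaction (student) scores
print(kl_distillation_loss(student, teacher))
```

The loss is zero when the student ranks and spaces the candidates exactly as the teacher does, and grows as the two distributions diverge.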

## License and Provenance

This model is released under a **custom non-commercial notice** because it was fine-tuned on FiQA-2018-derived training data. The official FiQA-2018 source states that the relevant Opinion-based QA data are available only for non-commercial use.

Relevant upstream components:

- base model: `mixedbread-ai/mxbai-edge-colbert-v0-32m` (`Apache-2.0`)
- mining model: `BAAI/bge-small-en-v1.5` (`MIT`)
- reranker: `BAAI/bge-reranker-v2-gemma` (`Apache-2.0`)
- training data source: FiQA-2018 / BEIR-style preprocessing

Users should review and comply with the upstream dataset terms in addition to this repository notice.