---
license: other
license_name: fiqa-2018-non-commercial
base_model: mixedbread-ai/mxbai-edge-colbert-v0-32m
datasets:
- stefan-jo/fiqa-train-mined-reranker-scores
library_name: PyLate
pipeline_tag: sentence-similarity
tags:
- ColBERT
- PyLate
- sentence-transformers
- retrieval
- pooling
- kmeans
- distillation
---

# mxbai-edge-colbert-v0-32m FiQA KMeans PF1-6

This is a research checkpoint based on [`mixedbread-ai/mxbai-edge-colbert-v0-32m`](https://huggingface.co/mixedbread-ai/mxbai-edge-colbert-v0-32m), fine-tuned on FiQA training data with pooling-aware distillation. During training, document embeddings were pooled with **k-means** and the pool factor was sampled uniformly from **1 to 6** on each batch.

The model is released as a companion artifact for the paper *Learn to Pool: Lightweight Fine-Tuning for Flexible Multi-Vector Compression*. It is intended primarily for research, reproduction, and further experimentation with pooling-aware late interaction retrieval.

## Model Details

- Base model: `mixedbread-ai/mxbai-edge-colbert-v0-32m`
- Architecture: ColBERT / late interaction retrieval
- Query length: 32
- Document length: 300
- Embedding dimension: 64
- Training setup: k-means pooling, multi-factor `PF1-6`
- Training dataset: `stefan-jo/fiqa-train-mined-reranker-scores`

## Intended Use

This model is intended for:

- reproducing the paper's FiQA pooling-aware fine-tuning results
- experimenting with pooling-aware late interaction retrieval
- studying how multi-factor training affects retrieval under document compression

It is not presented as a general-purpose production retriever.
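
For readers new to the pool factor used throughout this card, the following is a minimal sketch of what k-means pooling does to a single document, together with the per-batch sampling of the pool factor from 1 to 6. It is not the training code: `kmeans_pool` is a hypothetical helper, the cluster-count rule `max(1, n // pool_factor)` is an assumption made for illustration, and the toy embeddings have dimension 4 rather than 64.

```python
import math
import random


def kmeans_pool(token_embeddings, pool_factor, n_iters=10, seed=0):
    """Pool token embeddings into about n / pool_factor centroids.

    Hypothetical helper: plain Lloyd's k-means on lists of floats, with the
    cluster count assumed to be max(1, n // pool_factor).
    """
    n = len(token_embeddings)
    k = max(1, n // pool_factor)
    if k >= n:
        # Pool factor 1 keeps every token embedding unchanged.
        return [list(v) for v in token_embeddings]

    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(token_embeddings, k)]

    for _ in range(n_iters):
        # Assignment step: attach each token to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in token_embeddings:
            nearest = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Multi-factor setup: one pool factor per batch, drawn uniformly from 1 to 6.
pool_factor = random.randint(1, 6)

# Toy document: 12 token embeddings of dimension 4.
rng = random.Random(42)
doc = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(12)]
pooled = kmeans_pool(doc, pool_factor=4)
print(len(doc), "->", len(pooled))  # 12 -> 3
```

Under this assumption a pool factor of 4 stores a quarter as many document vectors, which is the compression setting the evaluation rows labelled PF 4 refer to.
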

## Usage

```python
from pylate import models

# Load the fine-tuned checkpoint from the Hugging Face Hub.
model = models.ColBERT(
    model_name_or_path="stefan-jo/mxbai-edge-colbert-v0-32m-fiqa-kmeans-pf1-6"
)

# Queries are encoded without pooling.
queries_embeddings = model.encode(
    ["What is the Sharpe ratio?"],
    is_query=True,
)

# Documents are pooled with k-means at pool factor 4.
documents_embeddings = model.encode(
    ["The Sharpe ratio measures return relative to risk."],
    is_query=False,
    pool_factor=4,
    pool_method="kmeans",
    use_sklearn=True,
)
```

## Evaluation Snapshot

The table below is adapted from the paper's NanoBEIR cross-dataset effects table. All runs use **k-means pooling** at inference and report **NDCG@10**. PF denotes the pool factor, and bold marks the best score in each row.

| Dataset | PF | Baseline | FT SciFact KMeans PF1-6 | FT FiQA KMeans PF1-6 |
| --- | ---: | ---: | ---: | ---: |
| SciFact | 1 | 0.808 | **0.817** | 0.802 |
| SciFact | 2 | 0.765 | **0.813** | 0.774 |
| SciFact | 4 | 0.649 | **0.810** | 0.808 |
| SciFact | 6 | 0.609 | **0.795** | 0.758 |
| FiQA | 1 | 0.526 | 0.523 | **0.528** |
| FiQA | 2 | 0.488 | **0.519** | 0.505 |
| FiQA | 4 | 0.470 | 0.490 | **0.513** |
| FiQA | 6 | 0.431 | 0.459 | **0.467** |
| NFCorpus | 1 | **0.375** | 0.372 | 0.369 |
| NFCorpus | 2 | 0.370 | 0.370 | **0.378** |
| NFCorpus | 4 | 0.342 | 0.363 | **0.381** |
| NFCorpus | 6 | 0.307 | 0.363 | **0.369** |
| SCIDOCS | 1 | **0.396** | 0.393 | 0.382 |
| SCIDOCS | 2 | 0.371 | **0.385** | 0.375 |
| SCIDOCS | 4 | 0.347 | 0.374 | **0.387** |
| SCIDOCS | 6 | 0.332 | **0.380** | 0.371 |
| Touché2020 | 1 | **0.596** | 0.595 | 0.592 |
| Touché2020 | 2 | 0.597 | **0.602** | 0.601 |
| Touché2020 | 4 | 0.565 | **0.594** | 0.572 |
| Touché2020 | 6 | 0.545 | **0.573** | 0.571 |

For full experiments and additional tables, see the accompanying paper and repository:

- Paper: [stefan-jo.github.io/learn-to-pool/downloads/paper.pdf](https://stefan-jo.github.io/learn-to-pool/downloads/paper.pdf)
- Code and aggregate metrics: [stefan-jo/pylate](https://github.com/stefan-jo/pylate)

## Training Data Provenance

The training dataset was built from FiQA train data using a mining-and-reranking pipeline:

- hard negative mining with `BAAI/bge-small-en-v1.5`
- teacher scores from `BAAI/bge-reranker-v2-gemma`
- distillation training on mined candidate sets with reranker scores

The released training dataset is available separately as `stefan-jo/fiqa-train-mined-reranker-scores`.

## License and Provenance

This model is released under a **custom non-commercial notice** because it was fine-tuned on FiQA-2018-derived training data. The official FiQA-2018 source states that the relevant Opinion-based QA data are available only for non-commercial use.

Relevant upstream components:

- base model: `mixedbread-ai/mxbai-edge-colbert-v0-32m` (`Apache-2.0`)
- mining model: `BAAI/bge-small-en-v1.5` (`MIT`)
- reranker: `BAAI/bge-reranker-v2-gemma` (`Apache-2.0`)
- training data source: FiQA-2018 / BEIR-style preprocessing

Users should review and comply with the upstream dataset terms in addition to this repository notice.