---
language:
- en
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- late-interaction
- reasoning-retrieval
- edge
- generated_from_trainer
- loss:CachedContrastive
base_model: mixedbread-ai/mxbai-edge-colbert-v0-32m
datasets:
- hanhainebula/bge-reasoner-data
- reasonir/reasonir-data
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
---
# Reason-mxbai-colbert-v0-32m
**Reason-mxbai-colbert-v0-32m** is a ~32M-parameter late-interaction retriever, fine-tuned from [`mixedbread-ai/mxbai-edge-colbert-v0-32m`](https://huggingface.co/mixedbread-ai/mxbai-edge-colbert-v0-32m) for reasoning-intensive retrieval on the [BRIGHT benchmark](https://huggingface.co/datasets/xlangai/BRIGHT).
It is an **edge-scale sibling of [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT)** (150M): the same late-interaction recipe applied to a roughly 5× smaller backbone, with a **widened projection head (64 → 128 dim)** and a **two-stage curriculum** on VL + BGE-reasoner + ReasonIR-HQ hard negatives.
Average BRIGHT nDCG@10 = **19.00** at ~5× smaller inference cost than Reason-ModernColBERT (150M). See the [Evaluation](#evaluation) section for per-split numbers.
## Model Details
- **Model Type:** PyLate ColBERT (late-interaction, multi-vector)
- **Base model:** [mixedbread-ai/mxbai-edge-colbert-v0-32m](https://huggingface.co/mixedbread-ai/mxbai-edge-colbert-v0-32m)
- **Parameters:** ~32M (backbone) + widened projection head
- **Document Length (training):** 2048 tokens
- **Query Length (training):** 256 tokens
- **Output Dimensionality:** **128** per token (widened from the base's 64-dim)
- **Similarity Function:** MaxSim
- **Training Data:**
  - [hanhainebula/bge-reasoner-data](https://huggingface.co/datasets/hanhainebula/bge-reasoner-data): instruction-prefixed triples covering 12 BRIGHT domains
  - [reasonir/reasonir-data](https://huggingface.co/datasets/reasonir/reasonir-data): `vl` split (warmup) and `hq` split with hard negatives (polish)
- **Language:** en
- **License:** CC-BY-NC-4.0 (inherited from training data)
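To make the MaxSim similarity function listed above concrete, here is a minimal pure-Python sketch. It is an illustration on toy 2-dimensional vectors, not PyLate's implementation; the helper names `dot` and `maxsim` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """For each query token, take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy per-token embeddings (the real model emits 128-dim vectors per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.6, 0.4], [0.4, 0.3]]

score_a = maxsim(query, doc_a)  # best matches: 0.9 and 0.8
score_b = maxsim(query, doc_b)  # best matches: 0.6 and 0.4
print(score_a > score_b)
```

Because each query token only needs one good match, a document scores well when it covers every query token somewhere, regardless of where in the document that token appears.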
## Model Architecture
```
ColBERT(
  (0): Transformer({'max_seq_length': 127, 'do_lower_case': True}) with ModernBertModel
       hidden_size=384, num_hidden_layers=10, num_attention_heads=6,
       position_embedding_type='sans_pos', max_position_embeddings=7999
  (1): Dense(384 → 768, bias=False)
  (2): Dense(768 → 768, bias=False)
  (3): Dense(768 → 128, bias=False)   # widened from 64 → 128 to give MaxSim more channels
)
```
The final projection head was widened from 64 → 128 using a small-random initialization (std = 10% of existing row std) so the new channels receive non-zero gradient from the start. See `training/widen_colbert_projection.py`.
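The widening step can be sketched as follows. This is a hedged pure-Python illustration, not the contents of `training/widen_colbert_projection.py`: the function name `widen_projection` is hypothetical, the weight is a plain list of rows (one row per output channel), and "existing row std" is assumed here to mean the mean per-row standard deviation of the original weight.

```python
import random
import statistics


def widen_projection(weight, new_out_dim, scale=0.1, seed=0):
    """Return a copy of `weight` with extra output rows appended.

    Existing rows are copied unchanged. New rows are drawn from
    N(0, (scale * s)^2), where s is the mean per-row standard deviation
    of the existing rows (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    in_dim = len(weight[0])
    row_std = statistics.mean([statistics.pstdev(row) for row in weight])
    widened = [list(row) for row in weight]
    for _ in range(new_out_dim - len(weight)):
        widened.append([rng.gauss(0.0, scale * row_std) for _ in range(in_dim)])
    return widened


# Toy stand-in for the final projection weight: 4 output channels, 8 inputs.
toy_rng = random.Random(42)
base = [[toy_rng.gauss(0.0, 0.05) for _ in range(8)] for _ in range(4)]
wide = widen_projection(base, new_out_dim=8)
```

Copying the original rows keeps the first output channels identical to the base model at initialization, while the small non-zero noise in the new rows lets them receive gradient immediately.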
## Why widen the projection?
The base `mxbai-edge-colbert-v0-32m` outputs 64-dim per-token vectors. On reasoning-intensive retrieval with many structurally-similar tokens (code syntax, LaTeX math notation, operator punctuation), 64 channels saturate fast — MaxSim discrimination on those splits hits an architectural ceiling. Widening to 128-dim doubles the per-token channel budget, matching Reason-ModernColBERT's output dimensionality. The base weights are preserved exactly on the first 64 dims; only the extra 64 dims are learned during fine-tuning.
## Usage
First install PyLate:
```bash
pip install -U pylate
```
### Retrieval
```python
from pylate import indexes, models, retrieve

# Load the fine-tuned ColBERT model from the Hugging Face Hub.
model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0-32m")

# Create a Voyager index on disk; override=True replaces an existing index of the same name.
index = indexes.Voyager(index_folder="pylate-index", index_name="index", override=True)

# Encode documents as per-token embeddings and add them to the index.
docs = ["document 1 text", "document 2 text", "document 3 text"]
doc_ids = ["1", "2", "3"]
doc_embs = model.encode(docs, batch_size=32, is_query=False, show_progress_bar=True)
index.add_documents(documents_ids=doc_ids, documents_embeddings=doc_embs)

# Encode the instruction-prefixed query and retrieve the top 10 documents.
retriever = retrieve.ColBERT(index=index)
query_embs = model.encode(
    ["Given a Biology post, retrieve relevant passages that help answer the post.\nQuery: how do cells divide?"],
    is_query=True,
)
scores = retriever.retrieve(queries_embeddings=query_embs, k=10)
```
**Tip:** for best results on BRIGHT-style queries, prepend the domain instruction (`Given a {Biology|Coding|Math|...} post, retrieve relevant passages...`) followed by `\nQuery: {raw_query}` — that's the format the model was trained on (via the BGE-reasoner data).
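A small helper can keep the query format consistent. The sketch below assumes the full instruction wording from the Biology example in the retrieval snippet applies to other domains as well; the names `INSTRUCTION_TEMPLATE` and `format_query` are hypothetical and not part of PyLate.

```python
# Instruction wording taken from the Biology example; assumed for other domains.
INSTRUCTION_TEMPLATE = (
    "Given a {domain} post, retrieve relevant passages that help answer the post."
)


def format_query(raw_query: str, domain: str) -> str:
    """Prefix a raw query with the domain instruction, then '\\nQuery: '."""
    instruction = INSTRUCTION_TEMPLATE.format(domain=domain)
    return f"{instruction}\nQuery: {raw_query}"


formatted = format_query("how do cells divide?", domain="Biology")
print(formatted)
```

The resulting string can be passed to `model.encode([...], is_query=True)` in place of a raw query.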
### Reranking (no index)
```python
from pylate import rank, models

model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0-32m")

# One list of candidate documents (and matching ids) per query.
queries = ["query A", "query B"]
documents = [["document A", "document B"], ["document 1", "document C", "document B"]]
doc_ids = [[1, 2], [1, 3, 2]]

# Encode queries and candidates, then rerank each candidate list with MaxSim.
q_embs = model.encode(queries, is_query=True)
d_embs = model.encode(documents, is_query=False)
reranked = rank.rerank(
    documents_ids=doc_ids,
    queries_embeddings=q_embs,
    documents_embeddings=d_embs,
)
```
## Evaluation
### BRIGHT Benchmark — nDCG@10
Evaluated with the [MTEB BrightRetrieval](https://github.com/embeddings-benchmark/mteb) task via `evaluation/evaluate_bright.py`. Evaluation uses `query_length=256` (32 for Pony) and `document_length=2048`, matching the training setup.
| Split | nDCG@10 |
|---|---:|
| Biology | **32.71** |
| Earth Science | **43.88** |
| Economics | 18.70 |
| Psychology | 22.62 |
| Robotics | 18.43 |
| Stackoverflow | 16.78 |
| Sustainable Living | **20.77** |
| Leetcode | 17.67 |
| Pony | **20.73** |
| AoPS | 5.05 |
| Theorem — Q | 8.38 |
| Theorem — T | 2.25 |
| **Full mean** | **19.00** |
Raw per-split JSON under `results/`.
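As a quick sanity check, the full mean can be recomputed from the per-split values in the table. The dictionary below simply transcribes the table; the variable names are illustrative.

```python
from statistics import mean

# nDCG@10 per BRIGHT split, copied from the table above.
ndcg_at_10 = {
    "Biology": 32.71,
    "Earth Science": 43.88,
    "Economics": 18.70,
    "Psychology": 22.62,
    "Robotics": 18.43,
    "Stackoverflow": 16.78,
    "Sustainable Living": 20.77,
    "Leetcode": 17.67,
    "Pony": 20.73,
    "AoPS": 5.05,
    "Theorem - Q": 8.38,
    "Theorem - T": 2.25,
}

full_mean = mean(list(ndcg_at_10.values()))
print(f"{full_mean:.2f}")  # prints 19.00
```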
### Context
- [Reason-ModernColBERT (150M)](https://huggingface.co/lightonai/Reason-ModernColBERT): 22.62 mean at ~5× the parameter count.
- Dense single-vector baselines at similar scale (< 1B): ~13-15 mean.
- Our 64-dim predecessor (mxbai-edge-32m trained on same curriculum, pre-widening): ~18.4 mean.
On the natural-language and instruction-following splits (biology, earth_science, sustainable_living, pony, psychology), the 32M model is competitive with, or beats, the 150M Reason-ModernColBERT on individual splits. It lags on symbol-dense splits (leetcode, stackoverflow, aops, theoremqa) because of architectural choices in the base model: a case-insensitive tokenizer, no global positional embeddings (`sans_pos`), and a shallow 10-layer backbone. Training cannot recover these, so they cap performance on code and formal-math retrieval.
## Training
Projection widening followed by a two-stage curriculum on 8 H100 GPUs (2 nodes × 4 GPUs, matching Reason-ModernColBERT's 8-GPU setup):
1. **Widen projection head**: Small-random init for the new 64 channels, verified non-zero at encode time.
2. **Stage 1 (VL warmup)**: 1 epoch on `reasonir/reasonir-data` VL split (~181k triples), `lr=1e-5`, global batch 2048, `query_length=256`, `document_length=2048`.
3. **Stage 2 (BGE + HQ-hn polish)**: 1 epoch on merged [BGE-reasoner](https://huggingface.co/datasets/hanhainebula/bge-reasoner-data) (12 BRIGHT-domain triples with instruction prefixes) + ReasonIR-HQ with hard negatives (~2.7M triples total), `lr=5e-6`, global batch 2048.
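The batch and dataset sizes above imply a rough number of optimizer steps per stage. The sketch below is back-of-the-envelope arithmetic only: it uses the approximate triple counts quoted in the list and assumes no gradient accumulation, so the per-GPU batch is simply the global batch divided across the 8 GPUs.

```python
import math

NUM_GPUS = 2 * 4       # 2 nodes x 4 H100 GPUs
GLOBAL_BATCH = 2048    # global batch size used in both stages

# Per-GPU batch, assuming no gradient accumulation (an assumption).
per_gpu_batch = GLOBAL_BATCH // NUM_GPUS  # 256

# Approximate triple counts quoted above for one epoch of each stage.
approx_triples = {"stage1_vl": 181_000, "stage2_bge_hq": 2_700_000}

steps_per_epoch = {
    name: math.ceil(n / GLOBAL_BATCH) for name, n in approx_triples.items()
}
print(per_gpu_batch, steps_per_epoch)
```

Under these assumptions, Stage 1 is a short warmup of under a hundred steps, while Stage 2 accounts for most of the optimization.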
### Training loss
- `pylate.losses.cached_contrastive.CachedContrastive` (temperature=1.0, `gather_across_devices=True`).
- `max_grad_norm=100` (set via env var; default 1.0 over-clips when the widened projection has high bootstrap gradients).
Expected total wall-clock on 2 × 4 × H100: 6-8 hours.
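For intuition about the training objective, the following pure-Python sketch shows a plain in-batch contrastive loss over a query-by-document score matrix. It is not PyLate's `CachedContrastive`, which additionally caches gradients and gathers embeddings across devices; the function name `contrastive_loss` and the toy scores are hypothetical.

```python
import math


def contrastive_loss(scores, temperature=1.0):
    """Mean cross-entropy where scores[i][i] is the positive for query i.

    `scores[i][j]` stands for the MaxSim score between query i and document j.
    """
    total = 0.0
    for i, row in enumerate(scores):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(scores)


# Positives on the diagonal score highest: low loss.
well_ranked = [[9.0, 1.0, 0.5], [0.8, 8.5, 1.2], [0.3, 0.9, 9.4]]
# A negative outscores the positive for every query: high loss.
badly_ranked = [[1.0, 9.0, 0.5], [8.5, 0.8, 1.2], [0.3, 9.4, 0.9]]

print(contrastive_loss(well_ranked) < contrastive_loss(badly_ranked))
```

With `temperature=1.0` the raw scores are used as logits directly; a smaller temperature would sharpen the softmax over in-batch documents.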
## Evaluation (reproduce)
```bash
python evaluation/evaluate_bright.py \
  --model_path <model_path> \
--model_version baseline \
--query_length 256 \
--document_length 2048 \
--output_root results/
```
## License
apache-2.0
## Citation
If you find this model useful, please cite it along with the upstream work it builds on:
```bibtex
@misc{Reason-mxbai-colbert-v0-32m,
title={Reason-mxbai-colbert-v0-32m},
author={Abdelrahman Abdallah and Adam Jatowt},
url={https://huggingface.co/DataScience-UIBK/Reason-mxbai-colbert-v0-32m},
year={2025}
}
@misc{Reason-ModernColBERT,
title={Reason-ModernColBERT},
author={Chaffin, Antoine},
url={https://huggingface.co/lightonai/Reason-ModernColBERT},
year={2025}
}
@misc{mxbai-edge-colbert-v0-32m,
title={mxbai-edge-colbert-v0-32m},
author={Mixedbread AI},
url={https://huggingface.co/mixedbread-ai/mxbai-edge-colbert-v0-32m},
year={2025}
}
```
## Framework Versions
- Python: 3.10
- PyLate: 1.1.7+
- Sentence Transformers: 4.0.2
- Transformers: 4.48.2
- PyTorch: 2.5.1 (CUDA 12.4)
- Accelerate: 1.1.1
- Datasets: 2.21.0