---
language:
- en
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- late-interaction
- reasoning-retrieval
- edge
- generated_from_trainer
- loss:CachedContrastive
base_model: mixedbread-ai/mxbai-edge-colbert-v0-32m
datasets:
- hanhainebula/bge-reasoner-data
- reasonir/reasonir-data
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
---

# Reason-mxbai-colbert-v0-32m

**Reason-mxbai-colbert-v0-32m** is a ~32M-parameter late-interaction retriever, fine-tuned from [`mixedbread-ai/mxbai-edge-colbert-v0-32m`](https://huggingface.co/mixedbread-ai/mxbai-edge-colbert-v0-32m) for reasoning-intensive retrieval on the [BRIGHT benchmark](https://huggingface.co/datasets/xlangai/BRIGHT).

It is an **edge-scale sibling of [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT)** (150M): the same late-interaction recipe applied to a roughly 5× smaller backbone, with a **widened projection head (64 → 128 dim)** and a **two-stage curriculum** (ReasonIR VL warmup, then BGE-reasoner plus ReasonIR-HQ with hard negatives).

Average BRIGHT nDCG@10 = **19.00** with roughly 5× fewer parameters than Reason-ModernColBERT (150M). See the [Evaluation](#evaluation) section for per-split numbers.

## Model Details

- **Model Type:** PyLate ColBERT (late-interaction, multi-vector)
- **Base model:** [mixedbread-ai/mxbai-edge-colbert-v0-32m](https://huggingface.co/mixedbread-ai/mxbai-edge-colbert-v0-32m)
- **Parameters:** ~32M (backbone) + widened projection head
- **Document Length (training):** 2048 tokens
- **Query Length (training):** 256 tokens
- **Output Dimensionality:** **128** per token (widened from the base model's 64)
- **Similarity Function:** MaxSim
- **Training Data:**
  - [hanhainebula/bge-reasoner-data](https://huggingface.co/datasets/hanhainebula/bge-reasoner-data): instruction-prefixed triples covering the 12 BRIGHT domains
  - [reasonir/reasonir-data](https://huggingface.co/datasets/reasonir/reasonir-data): `vl` split (warmup) and `hq` split with hard negatives (polish)
- **Language:** en
- **License:** CC-BY-NC-4.0 (inherited from training data)

## Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 127, 'do_lower_case': True}) with ModernBertModel
       hidden_size=384, num_hidden_layers=10, num_attention_heads=6,
       position_embedding_type='sans_pos', max_position_embeddings=7999
  (1): Dense(384 → 768, bias=False)
  (2): Dense(768 → 768, bias=False)
  (3): Dense(768 → 128, bias=False)   # widened from 64 → 128 to give MaxSim more channels
)
```

The final projection head was widened from 64 → 128 using a small-random initialization (std = 10% of existing row std) so the new channels receive non-zero gradient from the start. See `training/widen_colbert_projection.py`.

## Why widen the projection?

The base `mxbai-edge-colbert-v0-32m` outputs 64-dim per-token vectors. On reasoning-intensive retrieval with many structurally similar tokens (code syntax, LaTeX math notation, operator punctuation), 64 channels saturate quickly, and MaxSim discrimination on those splits hits an architectural ceiling. Widening to 128 dims doubles the per-token channel budget and matches Reason-ModernColBERT's output dimensionality. The base weights are preserved exactly on the first 64 dims; only the extra 64 dims are newly initialized and learned from scratch during fine-tuning.

## Usage

First install PyLate:

```bash
pip install -U pylate
```

### Retrieval

```python
from pylate import indexes, models, retrieve

# Load the model and create a fresh Voyager index (override=True replaces any existing one).
model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0-32m")
index = indexes.Voyager(index_folder="pylate-index", index_name="index", override=True)

# Encode the documents (is_query=False) and add them to the index.
docs = ["document 1 text", "document 2 text", "document 3 text"]
doc_ids = ["1", "2", "3"]
doc_embs = model.encode(docs, batch_size=32, is_query=False, show_progress_bar=True)
index.add_documents(documents_ids=doc_ids, documents_embeddings=doc_embs)

# Encode the instruction-prefixed query (is_query=True) and retrieve the top 10 documents.
retriever = retrieve.ColBERT(index=index)
query_embs = model.encode(
    ["Given a Biology post, retrieve relevant passages that help answer the post.\nQuery: how do cells divide?"],
    is_query=True,
)
scores = retriever.retrieve(queries_embeddings=query_embs, k=10)
```

**Tip:** for best results on BRIGHT-style queries, prepend the domain instruction (`Given a {Biology|Coding|Math|...} post, retrieve relevant passages...`) followed by `\nQuery: {raw_query}`. This is the format the model was trained on (via the BGE-reasoner data).
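The prompt format from the tip can be wrapped in a small helper. This is a minimal sketch: `format_bright_query` is a hypothetical name, and the template is copied from the single Biology example in the retrieval snippet. The exact instruction wording for other domains is an assumption and may differ in the BGE-reasoner data.

```python
def format_bright_query(raw_query: str, domain: str) -> str:
    """Build an instruction-prefixed query string.

    Assumption: the same sentence template is reused for every domain;
    only the Biology variant appears verbatim in this model card.
    """
    instruction = (
        f"Given a {domain} post, retrieve relevant passages that help answer the post."
    )
    # The instruction and the raw query are separated by a newline and "Query: ".
    return f"{instruction}\nQuery: {raw_query}"


prompt = format_bright_query("how do cells divide?", domain="Biology")
print(prompt)
```

The resulting string can be passed to `model.encode([...], is_query=True)` in place of the hand-written query.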
### Reranking (no index)

```python
from pylate import rank, models

model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0-32m")

# One list of candidate documents (and matching ids) per query.
queries = ["query A", "query B"]
documents = [["document A", "document B"], ["document 1", "document C", "document B"]]
doc_ids = [[1, 2], [1, 3, 2]]

q_embs = model.encode(queries, is_query=True)
d_embs = model.encode(documents, is_query=False)

# Score each query against its own candidates with MaxSim and sort them.
reranked = rank.rerank(
    documents_ids=doc_ids,
    queries_embeddings=q_embs,
    documents_embeddings=d_embs,
)
```

## Evaluation

### BRIGHT Benchmark: nDCG@10

Evaluated with the [MTEB BrightRetrieval](https://github.com/embeddings-benchmark/mteb) task via `evaluation/evaluate_bright.py`. Queries are encoded with `query_length=256` (32 for the Pony split) and documents with `document_length=2048`, matching the training setup.

| Split | nDCG@10 |
|---|---:|
| Biology | **32.71** |
| Earth Science | **43.88** |
| Economics | 18.70 |
| Psychology | 22.62 |
| Robotics | 18.43 |
| Stackoverflow | 16.78 |
| Sustainable Living | **20.77** |
| Leetcode | 17.67 |
| Pony | **20.73** |
| AoPS | 5.05 |
| Theorem-Q | 8.38 |
| Theorem-T | 2.25 |
| **Full mean** | **19.00** |

Raw per-split JSON is available under `results/`.

### Context

- [Reason-ModernColBERT (150M)](https://huggingface.co/lightonai/Reason-ModernColBERT): 22.62 mean at ~5× the parameter count.
- Dense single-vector baselines at similar scale (< 1B): ~13-15 mean.
- Our 64-dim predecessor (mxbai-edge-32m trained on the same curriculum, pre-widening): ~18.4 mean.

On the natural-language and instruction-following splits (biology, earth_science, sustainable_living, pony, psychology), the 32M model is competitive with or beats the 150M Reason-ModernColBERT on individual splits. It lags on symbol-dense splits (leetcode, stackoverflow, aops, theoremqa) because of architectural choices in the base model: a case-insensitive tokenizer, no global positional embeddings (`sans_pos`), and a shallow 10-layer backbone. These cannot be recovered by training and cap performance on code and formal-math retrieval.
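The full mean in the table can be re-derived from the per-split scores. The snippet below is only an arithmetic check, not part of the evaluation pipeline. The dictionary keys are illustrative labels, and the scores and the 22.62 reference value are copied from this section.

```python
# Per-split nDCG@10 values copied from the BRIGHT table above.
scores = {
    "biology": 32.71,
    "earth_science": 43.88,
    "economics": 18.70,
    "psychology": 22.62,
    "robotics": 18.43,
    "stackoverflow": 16.78,
    "sustainable_living": 20.77,
    "leetcode": 17.67,
    "pony": 20.73,
    "aops": 5.05,
    "theorem_q": 8.38,
    "theorem_t": 2.25,
}

# Unweighted mean over the 12 splits.
full_mean = sum(scores.values()) / len(scores)

# Gap to the Reason-ModernColBERT (150M) mean quoted in the Context list.
gap_to_150m = 22.62 - full_mean

print(f"full mean: {full_mean:.2f}")
print(f"gap to 150M model: {gap_to_150m:.2f}")
```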
## Training

The projection head is widened first, then the model is trained with a two-stage curriculum on 8 H100 GPUs (2 nodes × 4 GPUs, matching Reason-ModernColBERT's 8-GPU setup):

1. **Widen projection head**: Small-random init for the new 64 channels, verified non-zero at encode time.
2. **Stage 1 (VL warmup)**: 1 epoch on the `reasonir/reasonir-data` VL split (~181k triples), `lr=1e-5`, global batch 2048, `query_length=256`, `document_length=2048`.
3. **Stage 2 (BGE + HQ-hn polish)**: 1 epoch on merged [BGE-reasoner](https://huggingface.co/datasets/hanhainebula/bge-reasoner-data) (instruction-prefixed triples covering the 12 BRIGHT domains) + ReasonIR-HQ with hard negatives (~2.7M triples total), `lr=5e-6`, global batch 2048.

### Training loss

- `pylate.losses.cached_contrastive.CachedContrastive` (temperature=1.0, `gather_across_devices=True`).
- `max_grad_norm=100` (set via env var; the default of 1.0 over-clips when the widened projection has high bootstrap gradients).

Expected total wall-clock time on 2 × 4 × H100: 6-8 hours.

## Evaluation (reproduce)

```bash
python evaluation/evaluate_bright.py \
  --model_path \
  --model_version baseline \
  --query_length 256 \
  --document_length 2048 \
  --output_root results/
```

## License

apache-2.0

## Citation

If you find this model useful, please cite it together with the upstream work it builds on:

```bibtex
@misc{Reason-mxbai-colbert-v0-32m,
  title={Reason-mxbai-colbert-v0-32m},
  author={Abdelrahman Abdallah and Adam Jatowt},
  url={https://huggingface.co/DataScience-UIBK/Reason-mxbai-colbert-v0-32m},
  year={2025}
}

@misc{Reason-ModernColBERT,
  title={Reason-ModernColBERT},
  author={Chaffin, Antoine},
  url={https://huggingface.co/lightonai/Reason-ModernColBERT},
  year={2025}
}

@misc{mxbai-edge-colbert-v0-32m,
  title={mxbai-edge-colbert-v0-32m},
  author={Mixedbread AI},
  url={https://huggingface.co/mixedbread-ai/mxbai-edge-colbert-v0-32m},
  year={2025}
}
```

## Framework Versions

- Python: 3.10
- PyLate: 1.1.7+
- Sentence Transformers: 4.0.2
- Transformers: 4.48.2
- PyTorch: 2.5.1 (CUDA 12.4)
- Accelerate: 1.1.1
- Datasets: 2.21.0