---
license: cc-by-4.0
pipeline_tag: sentence-similarity
library_name: PyLate
base_model: lightonai/GTE-ModernColBERT-v1
datasets:
- theatticusproject/cuad-qa
- theatticusproject/acord
- coastalcph/lex_glue
language:
- en
tags:
- ColBERT
- PyLate
- late-interaction
- sentence-transformers
- feature-extraction
- legal
- contracts
- clause-retrieval
- retrieval
---

# legal-colbert-clause-retriever

A small, open **late-interaction (ColBERT)** retriever fine-tuned for **finding clauses in legal contracts**: termination, assignment, limitation of liability, IP ownership, non-compete, governing law, and ~35 other common provision types. It maps queries and contract passages to sequences of 128-d token vectors and scores them with the MaxSim operator. It is a continuation fine-tune of [`lightonai/GTE-ModernColBERT-v1`](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) (149M params, ModernBERT-base backbone).

## Results

Evaluated on the **[MLEB](https://isaacus.com/mleb) Contractual Clause Retrieval** task (NDCG@10), the published benchmark for legal contract clause retrieval. Our evaluation reproduces the official leaderboard protocol exactly (BGE-M3 scores 0.7281 through our harness, matching the leaderboard to 4 decimals).

| Metric | Score |
|---|---|
| **NDCG@10** | **0.8338** |
| MAP | 0.7713 |
| Recall@10 | 0.9556 |

**At 149M parameters this is the best accuracy-per-parameter open model on the task**: 3rd of 17 open-source models, ahead of Google's EmbeddingGemma (308M, 0.829) and the same-size legal peer Free Law ModernBERT (0.764), and behind only Qwen3-Embedding-4B/8B (which are 27–53× larger).

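
For readers unfamiliar with the headline metric, NDCG@10 for a single query can be sketched as follows. This is an illustrative sketch, not the MLEB evaluation harness: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, and linear gain is assumed (with binary relevance labels it gives the same value as the exponential-gain variant). The reported score is the mean of this quantity over all queries.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevant, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    relevant: dict mapping document id -> relevance grade (assumed > 0).
    """
    gains = [relevant.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the highest-graded documents first.
    ideal_dcg = dcg_at_k(sorted(relevant.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical example: the single relevant clause is ranked second.
print(ndcg_at_k(["doc_7", "doc_3", "doc_9"], {"doc_3": 1}))
```

A relevant passage at rank 1 scores 1.0; pushing it down the list shrinks the score by the logarithmic discount, which is why NDCG@10 rewards retrievers that surface the right clause near the top.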
![Model size vs NDCG@10 on MLEB Contractual Clause Retrieval (open-source models)](clause_size_vs_ndcg.png)

## Usage

```bash
pip install pylate
```

```python
from pylate import models, rank

model = models.ColBERT("kmad00/legal-colbert-clause-retriever")

# Describe the clause you want to find
queries = model.encode(
    ["This is a contractual provision that limits the maximum liability a party can incur."],
    is_query=True,
)

# Candidate contract passages
documents = model.encode(
    [
        "In no event shall either party's aggregate liability exceed the fees paid in the prior twelve months...",
        "This Agreement shall be governed by the laws of the State of Delaware...",
    ],
    is_query=False,
)

# Score the candidates against the query (one list of ids and embeddings per query)
scores = rank.rerank(
    documents_ids=[["0", "1"]],
    queries_embeddings=queries,
    documents_embeddings=[documents],
)
print(scores)
```

Queries can be plain clause names (`"governing law"`), natural-language definitions, or questions; the model is robust to phrasing.

Configuration: document length 300 tokens, query length 48 tokens, output dimension 128, similarity = MaxSim.

## Supported clause types

Trained and evaluated on 41 CUAD clause categories plus ACORD drafting queries and LEDGAR provision labels, including: Cap on Liability / Uncapped Liability, IP Ownership Assignment, Joint IP Ownership, License Grant, Non-Compete, Anti-Assignment, Change of Control, Governing Law, Termination for Convenience, Renewal Term, Audit Rights, Insurance, Most Favored Nation, Exclusivity, Liquidated Damages, Source Code Escrow, ROFR/ROFO/ROFN, and more. As a retriever (not a fixed classifier), it also generalizes to clause types outside this set.

## License

**CC BY 4.0.** This model is a derivative of CC BY 4.0 training data (CUAD, ACORD, LEDGAR) and an Apache 2.0 base model. You may use it commercially and non-commercially; attribution is required (see below). No share-alike obligation applies.

## Base model

- [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1): Apache 2.0 (← [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) ← [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base))

## Training data

Produced by a chain of light continuation fine-tunes. Across the full lineage it was trained on the following datasets (and no others):

- **[CUAD](https://huggingface.co/datasets/theatticusproject/cuad-qa)**: CC BY 4.0. Hendrycks et al., "CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review," NeurIPS 2021. The Atticus Project.
- **[ACORD](https://huggingface.co/datasets/theatticusproject/acord)**: CC BY 4.0. The Atticus Project, "ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Drafting," 2025.
- **[LEDGAR](https://huggingface.co/datasets/coastalcph/lex_glue)** (LexGLUE `ledgar` config): CC BY 4.0. Tuggener et al., "LEDGAR," LREC 2020; derived from public-domain US SEC EDGAR filings. Chalkidis et al., "LexGLUE," ACL 2022.

Hard negatives were mined with BM25 from each dataset's own corpus. **No MLEB / `isaacus/contractual-clause-retrieval` data and no web-scraped data were used in training**; MLEB is used only as an evaluation benchmark.

## Limitations

- English-language commercial contracts (US-style); other jurisdictions/languages are out of distribution.
- Late-interaction (multi-vector) storage is heavier per document than single-vector embedders.
- The MLEB clause task is small (90 docs); treat ±1–2 points as noise.
- Trained on a narrow set of clause types; confidence is lower on provision types far from the training taxonomy.

## Acknowledgments

- Training data: [The Atticus Project](https://www.atticusprojectai.org/) (CUAD, ACORD); Tuggener et al. & [coastalcph/LexGLUE](https://github.com/coastalcph/lex-glue) (LEDGAR).
- Base model: [LightOn](https://huggingface.co/lightonai) (GTE-ModernColBERT-v1), built with [PyLate](https://github.com/lightonai/pylate).
- Benchmark: [Isaacus](https://isaacus.com/mleb) (MLEB), used for evaluation only, not training.

## Full model architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 299, 'architecture': 'ModernBertModel'})
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'Identity'})
)
```
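
The `Dense` layer above projects each 768-d token state down to 128 dimensions, and relevance is then computed over those per-token vectors with MaxSim: for every query token, take its best-matching document token, then sum those maxima. The pure-Python sketch below illustrates the operator on toy 3-d vectors. It is not PyLate's implementation; the helper names `l2_normalize` and `maxsim` are hypothetical, and L2-normalizing token vectors before the dot product is an assumption made here for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing, see lead-in)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the maximum dot product with any document token."""
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy 3-d vectors stand in for the model's 128-d token embeddings.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
doc_b = [[0.0, 0.0, 1.0]]

print(maxsim(query, doc_a))  # 2.0: both query tokens find an exact match
print(maxsim(query, doc_b))  # 0.0: no document token aligns with the query
```

Because every document keeps one vector per token rather than a single pooled vector, this scoring is more fine-grained than single-vector cosine similarity, at the cost of the heavier storage noted under Limitations.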