legal-colbert-clause-retriever

A small, open late-interaction (ColBERT) retriever fine-tuned for finding clauses in legal contracts — termination, assignment, limitation of liability, IP ownership, non-compete, governing law, and ~35 other common provision types. It maps queries and contract passages to sequences of 128-d token vectors and scores them with the MaxSim operator.
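
To make the MaxSim operator concrete, here is a minimal sketch that scores one query against two passages. It uses tiny 2-d toy vectors in place of the model's 128-d token vectors and assumes dot-product similarity; the function name `maxsim` and all values are hypothetical, and this is not PyLate's implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: for each query token vector, keep its best
    match among the document token vectors, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-d token vectors (illustrative only; the model emits 128-d vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
liability_passage = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
governing_law_passage = [[0.1, 0.2], [0.3, 0.1]]

score_liability = maxsim(query, liability_passage)
score_governing_law = maxsim(query, governing_law_passage)

# The passage whose tokens align better with the query tokens scores higher.
print(score_liability > score_governing_law)
```

Because every query token picks its own best-matching document token, a passage can score well even when the relevant wording is spread across the clause.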

It is a continuation fine-tune of lightonai/GTE-ModernColBERT-v1 (149M params, ModernBERT-base backbone).

Results

Evaluated on the MLEB Contractual Clause Retrieval task (NDCG@10), the published benchmark for legal contract clause retrieval. Our evaluation reproduces the official leaderboard protocol exactly (BGE-M3 scores 0.7281 through our harness, matching the leaderboard to 4 decimals).

Metric	Score
NDCG@10	0.8338
MAP	0.7713
Recall@10	0.9556
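
For readers unfamiliar with the headline metric, the following is a minimal sketch of NDCG@k using the common linear-gain formulation (gain = relevance, discount = log2(rank + 1)). The function names, document IDs, and toy ranking are hypothetical, and the official MLEB harness may differ in details such as the gain function or tie handling.

```python
import math
from typing import Dict, List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k relevance values."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, float], k: int = 10) -> float:
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: two relevant passages, retrieved at ranks 2 and 4.
ranking = ["d3", "d1", "d7", "d2"]
qrels = {"d1": 1.0, "d2": 1.0}

score = ndcg_at_k(ranking, qrels, k=10)
perfect = ndcg_at_k(["d1", "d2", "d3", "d7"], qrels, k=10)
print(score, perfect)
```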

At 149M parameters this is the best accuracy-per-parameter open model on the task. It ranks 3rd of 17 open-source models, ahead of Google's EmbeddingGemma (308M, 0.829) and the same-size legal peer Free Law ModernBERT (0.764), and behind only Qwen3-Embedding-4B/8B, which are roughly 27 to 54 times larger by nominal parameter count.

Usage

pip install pylate

from pylate import models, rank

model = models.ColBERT("kmad00/legal-colbert-clause-retriever")

# Describe the clause you want to find
queries = model.encode(
    ["This is a contractual provision that limits the maximum liability a party can incur."],
    is_query=True,
)

# Candidate contract passages
documents = model.encode(
    [
        "In no event shall either party's aggregate liability exceed the fees paid in the prior twelve months...",
        "This Agreement shall be governed by the laws of the State of Delaware...",
    ],
    is_query=False,
)

scores = rank.rerank(
    documents_ids=[["0", "1"]],
    queries_embeddings=queries,
    documents_embeddings=[documents],
)
print(scores)

Queries can be plain clause names ("governing law"), natural-language definitions, or questions; the model is robust to phrasing. Settings: document length 300 tokens, query length 48 tokens, output dimension 128, similarity MaxSim.

Supported clause types

Trained and evaluated on 41 CUAD clause categories plus ACORD drafting queries and LEDGAR provision labels, including: Cap on Liability / Uncapped Liability, IP Ownership Assignment, Joint IP Ownership, License Grant, Non-Compete, Anti-Assignment, Change of Control, Governing Law, Termination for Convenience, Renewal Term, Audit Rights, Insurance, Most Favored Nation, Exclusivity, Liquidated Damages, Source Code Escrow, ROFR/ROFO/ROFN, and more. As a retriever (not a fixed classifier) it also generalizes to clause types outside this set.

License

CC BY 4.0. This model is a derivative of CC BY 4.0 training data (CUAD, ACORD, LEDGAR) and an Apache 2.0 base model. You may use it commercially and non-commercially; attribution is required (see below). No share-alike obligation applies.

Base model

lightonai/GTE-ModernColBERT-v1 — Apache 2.0 (← Alibaba-NLP/gte-modernbert-base ← answerdotai/ModernBERT-base)

Training data

This model was produced by a chain of light continuation fine-tunes. Across the full lineage it was trained on the following datasets (and no others):

CUAD — CC BY 4.0. Hendrycks et al., "CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review," NeurIPS 2021. The Atticus Project.
ACORD — CC BY 4.0. The Atticus Project, "ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Drafting," 2025.
LEDGAR (LexGLUE ledgar config) — CC BY 4.0. Tuggener et al., "LEDGAR," LREC 2020; derived from public-domain US SEC EDGAR filings. Chalkidis et al., "LexGLUE," ACL 2022.

Hard negatives were mined with BM25 from each dataset's own corpus. No MLEB / isaacus/contractual-clause-retrieval data and no web-scraped data were used in training — MLEB is used only as an evaluation benchmark.
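
As an illustration of BM25 hard-negative mining, the sketch below ranks a small corpus against a query and keeps the top-scoring passages that are not labeled positive. It is a self-contained Okapi BM25 written for readability; the whitespace tokenizer, the `k1` and `b` values, the helper names, and the toy corpus are all assumptions, since the card does not state the actual mining settings.

```python
import math
from collections import Counter
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Naive lowercase whitespace tokenizer (assumption, for illustration)."""
    return text.lower().split()


def bm25_scores(query: str, corpus: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every corpus passage for the given query."""
    docs = [tokenize(doc) for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def mine_hard_negatives(query: str, corpus: List[str], positive_ids: Set[int], n: int = 2) -> List[int]:
    """Indices of the n highest-scoring passages that are not positives."""
    scores = bm25_scores(query, corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i not in positive_ids and scores[i] > 0][:n]


corpus = [
    "in no event shall either party's aggregate liability exceed the fees paid",  # positive
    "each party shall maintain liability insurance during the term",
    "this agreement shall be governed by the laws of the state of delaware",
    "neither party limits liability for fraud or wilful misconduct",
]
negatives = mine_hard_negatives("cap on aggregate liability", corpus, positive_ids={0})
print(negatives)
```

The mined passages share vocabulary with the query ("liability") without being the target clause, which is what makes them harder than random negatives.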

Limitations

English-language commercial contracts (US-style); other jurisdictions/languages are out of distribution.
Late-interaction (multi-vector) storage is heavier per document than single-vector embedders.
The MLEB clause task is small (90 docs); treat ±1–2 points as noise.
Trained on a narrow set of clause types; expect lower retrieval quality on provision types far from the training taxonomy.
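
To put the storage limitation above in numbers, here is a rough back-of-the-envelope sketch. It assumes embeddings are stored uncompressed as 4-byte floats and that a passage uses the full 300-token document length, so the multi-vector figure is an upper bound; the 768-d single-vector comparison and the function names are hypothetical.

```python
def multi_vector_bytes(num_tokens: int, dim: int = 128, bytes_per_value: int = 4) -> int:
    """Storage for one passage kept as one vector per token."""
    return num_tokens * dim * bytes_per_value


def single_vector_bytes(dim: int = 768, bytes_per_value: int = 4) -> int:
    """Storage for one passage kept as a single pooled vector."""
    return dim * bytes_per_value


# Upper bound: a passage that fills the 300-token document length.
max_doc_bytes = multi_vector_bytes(num_tokens=300)
single_bytes = single_vector_bytes()
ratio = max_doc_bytes / single_bytes

print(max_doc_bytes, single_bytes, ratio)
```

Shorter passages, lower-precision storage, or index-side compression would all reduce the gap.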

Acknowledgments

Training data: The Atticus Project (CUAD, ACORD); Tuggener et al. & coastalcph/LexGLUE (LEDGAR).
Base model: LightOn (GTE-ModernColBERT-v1), built with PyLate.
Benchmark: Isaacus (MLEB) — evaluation only, not training.

Full model architecture

ColBERT(
  (0): Transformer({'max_seq_length': 299, 'architecture': 'ModernBertModel'})
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'Identity'})
)
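
The dump above shows a bias-free dense layer with identity activation mapping each 768-d token state to 128 dimensions. The sketch below mimics that step on toy sizes (4-d to 2-d) so the arithmetic is easy to follow. The weights and helper names are hypothetical, and the L2 normalization is an assumption that is common in ColBERT-style models but not shown in the dump.

```python
import math
from typing import List


def project(hidden: List[float], weight: List[List[float]]) -> List[float]:
    """Bias-free linear map with identity activation: out = W @ hidden."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (assumed post-processing step)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


# Toy projection matrix: 2 output dims x 4 input dims (the model uses 128 x 768).
weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]

# One hidden state per token, as produced by the transformer backbone.
token_hidden_states = [
    [3.0, 2.0, 2.0, 9.0],
    [0.0, 1.0, 0.0, 5.0],
]

# Each token keeps its own low-dimensional vector; nothing is pooled.
token_vectors = [l2_normalize(project(h, weight)) for h in token_hidden_states]
print(token_vectors)
```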
