legal-colbert-clause-retriever

A small, open late-interaction (ColBERT) retriever fine-tuned for finding clauses in legal contracts — termination, assignment, limitation of liability, IP ownership, non-compete, governing law, and ~35 other common provision types. It maps queries and contract passages to sequences of 128-d token vectors and scores them with the MaxSim operator.
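To make the scoring step concrete, here is a minimal sketch of MaxSim in plain Python. It is an illustration of the operator, not the library's implementation: the function name `maxsim` and the toy 2-d vectors are assumptions chosen for readability, standing in for the model's 128-d token embeddings.

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take the best
    dot product against any document token, then sum over query tokens."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
    return total

# Toy 2-d vectors (hypothetical); the real model emits 128-d token vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches each query token exactly
doc_b = [[0.5, 0.5]]               # a single, partially matching token

print(maxsim(query, doc_a))  # 2.0
print(maxsim(query, doc_b))  # 1.0
```

Because every query token picks its own best-matching document token, a passage can score well when different parts of it answer different parts of the query.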

It is a continuation fine-tune of lightonai/GTE-ModernColBERT-v1 (149M params, ModernBERT-base backbone).

Results

Evaluated on the MLEB Contractual Clause Retrieval task (NDCG@10), the published benchmark for legal contract clause retrieval. Our evaluation reproduces the official leaderboard protocol exactly (BGE-M3 scores 0.7281 through our harness, matching the leaderboard to 4 decimals).

Metric      Score
NDCG@10     0.8338
MAP         0.7713
Recall@10   0.9556
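For readers unfamiliar with the headline metric, the following is a rough sketch of NDCG@10 with linear gain. It is not the benchmark harness: the helper names `dcg` and `ndcg_at_k` are hypothetical, and the sketch assumes the input list holds the relevance label of every candidate in ranked order, so the ideal ordering can be derived from the same list.

```python
import math

def dcg(relevances, k):
    # Discounted cumulative gain: each relevance is divided by log2(rank + 1),
    # where rank is 1-based (enumerate is 0-based, hence the +2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the best possible ordering of the same labels.
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0]))  # relevant passage ranked first -> 1.0
print(ndcg_at_k([0, 1, 0]))  # relevant passage ranked second -> lower score
```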

At 149M parameters, this is the best accuracy-per-parameter open model on the task. It ranks 3rd of 17 open-source models: ahead of Google's EmbeddingGemma (308M, 0.829) and the same-size legal peer Free Law ModernBERT (0.764), and behind only Qwen3-Embedding-4B and Qwen3-Embedding-8B, which are roughly 27–53× larger.

[Figure: Model size vs NDCG@10 on MLEB Contractual Clause Retrieval (open-source models)]

Usage

Install PyLate:

pip install pylate

Then encode a query and candidate passages, and rerank:

from pylate import models, rank

model = models.ColBERT("kmad00/legal-colbert-clause-retriever")

# Describe the clause you want to find
queries = model.encode(
    ["This is a contractual provision that limits the maximum liability a party can incur."],
    is_query=True,
)

# Candidate contract passages
documents = model.encode(
    [
        "In no event shall either party's aggregate liability exceed the fees paid in the prior twelve months...",
        "This Agreement shall be governed by the laws of the State of Delaware...",
    ],
    is_query=False,
)

scores = rank.rerank(
    documents_ids=[["0", "1"]],
    queries_embeddings=queries,
    documents_embeddings=[documents],
)
print(scores)

Queries can be plain clause names ("governing law"), natural-language definitions, or questions; the model is robust to phrasing. The maximum document length is 300 tokens and the query length is 48; token embeddings are 128-dimensional and similarity is MaxSim.

Supported clause types

Trained and evaluated on 41 CUAD clause categories plus ACORD drafting queries and LEDGAR provision labels, including: Cap on Liability / Uncapped Liability, IP Ownership Assignment, Joint IP Ownership, License Grant, Non-Compete, Anti-Assignment, Change of Control, Governing Law, Termination for Convenience, Renewal Term, Audit Rights, Insurance, Most Favored Nation, Exclusivity, Liquidated Damages, Source Code Escrow, ROFR/ROFO/ROFN, and more. As a retriever (not a fixed classifier) it also generalizes to clause types outside this set.

License

CC BY 4.0. This model is a derivative of CC BY 4.0 training data (CUAD, ACORD, LEDGAR) and an Apache 2.0 base model. You may use it commercially and non-commercially; attribution is required (see below). No share-alike obligation applies.

Base model

lightonai/GTE-ModernColBERT-v1 (Apache 2.0; 149M params, ModernBERT-base backbone).

Training data

This model was produced by a chain of light continuation fine-tunes. Across the full lineage, it was trained on the following datasets (and no others):

  • CUAD — CC BY 4.0. Hendrycks et al., "CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review," NeurIPS 2021. The Atticus Project.
  • ACORD — CC BY 4.0. The Atticus Project, "ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Drafting," 2025.
  • LEDGAR (LexGLUE ledgar config) — CC BY 4.0. Tuggener et al., "LEDGAR," LREC 2020; derived from public-domain US SEC EDGAR filings. Chalkidis et al., "LexGLUE," ACL 2022.

Hard negatives were mined with BM25 from each dataset's own corpus. No MLEB / isaacus/contractual-clause-retrieval data and no web-scraped data were used in training — MLEB is used only as an evaluation benchmark.
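As an illustration of BM25 hard-negative mining, the sketch below ranks a corpus against a query and keeps the top-scoring passages that are not labeled positive. It is not the training pipeline used for this model: the function names, the whitespace tokenizer, the `k1`/`b` values, and the three-passage corpus are all assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document against the query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(d) for d in corpus_tokens) / n_docs
    doc_freq = Counter(term for d in corpus_tokens for term in set(d))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def mine_hard_negatives(query, positive_ids, corpus, n_negatives=2):
    """Return indices of the highest-scoring passages that are not positives."""
    tokenized = [doc.lower().split() for doc in corpus]
    scores = bm25_scores(query.lower().split(), tokenized)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i not in positive_ids][:n_negatives]

# Hypothetical toy corpus: index 0 is the labeled positive.
corpus = [
    "liability of either party shall not exceed the fees paid",
    "each party shall maintain liability insurance coverage",
    "this agreement shall be governed by the laws of delaware",
]
negatives = mine_hard_negatives("cap on liability", {0}, corpus, n_negatives=1)
print(negatives)  # [1]: shares the term "liability" but is not a cap clause
```

Lexically similar but wrong passages like index 1 are the useful negatives here, since they force the model to separate clause meaning from shared vocabulary.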

Limitations

  • English-language commercial contracts (US-style); other jurisdictions/languages are out of distribution.
  • Late-interaction (multi-vector) storage is heavier per document than single-vector embedders.
  • The MLEB clause task is small (90 docs); treat ±1–2 points as noise.
  • Trained on a narrow set of clause types; confidence is lower on provision types far from the training taxonomy.
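The storage point above can be made concrete with some back-of-the-envelope arithmetic. The sketch assumes float32 storage, a passage that uses the full 300-token document length, and a hypothetical 768-d single-vector embedder as the comparison; real indexes often compress or quantize vectors, so treat the numbers as an upper bound.

```python
def multi_vector_bytes(n_tokens, dim=128, bytes_per_value=4):
    # One 128-d vector per token (float32 assumed).
    return n_tokens * dim * bytes_per_value

def single_vector_bytes(dim=768, bytes_per_value=4):
    # One vector per passage; the 768-d size is an illustrative assumption.
    return dim * bytes_per_value

late = multi_vector_bytes(300)   # 300 * 128 * 4 = 153600 bytes
single = single_vector_bytes()   # 768 * 4 = 3072 bytes
print(late, single, late / single)  # 153600 3072 50.0
```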

Acknowledgments

Full model architecture

ColBERT(
  (0): Transformer({'max_seq_length': 299, 'architecture': 'ModernBertModel'})
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'Identity'})
)