Add legal-colbert clause retriever (P6b): MLEB Contractual Clause Retrieval 0.8338 NDCG@10

Files changed (11)

1_Dense/config.json +7 -0
1_Dense/model.safetensors +3 -0
README.md +122 -0
clause_size_vs_ndcg.png +0 -0
config.json +78 -0
config_sentence_transformers.json +53 -0
model.safetensors +3 -0
modules.json +14 -0
sentence_bert_config.json +4 -0
tokenizer.json +0 -0
tokenizer_config.json +23 -0

1_Dense/config.json ADDED Viewed

	@@ -0,0 +1,7 @@

+{
+    "in_features": 768,
+    "out_features": 128,
+    "bias": false,
+    "activation_function": "torch.nn.modules.linear.Identity",
+    "use_residual": false
+}

1_Dense/model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ee7db427970339c14df30f7eace1dbf1f5ba9a4224ad55e963a227b8f8410f82
+size 393304
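
The pointer size above can be sanity-checked against `1_Dense/config.json`. This is a rough sketch that assumes the projection weight is stored as float32 and that the remaining bytes are the safetensors header; neither is stated in this diff, and the variable names are illustrative only.

```python
# Values copied from 1_Dense/config.json and the LFS pointer above.
IN_FEATURES = 768
OUT_FEATURES = 128
FILE_SIZE_BYTES = 393304

# Assumption: one float32 weight matrix, no bias ("bias": false).
BYTES_PER_FLOAT32 = 4
weight_bytes = IN_FEATURES * OUT_FEATURES * BYTES_PER_FLOAT32

# Whatever is left over would be file metadata (assumed: safetensors header).
overhead_bytes = FILE_SIZE_BYTES - weight_bytes

print(weight_bytes)    # 393216
print(overhead_bytes)  # 88
```

The weight matrix accounts for all but 88 bytes of the file, which is consistent with a single bias-free 768 to 128 projection.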

README.md ADDED Viewed

	@@ -0,0 +1,122 @@

+---
+license: cc-by-4.0
+pipeline_tag: sentence-similarity
+library_name: PyLate
+base_model: lightonai/GTE-ModernColBERT-v1
+datasets:
+- theatticusproject/cuad-qa
+- theatticusproject/acord
+- coastalcph/lex_glue
+language:
+- en
+tags:
+- ColBERT
+- PyLate
+- late-interaction
+- sentence-transformers
+- feature-extraction
+- legal
+- contracts
+- clause-retrieval
+- retrieval
+---
+# legal-colbert-clause-retriever
+A small, open **late-interaction (ColBERT)** retriever fine-tuned for **finding clauses in legal contracts** — termination, assignment, limitation of liability, IP ownership, non-compete, governing law, and ~35 other common provision types. It maps queries and contract passages to sequences of 128-d token vectors and scores them with the MaxSim operator.
+It is a continuation fine-tune of [`lightonai/GTE-ModernColBERT-v1`](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) (149M params, ModernBERT-base backbone).
+## Results
+Evaluated on the **[MLEB](https://isaacus.com/mleb) Contractual Clause Retrieval** task (NDCG@10), the published benchmark for legal contract clause retrieval. As a check on the evaluation harness, BGE-M3 scores 0.7281 through it, matching the official leaderboard to four decimal places.
+| Metric | Score |
+|---|---|
+| **NDCG@10** | **0.8338** |
+| MAP | 0.7713 |
+| Recall@10 | 0.9556 |
+**At 149M parameters this is the best accuracy-per-parameter open model on the task** — 3rd of 17 open-source models, ahead of Google's EmbeddingGemma (308M, 0.829) and the same-size legal peer Free Law ModernBERT (0.764), and behind only Qwen3-Embedding-4B/8B (which are 27–53× larger).
+![Model size vs NDCG@10 on MLEB Contractual Clause Retrieval (open-source models)](clause_size_vs_ndcg.png)
+## Usage
+```bash
+pip install pylate
+```
+```python
+from pylate import models, rank
+model = models.ColBERT("kmad00/legal-colbert-clause-retriever")
+# Describe the clause you want to find
+queries = model.encode(
+    ["This is a contractual provision that limits the maximum liability a party can incur."],
+    is_query=True,
+)
+# Candidate contract passages
+documents = model.encode(
+    [
+        "In no event shall either party's aggregate liability exceed the fees paid in the prior twelve months...",
+        "This Agreement shall be governed by the laws of the State of Delaware...",
+    ],
+    is_query=False,
+)
+scores = rank.rerank(
+    documents_ids=[["0", "1"]],
+    queries_embeddings=queries,
+    documents_embeddings=[documents],
+)
+print(scores)
+```
+Queries can be plain clause names (`"governing law"`), natural-language definitions, or questions; the model is robust to phrasing. Defaults: documents are truncated to 300 tokens and queries to 48 tokens, each token is embedded as a 128-d vector, and similarity is MaxSim.
+## Supported clause types
+Trained and evaluated on 41 CUAD clause categories plus ACORD drafting queries and LEDGAR provision labels, including: Cap on Liability / Uncapped Liability, IP Ownership Assignment, Joint IP Ownership, License Grant, Non-Compete, Anti-Assignment, Change of Control, Governing Law, Termination for Convenience, Renewal Term, Audit Rights, Insurance, Most Favored Nation, Exclusivity, Liquidated Damages, Source Code Escrow, ROFR/ROFO/ROFN, and more. As a retriever (not a fixed classifier) it also generalizes to clause types outside this set.
+## License
+**CC BY 4.0.** This model is a derivative of CC BY 4.0 training data (CUAD, ACORD, LEDGAR) and an Apache 2.0 base model. You may use it commercially and non-commercially; attribution is required (see below). No share-alike obligation applies.
+## Base model
+- [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) — Apache 2.0 (← [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) ← [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base))
+## Training data
+This model was produced by a chain of light continuation fine-tunes. Across the full lineage it was trained on the following datasets (and no others):
+- **[CUAD](https://huggingface.co/datasets/theatticusproject/cuad-qa)** — CC BY 4.0. Hendrycks et al., "CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review," NeurIPS 2021. The Atticus Project.
+- **[ACORD](https://huggingface.co/datasets/theatticusproject/acord)** — CC BY 4.0. The Atticus Project, "ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Drafting," 2025.
+- **[LEDGAR](https://huggingface.co/datasets/coastalcph/lex_glue)** (LexGLUE `ledgar` config) — CC BY 4.0. Tuggener et al., "LEDGAR," LREC 2020; derived from public-domain US SEC EDGAR filings. Chalkidis et al., "LexGLUE," ACL 2022.
+Hard negatives were mined with BM25 from each dataset's own corpus. **No MLEB / `isaacus/contractual-clause-retrieval` data and no web-scraped data were used in training** — MLEB is used only as an evaluation benchmark.
+## Limitations
+- English-language commercial contracts (US-style); other jurisdictions/languages are out of distribution.
+- Late-interaction (multi-vector) storage is heavier per document than single-vector embedders.
+- The MLEB clause task is small (90 docs); treat ±1–2 points as noise.
+- Trained on a limited taxonomy of clause types; expect lower retrieval quality on provision types far from that taxonomy.
+## Acknowledgments
+- Training data: [The Atticus Project](https://www.atticusprojectai.org/) (CUAD, ACORD); Tuggener et al. & [coastalcph/LexGLUE](https://github.com/coastalcph/lex-glue) (LEDGAR).
+- Base model: [LightOn](https://huggingface.co/lightonai) (GTE-ModernColBERT-v1), built with [PyLate](https://github.com/lightonai/pylate).
+- Benchmark: [Isaacus](https://isaacus.com/mleb) (MLEB) — evaluation only, not training.
+## Full model architecture
+```
+ColBERT(
+  (0): Transformer({'max_seq_length': 299, 'architecture': 'ModernBertModel'})
+  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'Identity'})
+)
+```
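
The README above says queries and passages are scored with the MaxSim operator. As a minimal sketch of that idea, the pure-Python example below sums, over query tokens, the best dot product against any document token. It uses toy 2-d vectors rather than the model's 128-d embeddings, `dot` and `maxsim` are hypothetical helper names, and it is not PyLate's implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens, doc_tokens):
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (2-d for readability; the model emits 128-d vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # matches both query tokens exactly
doc_b = [[0.6, 0.8]]                          # only a partial match for each

score_a = maxsim(query, doc_a)  # 1.0 + 1.0
score_b = maxsim(query, doc_b)  # 0.6 + 0.8

print(score_a, score_b)
```

Because every query token looks for its own best match, a passage scores well only if it covers all parts of the query, which is the usual argument for late interaction over a single pooled vector.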

clause_size_vs_ndcg.png ADDED Viewed

config.json ADDED Viewed

	@@ -0,0 +1,78 @@

+{
+  "architectures": [
+    "ModernBertModel"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 50281,
+  "classifier_activation": "gelu",
+  "classifier_bias": false,
+  "classifier_dropout": 0.0,
+  "classifier_pooling": "mean",
+  "cls_token_id": 50281,
+  "decoder_bias": true,
+  "deterministic_flash_attn": false,
+  "dtype": "float32",
+  "embedding_dropout": 0.0,
+  "eos_token_id": 50282,
+  "global_attn_every_n_layers": 3,
+  "gradient_checkpointing": false,
+  "hidden_activation": "gelu",
+  "hidden_size": 768,
+  "initializer_cutoff_factor": 2.0,
+  "initializer_range": 0.02,
+  "intermediate_size": 1152,
+  "layer_norm_eps": 1e-05,
+  "layer_types": [
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention"
+  ],
+  "local_attention": 128,
+  "max_position_embeddings": 8192,
+  "mlp_bias": false,
+  "mlp_dropout": 0.0,
+  "model_type": "modernbert",
+  "norm_bias": false,
+  "norm_eps": 1e-05,
+  "num_attention_heads": 12,
+  "num_hidden_layers": 22,
+  "pad_token_id": 50283,
+  "position_embedding_type": "absolute",
+  "repad_logits_with_grad": false,
+  "rope_parameters": {
+    "full_attention": {
+      "rope_theta": 160000.0,
+      "rope_type": "default"
+    },
+    "sliding_attention": {
+      "rope_theta": 10000.0,
+      "rope_type": "default"
+    }
+  },
+  "sep_token_id": 50282,
+  "sparse_pred_ignore_index": -100,
+  "sparse_prediction": false,
+  "tie_word_embeddings": true,
+  "transformers_version": "5.3.0",
+  "vocab_size": 50370
+}

config_sentence_transformers.json ADDED Viewed

	@@ -0,0 +1,53 @@

+{
+  "__version__": {
+    "sentence_transformers": "5.3.0",
+    "transformers": "5.3.0",
+    "pytorch": "2.9.0+cu128"
+  },
+  "prompts": {
+    "query": "",
+    "document": ""
+  },
+  "default_prompt_name": null,
+  "similarity_fn_name": "MaxSim",
+  "query_prefix": "[Q] ",
+  "document_prefix": "[D] ",
+  "query_length": 48,
+  "document_length": 300,
+  "attend_to_expansion_tokens": false,
+  "skiplist_words": [
+    "!",
+    "\"",
+    "#",
+    "$",
+    "%",
+    "&",
+    "'",
+    "(",
+    ")",
+    "*",
+    "+",
+    ",",
+    "-",
+    ".",
+    "/",
+    ":",
+    ";",
+    "<",
+    "=",
+    ">",
+    "?",
+    "@",
+    "[",
+    "\\",
+    "]",
+    "^",
+    "_",
+    "`",
+    "{",
+    "|",
+    "}",
+    "~"
+  ],
+  "do_query_expansion": false
+}
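
The README's limitation about multi-vector storage can be made concrete with the lengths configured above. The sketch below is a back-of-the-envelope upper bound that assumes uncompressed float32 storage and a full 300-token passage. The single-vector comparison (one 768-d vector per document) is a hypothetical baseline, not a measurement of any particular model.

```python
# From config_sentence_transformers.json and 1_Dense/config.json.
DOCUMENT_LENGTH = 300   # maximum tokens kept per document
TOKEN_DIM = 128         # output dimension of the Dense projection

# Assumptions for illustration only.
BYTES_PER_FLOAT32 = 4
SINGLE_VECTOR_DIM = 768  # hypothetical single-vector embedder at hidden_size

multi_vector_bytes = DOCUMENT_LENGTH * TOKEN_DIM * BYTES_PER_FLOAT32
single_vector_bytes = SINGLE_VECTOR_DIM * BYTES_PER_FLOAT32
ratio = multi_vector_bytes / single_vector_bytes

print(multi_vector_bytes)   # 153600
print(single_vector_bytes)  # 3072
print(ratio)                # 50.0
```

Shorter passages produce fewer token vectors, so real footprints sit below this bound.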

model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3a7fc1eed36a0b343e0a80b5e262bf93b04cb49a20e9b6e79a11b2df3e9777db
+size 596076280
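
The pointer size above gives a quick cross-check of the README's 149M parameter figure. The estimate assumes every tensor is stored as float32 (the `dtype` in `config.json`) and ignores the small file header, so it slightly overstates the true count.

```python
# Size taken from the LFS pointer above.
FILE_SIZE_BYTES = 596076280
BYTES_PER_FLOAT32 = 4  # assumption: all weights stored as float32

# Upper bound on the number of stored parameters (header bytes included).
approx_params = FILE_SIZE_BYTES // BYTES_PER_FLOAT32
approx_params_millions = round(approx_params / 1e6, 1)

print(approx_params)           # 149019070
print(approx_params_millions)  # 149.0
```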

modules.json ADDED Viewed

	@@ -0,0 +1,14 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Dense",
+    "type": "pylate.models.Dense.Dense"
+  }
+]

sentence_bert_config.json ADDED Viewed

	@@ -0,0 +1,4 @@

+{
+    "max_seq_length": 299,
+    "do_lower_case": false
+}

tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,23 @@

+{
+  "backend": "tokenizers",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "is_local": true,
+  "mask_token": "[MASK]",
+  "max_length": 299,
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 299,
+  "pad_to_multiple_of": null,
+  "pad_token": "[MASK]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "tokenizer_class": "TokenizersBackend",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}