legal-embeddings-bge-base

A BAAI/bge-base-en-v1.5 model fine-tuned on legal (query, passage) pairs from the LegalBench-RAG benchmark. Built for LegalKit, an open-source legal AI platform.

⚠️ Honest Assessment

This model does not meaningfully outperform the base BGE model on held-out legal data. We are publishing it for reproducibility and transparency, not as a recommended improvement over the base model.

Our initial evaluation showed strong results (99% R@64), but it was run on the same data distribution the model was trained on. A proper leave-one-dataset-out (LODO) cross-validation, training on 3 of the 4 benchmark datasets and evaluating on the held-out 4th, tells a different story:

LODO Cross-Validation Results (the fair comparison)

| Held-out dataset | Fine-tuned R@64 | Base BGE R@64 | Δ |
|---|---|---|---|
| CUAD | 96.5% | 95.7% | +0.7pp |
| MAUD | 80.7% | 82.6% | -1.9pp |
| ContractNLI | 101.8% | 101.8% | +0.0pp |
| PrivacyQA | 103.2% | 103.2% | +0.0pp |
| Average | 95.5% | 95.8% | -0.3pp |

The fine-tuned model is within noise of the base model, and slightly worse on MAUD (M&A agreements). The high average recall (95%+) for both models is driven by our large paragraph-aware chunk size (~1500 chars), not embedding quality.
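The LODO protocol described above can be sketched as follows. This is a minimal illustration, not the evaluation script behind the table: `lodo_splits` and `recall_at_k` are hypothetical helpers, and recall is simplified here to "fraction of queries with at least one relevant chunk in the top k", which may differ from the span-based metric the benchmark reports.

```python
from typing import Dict, Iterator, List, Set, Tuple

DATASETS = ["CUAD", "MAUD", "ContractNLI", "PrivacyQA"]


def lodo_splits(names: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training datasets, held-out dataset) for each LODO fold."""
    for held_out in names:
        train = [n for n in names if n != held_out]
        yield train, held_out


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant chunk in the top k.

    Simplified stand-in for the benchmark metric (an assumption).
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for query_id, chunk_ids in ranked.items()
        if relevant.get(query_id, set()) & set(chunk_ids[:k])
    )
    return hits / len(ranked)


for train, held_out in lodo_splits(DATASETS):
    # In a real run: fine-tune on `train`, then score both the fine-tuned
    # and the base model on `held_out` with the same metric.
    print(f"train on {train}, evaluate on {held_out}")
```

Because the held-out dataset never contributes training pairs, any gain that survives this loop is more likely to reflect transferable domain knowledge than memorized benchmark patterns.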

What happened

The training data (9,409 pairs) was extracted directly from LegalBench-RAG ground-truth annotations. When we evaluated on LegalBench-RAG, we were testing in-distribution: the model had seen queries and documents from the same benchmark. The base BAAI/bge-base-en-v1.5 model is already excellent on legal text, and our fine-tuning added dataset-specific patterns rather than generalizable legal domain knowledge.

Recommendation

For most use cases, use the base BAAI/bge-base-en-v1.5 directly. It performs equivalently on held-out legal datasets without the complexity of a custom model. Invest in retrieval architecture (hybrid semantic + BM25, cross-reference expansion, reranking) rather than embedding fine-tuning; that's where the real gains are.
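One common way to combine a semantic ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below is illustrative only: this card does not say which fusion method LegalKit uses, and `rrf_fuse`, the chunk IDs, and the constant `k=60` (a conventional default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


semantic = ["c1", "c2", "c3"]  # hypothetical dense-retrieval ranking
bm25 = ["c3", "c1", "c4"]      # hypothetical BM25 ranking
fused = rrf_fuse([semantic, bm25])
print(fused)
```

RRF only needs ranks, not raw scores, so it sidesteps the problem of putting cosine similarities and BM25 scores on a common scale.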

Previous Claims (Retracted)

Our earlier model card stated this model "beats text-embedding-3-large." That comparison was based on in-distribution evaluation and is not supported by held-out results. We retract that claim. We have not conducted a fair head-to-head comparison with text-embedding-3-large using held-out data.

Model Description

A bi-encoder sentence embedding model that produces 768-dimensional dense embeddings. Fine-tuned with MultipleNegativesRankingLoss on legal (query, positive_passage, hard_negative) triples.
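For intuition, MultipleNegativesRankingLoss treats every other passage in the batch as a negative for a given query and applies cross-entropy over scaled cosine similarities. The pure-Python sketch below illustrates that idea under stated assumptions; it is not the sentence-transformers implementation. `mnr_loss` is a hypothetical helper, and the scale of 20 mirrors that library's default.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(
    queries: Sequence[Vector],
    positives: Sequence[Vector],
    hard_negatives: Sequence[Vector],
    scale: float = 20.0,
) -> float:
    """Mean cross-entropy where query i must pick positive i among all
    positives and hard negatives in the batch."""
    candidates: List[Vector] = list(positives) + list(hard_negatives)
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]  # -log softmax probability of the positive
    return total / len(queries)


# Toy 2-dimensional batch (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[0.6, 0.8], [0.8, 0.6]]
print(f"loss: {mnr_loss(queries, positives, hard_negatives):.4f}")
```

The loss shrinks as each query moves closer to its own positive and further from every other candidate, which is why larger batches and harder negatives give a stronger training signal.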

Usage

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("epequeno/legal-embeddings-bge-base")

query = "What is the indemnification cap under this agreement?"
passage = "The aggregate liability of Seller under Section 8.1 shall not exceed Five Million Dollars ($5,000,000)."

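# Encode the query and the passage into 768-dimensional vectors.
query_emb = model.encode(query)
passage_emb = model.encode(passage)

# Cosine similarity: dot product divided by the product of the vector norms.
score = np.dot(query_emb, passage_emb) / (np.linalg.norm(query_emb) * np.linalg.norm(passage_emb))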
print(f"Similarity: {score:.3f}")
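To rank several candidate passages for one query, the same cosine score can be computed for each passage embedding and the results sorted. The sketch below uses plain Python lists so it runs without downloading the model. `rank_by_cosine` is a hypothetical helper, and the toy vectors stand in for embeddings that would normally come from `model.encode(...)`.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(
    query_emb: Vector,
    passage_embs: Sequence[Vector],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the top_k most similar passages."""
    scored = [(i, cosine(query_emb, p)) for i, p in enumerate(passage_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for 768-dimensional embeddings.
query_emb = [1.0, 0.0, 0.0]
passage_embs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
top = rank_by_cosine(query_emb, passage_embs, top_k=2)
for index, score in top:
    print(f"passage {index}: {score:.3f}")
```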

Training Details

| Parameter | Value |
|---|---|
| Base model | BAAI/bge-base-en-v1.5 |
| Training examples | 9,409 (query, positive, hard_negative) triples |
| Validation examples | 1,045 |
| Loss | MultipleNegativesRankingLoss |
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Max sequence length | 256 tokens |
| GPU | NVIDIA A10G |
| Training time | ~13 minutes |

Training Data

Positive pairs extracted from LegalBench-RAG ground-truth span annotations across 4 datasets:

| Source | Pairs | Domain |
|---|---|---|
| CUAD | 6,189 | Commercial contracts (41 clause types) |
| MAUD | 2,456 | M&A merger agreements |
| ContractNLI | 1,360 | Non-disclosure agreements |
| PrivacyQA | 449 | Privacy policies |

Hard negatives were mined from the same documents (nearby non-overlapping spans).
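One way to picture "nearby non-overlapping spans" is sketched below. This is an assumption about the procedure, not the mining script used for this model: `mine_hard_negative`, the character-offset span representation, and the "smallest gap wins" rule are all hypothetical.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def gap(a: Span, b: Span) -> int:
    """Number of characters between two non-overlapping spans."""
    return max(a[0], b[0]) - min(a[1], b[1])


def mine_hard_negative(positive: Span, candidates: List[Span]) -> Optional[Span]:
    """Return the candidate closest to `positive` that does not overlap it."""
    non_overlapping = [c for c in candidates if not overlaps(c, positive)]
    if not non_overlapping:
        return None
    return min(non_overlapping, key=lambda c: gap(c, positive))


# Hypothetical chunk spans from one document; the positive is the annotated span.
positive = (1500, 3000)
chunks = [(0, 1500), (1400, 2900), (3200, 4700), (6000, 7500)]
print(mine_hard_negative(positive, chunks))
```

Negatives drawn from the same document tend to share vocabulary and style with the positive, which makes them harder to separate than passages sampled at random from other documents.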

Limitations

  • Does not improve over the base model on held-out legal data (see LODO results above)
  • Optimized for US commercial contracts and privacy policies
  • Max sequence length is 256 tokens
  • Best used in hybrid retrieval (semantic + BM25), not semantic-only

Citation

@software{legalkit2026,
  author = {Pequeno, Steven},
  title = {LegalKit: Open-Source Legal AI Platform},
  year = {2026},
  url = {https://github.com/legalkit/legalkit}
}

Evaluation results

  • Recall@64 (average over held-out datasets) on LegalBench-RAG mini: 0.955 (self-reported)
  • Recall@16 (average over held-out datasets) on LegalBench-RAG mini: 0.798 (self-reported)