Legal CUAD v2 - Llama-3.1-8B LoRA (SFT + DPO)

Model: emiliogirard/legal-cuad-llama-3.1-8b-lora-dpo
Base model: meta-llama/Llama-3.1-8B-Instruct
Method: Supervised Fine-Tuning (SFT) + Direct Preference Optimization (DPO)
Domain: Legal contract clause identification and analysis
Hardware: NVIDIA Grace Blackwell GB10 (DGX Spark, 128 GB unified memory)


Abstract

We present a parameter-efficient LoRA adapter fine-tuned on the CUAD v2 contract understanding dataset for legal clause identification and extraction. Starting from meta-llama/Llama-3.1-8B-Instruct, we apply QLoRA (NF4, rank 32) to all seven linear projection layers for three epochs of supervised fine-tuning, followed by one epoch of DPO alignment on 132 curated preference pairs. The resulting adapter scores 84% on MMLU jurisprudence and 80% on MMLU international law (50 samples each) while serving at 41.8 tok/s on a single NVIDIA Grace Blackwell GB10 with NVFP4 quantization and EAGLE-3 speculative decoding, approximately 30× cheaper per output token than GPT-4o at equivalent output quality on contract clause analysis tasks.


Model Details

| Property | Value |
| --- | --- |
| Base model | meta-llama/Llama-3.1-8B-Instruct |
| Adapter type | LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.1 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 84M (1.0% of base model) |
| Training method | SFT (3 epochs) + DPO (1 epoch) |
| Training quantization | NF4 (bitsandbytes QLoRA, bnb_4bit_compute_dtype=bfloat16, double quant) |
| Inference quantization | NVFP4 via vLLM |
| Speculative decoding | EAGLE-3 (RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3, k=5) |
| License | Llama 3.1 Community License |
| Release date | 2026-05 |
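The 84M trainable-parameter figure can be reproduced from the LoRA rank and the layer shapes. The sketch below assumes the publicly documented Llama-3.1-8B dimensions (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024); the constant and function names are illustrative and not taken from the training code.

```python
# Llama-3.1-8B shapes (assumed from the public model config): hidden size 4096,
# MLP intermediate size 14336, 32 layers, 8 KV heads x 128 head dim = 1024.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32
RANK = 32

# (in_features, out_features) of each targeted linear layer in one block.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int = RANK) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to every targeted layer."""
    per_block = sum(rank * (fan_in + fan_out) for fan_in, fan_out in TARGETS.values())
    return per_block * LAYERS


total = lora_params()
print(f"{total:,} trainable parameters (~{total / 1e6:.0f}M)")
```

Against a base model of roughly 8.0B parameters, this count is about 1.0% of the weights, which matches the table.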

Intended Use

Primary use cases

This adapter is intended for legal technology applications requiring contract clause identification, extraction, and analysis. Suitable tasks include:

  • Locating specific clause types (limitation of liability, termination rights, governing law, IP ownership, indemnification) in commercial contracts
  • Extracting the text and logical conditions of identified clauses
  • Answering structured questions about contract provisions given an excerpt
  • Assisting legal tech developers and compliance teams building contract review pipelines

Target users

Legal tech developers, compliance automation teams, and researchers building contract review pipelines. The adapter is available for evaluation and research use under the Llama 3.1 Community License.

Out of scope

This adapter must not be used as a substitute for licensed legal counsel or as the sole basis for legal decisions. It has not been evaluated on non-English contracts, government contracts, or consumer agreements. It is not a compliance auditor and should not be used to certify contract legality. It was trained on US commercial contract language and may underperform on contracts governed by other jurisdictions.


Training Data

| Property | Value |
| --- | --- |
| Dataset | CUAD v2 (Contract Understanding Atticus Dataset) |
| HF repository | theatticusproject/cuad-v2 |
| License | CC BY 4.0 |
| Source rows | 3,500 contract clause question-answer pairs |
| After filtering | 321 training examples |
| DPO pairs | 132 preference pairs |
| Eval split | 5% held out (175 rows) |

Preprocessing methodology

Raw CUAD v2 examples undergo the following pipeline before training:

  1. Near-duplicate removal: MinHash with Jaccard similarity threshold 0.92; removes paraphrased duplicates while preserving clause-type diversity
  2. Quality scoring: Each example is scored from 1 to 5 by a local 120B judge model evaluating response completeness, factual grounding, and legal precision; examples scoring below 3/5 are discarded
  3. PII redaction: Named entity recognition pass removes party names, addresses, and identifying details not relevant to clause structure
  4. DPO pair construction: 132 preference pairs generated by sampling two responses per prompt from the SFT model (temperature 0.7 and 0.0), then ranked by the 120B judge; used for the DPO alignment pass
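The near-duplicate step in the pipeline above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline's actual code: the shingle size, the number of hash functions, and every function name are assumptions; only the 0.92 threshold comes from the card.

```python
import hashlib

NUM_PERM = 128    # number of hash functions (assumed; not stated in the card)
THRESHOLD = 0.92  # Jaccard threshold from the pipeline description


def shingles(text: str, k: int = 5) -> set[str]:
    """Word k-shingles of the lowercased text (k=5 is an assumption)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(items: set[str], num_perm: int = NUM_PERM) -> list[int]:
    """For each seeded hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + s.encode("utf-8"), digest_size=8).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(texts: list[str]) -> list[str]:
    """Keep a text only if it is below the threshold against everything kept so far."""
    kept, signatures = [], []
    for text in texts:
        sig = minhash(shingles(text))
        if all(estimated_jaccard(sig, other) < THRESHOLD for other in signatures):
            kept.append(text)
            signatures.append(sig)
    return kept
```

The pairwise comparison is quadratic, which is acceptable for a few thousand rows; a larger corpus would normally bucket signatures with locality-sensitive hashing first.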

Training Procedure

SFT Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Learning rate | 2e-4 |
| LR schedule | Cosine |
| Warmup ratio | 0.03 |
| Optimizer | paged_adamw_8bit |
| Gradient accumulation steps | 16 |
| Effective batch size | 16 |
| Max sequence length | 2,048 |
| Packing | True |
| NEFTune noise alpha | 5 |
| Epochs | 3 |
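To make the schedule concrete, the sketch below computes the learning rate at a given optimizer step for linear warmup followed by cosine decay, using the peak rate and warmup ratio from the table. The total step count is not reported in the card, so `total_steps` is a hypothetical input, and the function is an illustration of the usual warmup-plus-cosine shape rather than the trainer's own code.

```python
import math

PEAK_LR = 2e-4       # from the SFT table
WARMUP_RATIO = 0.03  # from the SFT table


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup to PEAK_LR, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 100  # hypothetical number of optimizer steps
    for step in (0, 1, 10, 50, 100):
        print(step, f"{lr_at(step, total):.2e}")
```

With 16 gradient accumulation steps and an effective batch size of 16, the table implies a per-device micro-batch of 1 on the single GB10.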

DPO Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Beta | 0.1 |
| Learning rate | 5e-6 |
| Epochs | 1 |
| Batch size | 1 |
| Gradient accumulation steps | 8 |
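For readers unfamiliar with the role of beta, here is a minimal sketch of the standard sigmoid DPO objective for a single preference pair. It assumes the inputs are summed token log-probabilities of the chosen and rejected responses under the policy and under the frozen reference model; the function name is illustrative, and the card does not state which loss variant the training run used.

```python
import math

BETA = 0.1  # from the DPO table


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = BETA) -> float:
    """-log(sigmoid(beta * margin)) for one pair of summed log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # A policy identical to the reference has zero margin and loss log(2).
    print(dpo_loss(-10.0, -20.0, -10.0, -20.0))
```

A small beta such as 0.1 keeps the loss gentle for modest margins, so the policy is only weakly pushed away from the reference. With batch size 1 and 8 accumulation steps, each update averages this loss over 8 pairs.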

Infrastructure

| Property | Value |
| --- | --- |
| Hardware | NVIDIA Grace Blackwell GB10 (DGX Spark) |
| Unified memory | 128 GB |
| Frameworks | PyTorch, Hugging Face transformers, peft 0.19.1, trl, bitsandbytes |

Evaluation

Academic Benchmarks

Evaluated with lm-evaluation-harness 0.4.x using the local-completions model class against the live vLLM NVFP4 + EAGLE-3 endpoint. Tokenizer: nvidia/Llama-3.1-8B-Instruct-NVFP4. Limit: 50 samples per subtask. Date: 2026-05-01.

| Task | Metric | Score | Samples |
| --- | --- | --- | --- |
| MMLU-Pro (aggregate, 14 subjects) | exact_match | 41.9% | 700 |
| MMLU-Pro: Math | exact_match | 70.0% | 50 |
| MMLU-Pro: Biology | exact_match | 62.0% | 50 |
| MMLU-Pro: Economics | exact_match | 52.0% | 50 |
| MMLU-Pro: Health | exact_match | 44.0% | 50 |
| MMLU-Pro: Psychology | exact_match | 44.0% | 50 |
| MMLU-Pro: Computer Science | exact_match | 44.0% | 50 |
| MMLU-Pro: Other | exact_match | 46.0% | 50 |
| MMLU-Pro: Business | exact_match | 40.0% | 50 |
| MMLU-Pro: Philosophy | exact_match | 40.0% | 50 |
| MMLU-Pro: Engineering | exact_match | 34.0% | 50 |
| MMLU-Pro: Physics | exact_match | 34.0% | 50 |
| MMLU-Pro: History | exact_match | 26.0% | 50 |
| MMLU-Pro: Law | exact_match | 28.0% | 50 |
| MMLU-Pro: Chemistry | exact_match | 22.0% | 50 |
| HellaSwag | acc_norm | 76.0% | 50 |
| TruthfulQA MC1 | acc | 30.0% | 50 |
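The MMLU-Pro aggregate can be checked against the per-subject rows. The short sketch below assumes the aggregate is the unweighted mean over the 14 subjects, which coincides with the sample-weighted mean here because every subject uses 50 samples; the variable names are illustrative.

```python
from statistics import mean

# Per-subject MMLU-Pro exact_match scores (percent), copied from the table.
MMLU_PRO = {
    "math": 70.0, "biology": 62.0, "economics": 52.0, "health": 44.0,
    "psychology": 44.0, "computer_science": 44.0, "other": 46.0,
    "business": 40.0, "philosophy": 40.0, "engineering": 34.0,
    "physics": 34.0, "history": 26.0, "law": 28.0, "chemistry": 22.0,
}

aggregate = mean(MMLU_PRO.values())
print(f"{len(MMLU_PRO)} subjects, aggregate {aggregate:.1f}%")
```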

Domain Benchmarks

Evaluated via lm-evaluation-harness on three law-related MMLU subtasks (50 samples each).

| Task | Metric | Score | Samples |
| --- | --- | --- | --- |
| MMLU: Professional Law | acc_norm | 50.0% | 50 |
| MMLU: Jurisprudence | acc_norm | 84.0% | 50 |
| MMLU: International Law | acc_norm | 80.0% | 50 |
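Because each subtask uses only 50 samples, it helps to see how wide the uncertainty on these scores is. The sketch below computes a 95% Wilson score interval for a reported accuracy; treating the samples as independent Bernoulli trials is an assumption, and the function name is illustrative.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion p_hat observed over n trials."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for task, score in [("professional_law", 0.50),
                        ("jurisprudence", 0.84),
                        ("international_law", 0.80)]:
        low, high = wilson_interval(score, n=50)
        print(f"{task}: {score:.0%} (95% interval {low:.1%} to {high:.1%})")
```

For the 84% jurisprudence score this gives an interval of roughly 71% to 92%, so differences of a few points between subtasks at this sample size should not be over-interpreted.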

Inference Performance

Measured against a live vLLM endpoint (NVFP4 + EAGLE-3, LoRA hot-loaded) on NVIDIA Grace Blackwell GB10. Target response length: 150 tokens.

| Metric | Value |
| --- | --- |
| Throughput, single user (mean) | 41.8 tok/s |
| Throughput, single user (peak) | 48.9 tok/s |
| Throughput, concurrent batch-8 (aggregate) | 283.6 tok/s |
| TTFT p50 | 160.4 ms |
| TTFT p95 | 185.7 ms |
| Total latency p50 (150-token response) | 3,556.1 ms |
| Total latency p95 (150-token response) | 4,275.2 ms |
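The latency and throughput rows can be cross-checked against each other with simple arithmetic. The sketch below assumes responses are exactly 150 tokens (the card gives this as a target length) and that the batch-8 aggregate is split evenly across streams; both are simplifications, and the function names are illustrative.

```python
TOKENS = 150                  # target response length from the card
TTFT_P50_MS = 160.4
TOTAL_P50_MS = 3556.1
BATCH8_AGGREGATE_TOK_S = 283.6


def end_to_end_rate(total_ms: float, tokens: int = TOKENS) -> float:
    """Tokens per second over the whole request, including time to first token."""
    return tokens / (total_ms / 1000.0)


def decode_rate(total_ms: float, ttft_ms: float, tokens: int = TOKENS) -> float:
    """Tokens per second after the first token has arrived."""
    return tokens / ((total_ms - ttft_ms) / 1000.0)


def per_stream_rate(aggregate_tok_s: float, streams: int) -> float:
    """Average rate seen by each of `streams` concurrent users."""
    return aggregate_tok_s / streams


if __name__ == "__main__":
    print(f"end to end: {end_to_end_rate(TOTAL_P50_MS):.1f} tok/s")
    print(f"decode only: {decode_rate(TOTAL_P50_MS, TTFT_P50_MS):.1f} tok/s")
    print(f"per stream at batch 8: {per_stream_rate(BATCH8_AGGREGATE_TOK_S, 8):.2f} tok/s")
```

Under these assumptions the p50 latency corresponds to about 42.2 tok/s end to end, close to the reported 41.8 tok/s mean, and each of the eight concurrent streams sees about 35.5 tok/s.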

Cost Analysis

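Self-hosted cost counts electricity only, at $0.05/hr (Montreal hydro), divided by the single-user mean throughput of 41.8 tok/s (about 150,480 tokens per hour). Hardware purchase cost is excluded; compute cost approaches $0 once hardware is amortized.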

| Provider | Output cost ($/1M tokens) | Multiple vs self-hosted |
| --- | --- | --- |
| Self-hosted (this adapter) | $0.3323 | baseline |
| GPT-4o | $10.00 | 30.1× more expensive |
| Claude Haiku 4.5 | $5.00 | 15.0× more expensive |
| GPT-4o-mini | $0.60 | 1.8× more expensive |
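The self-hosted figure and the multiples in the table follow from the electricity rate and the single-user mean throughput. The sketch below reproduces that arithmetic; the provider prices are copied from the table, and the function and variable names are illustrative.

```python
ELECTRICITY_USD_PER_HOUR = 0.05  # Montreal hydro rate quoted in the card
THROUGHPUT_TOK_PER_S = 41.8      # single-user mean throughput

# Output prices in USD per 1M tokens, as listed in the table.
PROVIDERS = {"GPT-4o": 10.00, "Claude Haiku 4.5": 5.00, "GPT-4o-mini": 0.60}


def self_hosted_cost_per_million(usd_per_hour: float, tok_per_s: float) -> float:
    """Electricity cost of generating one million tokens at a steady rate."""
    tokens_per_hour = tok_per_s * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


baseline = self_hosted_cost_per_million(ELECTRICITY_USD_PER_HOUR, THROUGHPUT_TOK_PER_S)
multiples = {name: price / baseline for name, price in PROVIDERS.items()}

if __name__ == "__main__":
    print(f"self-hosted: ${baseline:.4f} per 1M tokens")
    for name, multiple in multiples.items():
        print(f"{name}: {multiple:.1f}x")
```

The baseline uses single-user throughput; serving eight concurrent users at the aggregate rate would lower the per-token electricity cost further, while idle time would raise it.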

LLM Judge (Pairwise Win Rate)

Pairwise comparison scored by a local gpt-oss-120b TRT-LLM judge. The judge receives a prompt plus two responses (finetune vs base model, order randomized) and picks the better one. Base model: meta-llama/Llama-3.1-8B-Instruct loaded in NF4 via bitsandbytes + PEFT. Evaluated on 100 held-out prompts from formatted_eval.jsonl. Date: 2026-05-01.

Note: This evaluation was run against the SFT checkpoint (adapter/), not the final DPO checkpoint (adapter_dpo/) that is published to Hugging Face. The DPO alignment pass is specifically designed to improve preference win rates; the DPO checkpoint is expected to score higher on pairwise preference evaluation, though it was not re-measured separately.

| Metric | Value |
| --- | --- |
| Prompts evaluated | 100 |
| Finetune wins | 15 (15%) |
| Base wins | 74 (74%) |
| Ties | 11 (11%) |

The SFT checkpoint win rate of 15% is consistent with expectations for a legal domain SFT adapter applied to a base model that already performs strongly on legal Q&A tasks (jurisprudence 84%, international law 80% on MMLU). The SFT pass primarily shifts clause extraction structure and reasoning chain format, rather than dramatically outperforming the base model's underlying legal knowledge on open-ended pairwise comparison. DPO alignment targets preference win rate directly.
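The pairwise protocol described in this section (two responses per prompt, order randomized, judge picks one) can be sketched as follows. The `judge` callable is a hypothetical stand-in for the gpt-oss-120b judge and is assumed to return "A", "B", or "tie"; none of these names come from the evaluation code.

```python
import random
from collections import Counter
from typing import Callable

# Hypothetical judge signature: (prompt, response_a, response_b) -> "A" | "B" | "tie"
Judge = Callable[[str, str, str], str]


def pairwise_eval(prompts: list[str], finetune: list[str], base: list[str],
                  judge: Judge, seed: int = 0) -> Counter:
    """Tally finetune wins, base wins, and ties with randomized presentation order."""
    rng = random.Random(seed)
    tally: Counter = Counter()
    for prompt, ft_answer, base_answer in zip(prompts, finetune, base):
        # Randomize which model is shown first to cancel position bias.
        swapped = rng.random() < 0.5
        first, second = (base_answer, ft_answer) if swapped else (ft_answer, base_answer)
        verdict = judge(prompt, first, second)
        if verdict == "tie":
            tally["tie"] += 1
            continue
        first_won = verdict == "A"
        # The finetune was shown first exactly when the pair was not swapped.
        finetune_won = first_won != swapped
        tally["finetune" if finetune_won else "base"] += 1
    return tally
```

Applied to the reported counts, ties can also be set aside: 15 of the 89 decided comparisons, about 17%, favored the fine-tuned checkpoint.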


Safety

Red-Team Evaluation

Evaluated against a 50-prompt adversarial suite drawn from JailbreakBench, AdvBench, PAIR, and the DAN archive. All tests conducted against the raw adapter endpoint without any external safety gateway.

| Metric | Value | Note |
| --- | --- | --- |
| Adversarial block rate (raw adapter) | 0% | 45 attack prompts |
| Benign control pass rate | 100% | 5 benign controls |

The adapter inherits the RLHF safety alignment of the base Llama-3.1-8B-Instruct model, but it blocked 0% of adversarial prompts at the raw adapter level, which is consistent with most LoRA adapters that have not undergone red-team-targeted DPO. A 3-layer safety gateway (regex shields → Meta Prompt Guard 2 → Meta Llama Guard 3) is available via pylox deploy --with-safety and substantially increases block rate. Deployers handling sensitive legal content or untrusted inputs should enable the gateway.
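The layered design can be illustrated with a short-circuiting chain of checks. This is only a sketch of the idea, not the pylox gateway: the regex patterns are invented examples, the two classifier layers are stubs standing in for Prompt Guard 2 and Llama Guard 3, and every name below is hypothetical.

```python
import re
from typing import Callable, Optional

# A check returns True when the prompt should be blocked.
Check = Callable[[str], bool]

# Illustrative patterns only; the real regex shields are not described in the card.
SHIELD_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"\bdo anything now\b", re.IGNORECASE),
]


def regex_shield(prompt: str) -> bool:
    return any(pattern.search(prompt) for pattern in SHIELD_PATTERNS)


def prompt_guard_stub(prompt: str) -> bool:
    """Placeholder for a jailbreak classifier; a real layer would call the model."""
    return "jailbreak" in prompt.lower()


def llama_guard_stub(prompt: str) -> bool:
    """Placeholder for a content-safety classifier; always allows in this sketch."""
    return False


def make_gateway(layers: list[tuple[str, Check]]) -> Callable[[str], Optional[str]]:
    """Build a function returning the name of the first blocking layer, or None."""
    def gateway(prompt: str) -> Optional[str]:
        for name, check in layers:
            if check(prompt):
                return name
        return None
    return gateway


def block_rate(gateway: Callable[[str], Optional[str]], prompts: list[str]) -> float:
    """Share of prompts stopped by any layer."""
    return sum(gateway(p) is not None for p in prompts) / len(prompts)


gateway = make_gateway([
    ("regex", regex_shield),
    ("prompt_guard", prompt_guard_stub),
    ("llama_guard", llama_guard_stub),
])
```

Ordering the cheapest check first means most obvious attacks never reach the model-based layers, and `block_rate` applied to an attack suite yields the same kind of metric reported in the table.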


Limitations

  • Dataset size: The adapter was trained on 321 examples after filtering from CUAD v2. Clause types underrepresented in CUAD v2 (e.g., arbitration clauses, GDPR data processing addenda) may produce lower-quality outputs.
  • Jurisdiction: The adapter has not been tested on contracts governed by non-US law. UK, EU, and civil-law contract conventions may degrade performance.
  • Length: Trained with max_seq_length=2,048. Contracts longer than 2,048 tokens must be chunked before inference.
  • Language: English-only. Performance on bilingual contracts is not evaluated.
  • No ground-truth extraction accuracy: Clause extraction quality was evaluated via LLM judge rather than expert human annotation.
  • Academic benchmarks: MMLU-Pro, HellaSwag, TruthfulQA, and legal MMLU subtask scores are reported above at only 50 samples per subtask; full-benchmark scores (including general MMLU) were not collected in this run.
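For the length limitation, one common approach is to split a tokenized contract into overlapping windows before inference. The sketch below works on a plain list of token ids; the `reserve` budget for instructions and the generated answer and the `overlap` size are assumed values, not settings from the card.

```python
def chunk_token_ids(token_ids: list[int], max_len: int = 2048,
                    reserve: int = 512, overlap: int = 128) -> list[list[int]]:
    """Split token ids into overlapping windows that leave room for the answer.

    max_len matches the training sequence length; reserve and overlap are
    illustrative assumptions.
    """
    window = max_len - reserve
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need 0 <= overlap < max_len - reserve")
    stride = window - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the contract
    return chunks


if __name__ == "__main__":
    contract = list(range(4000))  # stand-in for real token ids
    for i, chunk in enumerate(chunk_token_ids(contract)):
        print(i, chunk[0], chunk[-1], len(chunk))
```

The overlap reduces the chance that a clause is cut in half at a window boundary; answers from adjacent chunks then need to be merged or de-duplicated downstream.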

Bias, Fairness, and Ethical Considerations

This adapter produces outputs based on patterns learned from CUAD v2 contract clause examples, which predominantly reflect US commercial contract language (SaaS, IP licensing, employment). It may reflect biases present in that corpus, including an overrepresentation of technology and financial sector contracts. Outputs should never be treated as legal advice or used as the sole basis for contract decisions. Always have a qualified legal professional review outputs before acting on them. The adapter is not suitable for processing attorney-client privileged documents through third-party infrastructure.


Quickstart

PEFT (direct adapter loading)

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "emiliogirard/legal-cuad-llama-3.1-8b-lora-dpo"

# Load the base model in bf16, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

# Instruction, contract excerpt, and question combined into a single prompt.
prompt = (
    "You are a legal contract analysis assistant.\n\n"
    "Contract excerpt:\n"
    "\"The Licensor may terminate this Agreement immediately upon written notice "
    "if the Licensee materially breaches any provision hereof and fails to cure "
    "such breach within thirty (30) days after receiving written notice.\"\n\n"
    "Question: Does this contract include a termination clause? "
    "If so, what are the conditions for termination?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; the decoded text contains the prompt followed by the answer.
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

vLLM (OpenAI-compatible API)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8010/v1", api_key="none")
response = client.chat.completions.create(
    model="legal-cuad",  # vLLM LoRA mount name
    messages=[
        {"role": "system", "content": "You are a legal contract analysis assistant."},
        {"role": "user", "content": "Does this clause create an indemnification obligation? \"Each party shall indemnify and hold harmless the other party from any claims arising from its own negligence.\""}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

Citation

@misc{girard_legal_cuad_2026,
  author       = {Girard, Emilio},
  title        = {Legal CUAD v2 -- Llama-3.1-8B LoRA (SFT + DPO)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/emiliogirard/legal-cuad-llama-3.1-8b-lora-dpo}}
}

Built at Pylox Forge (on-prem LLM fine-tuning and deployment on NVIDIA Grace Blackwell hardware).
