Legal CUAD v2 — Llama-3.1-8B LoRA (SFT + DPO)

Model: emiliogirard/legal-cuad-llama-3.1-8b-lora-dpo
Base model: meta-llama/Llama-3.1-8B-Instruct
Method: Supervised Fine-Tuning (SFT) + Direct Preference Optimization (DPO)
Domain: Legal contract clause identification and analysis
Hardware: NVIDIA Grace Blackwell GB10 (DGX Spark, 128 GB unified memory)

Abstract

We present a parameter-efficient LoRA adapter fine-tuned on the CUAD v2 contract understanding dataset for legal clause identification and extraction. Starting from meta-llama/Llama-3.1-8B-Instruct, we apply QLoRA (NF4, rank 32) to all seven linear projection modules over three epochs of supervised fine-tuning, followed by one epoch of DPO alignment on 132 curated preference pairs. The resulting adapter scores 84% on MMLU jurisprudence and 80% on MMLU international law (50 samples each), and serves at 41.8 tok/s on a single NVIDIA Grace Blackwell GB10 with NVFP4 quantization and EAGLE-3 speculative decoding. Counting electricity only, that is approximately 30× cheaper per output token than GPT-4o.

Model Details

Property	Value
Base model	`meta-llama/Llama-3.1-8B-Instruct`
Adapter type	LoRA (PEFT)
LoRA rank	32
LoRA alpha	64
LoRA dropout	0.1
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters	~84M (~1.0% of base model)
Training method	SFT (3 epochs) + DPO (1 epoch)
Training quantization	NF4 (bitsandbytes QLoRA, bnb_4bit_compute_dtype=bfloat16, double quant)
Inference quantization	NVFP4 via vLLM
Speculative decoding	EAGLE-3 (`RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3`, k=5)
License	Llama 3.1 Community License
Release date	2026-05
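
The trainable-parameter figure in the table can be recomputed from the LoRA rank and the shapes of the adapted layers. The sketch below assumes the published Llama-3.1-8B dimensions (hidden size 4096, MLP size 14336, 32 decoder layers, 8 key/value heads of dimension 128) and an approximate base parameter count of 8.03B. The names `SHAPES` and `lora_params` are illustrative and do not come from the training code.

```python
# Each LoRA-adapted linear layer (d_in -> d_out) adds two low-rank matrices:
# A with shape (r, d_in) and B with shape (d_out, r), i.e. r * (d_in + d_out) weights.
RANK = 32                # LoRA rank from the table above
NUM_LAYERS = 32          # assumed: Llama-3.1-8B decoder layers
HIDDEN = 4096            # assumed: hidden size
MLP = 14336              # assumed: intermediate (MLP) size
KV_DIM = 8 * 128         # assumed: 8 key/value heads x head dimension 128
BASE_PARAMS = 8.03e9     # assumed: approximate base model parameter count

# (d_in, d_out) for each of the seven target modules in one decoder layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(rank, shapes, num_layers):
    """Count LoRA weights added across all decoder layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


total = lora_params(RANK, SHAPES, NUM_LAYERS)
print(f"{total:,} trainable parameters ({total / BASE_PARAMS:.2%} of base)")
```

Under these assumptions the count comes to 83,886,080, which agrees with the ~84M (~1.0%) reported above.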

Intended Use

Primary use cases

This adapter is intended for legal technology applications requiring contract clause identification, extraction, and analysis. Suitable tasks include:

Locating specific clause types (limitation of liability, termination rights, governing law, IP ownership, indemnification) in commercial contracts
Extracting the text and logical conditions of identified clauses
Answering structured questions about contract provisions given an excerpt
Assisting legal tech developers and compliance teams building contract review pipelines

Target users

Legal tech developers, compliance automation teams, and researchers building contract review pipelines. The adapter is available for evaluation and research use under the Llama 3.1 Community License.

Out of scope

This adapter must not be used as a substitute for licensed legal counsel or as the sole basis for legal decisions. It has not been evaluated on non-English contracts, government contracts, or consumer agreements. It is not a compliance auditor and should not be used to certify contract legality. It was trained on US commercial contract language and may underperform on contracts governed by other jurisdictions.

Training Data

Property	Value
Dataset	CUAD v2 — Contract Understanding Atticus Dataset
HF repository	theatticusproject/cuad-v2
License	CC BY 4.0
Source rows	3,500 contract clause question-answer pairs
After filtering	321 training examples
DPO pairs	132 preference pairs
Eval split	5% held out (175 rows)

Preprocessing methodology

Raw CUAD v2 examples undergo the following pipeline before training:

Near-duplicate removal — MinHash with Jaccard similarity threshold 0.92; removes paraphrased duplicates while preserving clause-type diversity
Quality scoring — Each example is scored 1–5 by a local 120B judge model evaluating response completeness, factual grounding, and legal precision; examples scoring below 3/5 are discarded
PII redaction — Named entity recognition pass removes party names, addresses, and identifying details not relevant to clause structure
DPO pair construction — 132 preference pairs generated by sampling two responses per prompt from the SFT model (temperature 0.7 and 0.0), then ranked by the 120B judge; used for the DPO alignment pass
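
The first two pipeline steps can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's actual code: the signature length, the 3-word shingling, the `text`/`score` record schema, and all function names are assumptions. Only the 0.92 similarity threshold and the 3/5 quality cutoff come from the description above, and the judge scores are taken as already computed.

```python
import hashlib
import re

NUM_PERM = 128            # assumed MinHash signature length
JACCARD_THRESHOLD = 0.92  # similarity threshold from the pipeline description
MIN_QUALITY = 3           # examples scoring below 3/5 are discarded


def shingles(text, n=3):
    """Lowercased word n-grams; whitespace and punctuation differences are ignored."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(text, num_perm=NUM_PERM):
    """One minimum per salted hash function approximates a random permutation."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedupe(examples):
    """Greedy near-duplicate removal: keep the first example of each similar group."""
    kept, kept_sigs = [], []
    for ex in examples:
        sig = minhash(ex["text"])
        if any(estimated_jaccard(sig, s) >= JACCARD_THRESHOLD for s in kept_sigs):
            continue
        kept.append(ex)
        kept_sigs.append(sig)
    return kept


def filter_by_quality(examples):
    """Drop examples whose (precomputed) judge score is below the cutoff."""
    return [ex for ex in examples if ex["score"] >= MIN_QUALITY]


examples = [
    {"text": "The Licensor may terminate this Agreement upon written notice.", "score": 4},
    {"text": "the licensor may terminate this agreement upon written   notice", "score": 5},
    {"text": "Each party shall indemnify the other against third party claims.", "score": 2},
    {"text": "This Agreement is governed by the laws of the State of Delaware.", "score": 3},
]
cleaned = filter_by_quality(dedupe(examples))
print([ex["text"] for ex in cleaned])
```

The all-pairs comparison in `dedupe` is quadratic in the number of examples, which is workable for a few thousand rows; larger corpora would typically add locality-sensitive hashing on top of the signatures.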

Training Procedure

SFT Hyperparameters

Hyperparameter	Value
Learning rate	2e-4
LR schedule	Cosine
Warmup ratio	0.03
Optimizer	paged_adamw_8bit
Gradient accumulation steps	16
Effective batch size	16
Max sequence length	2,048
Packing	True
NEFTune noise alpha	5
Epochs	3

DPO Hyperparameters

Hyperparameter	Value
Beta	0.1
Learning rate	5e-6
Epochs	1
Batch size	1
Gradient accumulation steps	8
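
For readers unfamiliar with the role of beta, the standard DPO objective for a single preference pair can be written out directly. The sketch below is the textbook loss, not code taken from this training run; the function name and the example log-probabilities are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed as softplus(-margin) in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# At initialization the policy equals the reference model, so the margin is 0
# and the loss is log(2).
initial = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After the policy shifts probability toward the chosen response, the loss falls.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(initial, improved)
```

A smaller beta, such as the 0.1 used here, scales the margin down, so the policy needs a larger shift away from the reference model to reduce the loss by the same amount.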

Infrastructure

Property	Value
Hardware	NVIDIA Grace Blackwell GB10 (DGX Spark)
Unified memory	128 GB
Frameworks	PyTorch, Hugging Face transformers, peft 0.19.1, trl, bitsandbytes

Evaluation

Academic Benchmarks

Evaluated via lm-evaluation-harness 0.4.x using the local-completions model class against the live vLLM NVFP4+EAGLE-3 endpoint. Tokenizer: nvidia/Llama-3.1-8B-Instruct-NVFP4. Limit: 50 samples per subtask. Date: 2026-05-01.

Task	Metric	Score	Samples
MMLU-Pro (aggregate, 14 subjects)	exact_match	41.9%	700
MMLU-Pro — Math	exact_match	70.0%	50
MMLU-Pro — Biology	exact_match	62.0%	50
MMLU-Pro — Economics	exact_match	52.0%	50
MMLU-Pro — Health	exact_match	44.0%	50
MMLU-Pro — Psychology	exact_match	44.0%	50
MMLU-Pro — Computer Science	exact_match	44.0%	50
MMLU-Pro — Other	exact_match	46.0%	50
MMLU-Pro — Business	exact_match	40.0%	50
MMLU-Pro — Philosophy	exact_match	40.0%	50
MMLU-Pro — Engineering	exact_match	34.0%	50
MMLU-Pro — Physics	exact_match	34.0%	50
MMLU-Pro — History	exact_match	26.0%	50
MMLU-Pro — Law	exact_match	28.0%	50
MMLU-Pro — Chemistry	exact_match	22.0%	50
HellaSwag	acc_norm	76.0%	50
TruthfulQA MC1	acc	30.0%	50

Domain Benchmarks

Evaluated via lm-evaluation-harness on three law-related MMLU subtasks (50 samples each).

Task	Metric	Score	Samples
MMLU — Professional Law	acc_norm	50.0%	50
MMLU — Jurisprudence	acc_norm	84.0%	50
MMLU — International Law	acc_norm	80.0%	50

Inference Performance

Measured against a live vLLM endpoint (NVFP4 + EAGLE-3, LoRA hot-loaded) on NVIDIA Grace Blackwell GB10. Target response length: 150 tokens.

Metric	Value
Throughput — single user (mean)	41.8 tok/s
Throughput — single user (peak)	48.9 tok/s
Throughput — concurrent batch-8 (aggregate)	283.6 tok/s
TTFT p50	160.4 ms
TTFT p95	185.7 ms
Total latency p50 (150-token response)	3,556.1 ms
Total latency p95 (150-token response)	4,275.2 ms

Cost Analysis

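Self-hosted cost counts electricity only, at $0.05 per hour of operation (Montreal hydro), and the $0.3323 figure below corresponds to the single-user mean throughput of 41.8 tok/s. Hardware cost is excluded on the assumption that it is already amortized.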

Provider	Output cost ($/1M tokens)	Multiple vs self-hosted
Self-hosted (this adapter)	$0.3323	baseline
GPT-4o	$10.00	30.1× more expensive
Claude Haiku 4.5	$5.00	15.0× more expensive
GPT-4o-mini	$0.60	1.8× more expensive
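
The self-hosted figure and the multiples in the table can be reproduced from the electricity rate and the single-user mean throughput reported under Inference Performance. A minimal sketch (the function and constant names are illustrative):

```python
ELECTRICITY_USD_PER_HOUR = 0.05   # from the cost assumptions above
SINGLE_USER_TOK_PER_S = 41.8      # single-user mean throughput
API_PRICES_USD_PER_M = {          # output prices listed in the table
    "GPT-4o": 10.00,
    "Claude Haiku 4.5": 5.00,
    "GPT-4o-mini": 0.60,
}


def cost_per_million_tokens(tok_per_s, usd_per_hour):
    """Electricity cost of generating 1M tokens at a steady throughput."""
    hours = 1_000_000 / tok_per_s / 3600
    return hours * usd_per_hour


self_hosted = cost_per_million_tokens(SINGLE_USER_TOK_PER_S, ELECTRICITY_USD_PER_HOUR)
print(f"self-hosted: ${self_hosted:.4f} per 1M output tokens")
for provider, price in API_PRICES_USD_PER_M.items():
    print(f"{provider}: {price / self_hosted:.1f}x self-hosted")
```

Generating 1M tokens at 41.8 tok/s takes about 6.65 hours, which at $0.05/hr gives $0.3323. The comparison therefore assumes a fully utilized single-user stream and ignores hardware, idle time, and input-token pricing.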

LLM Judge (Pairwise Win Rate)

Pairwise comparison scored by a local gpt-oss-120b TRT-LLM judge. The judge receives a prompt plus two responses (finetune vs base model, order randomized) and picks the better one. Base model: meta-llama/Llama-3.1-8B-Instruct loaded in NF4 via bitsandbytes + PEFT. Evaluated on 100 held-out prompts from formatted_eval.jsonl. Date: 2026-05-01.

Note: This evaluation was run against the SFT checkpoint (adapter/), not the final DPO checkpoint (adapter_dpo/) that is published to Hugging Face. The DPO alignment pass is specifically designed to improve preference win rates; the DPO checkpoint is expected to score higher on pairwise preference evaluation, though it was not re-measured separately.

Metric	Value
Prompts evaluated	100
Finetune wins	15 (15%)
Base wins	74 (74%)
Ties	11 (11%)
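
The order-randomized tally described above can be sketched as follows. The real judge is a gpt-oss-120b endpoint; here `judge` is any callable returning "A", "B", or "tie", and the length-based stub exists only to make the example runnable. All names are hypothetical.

```python
import random
from collections import Counter


def pairwise_win_rate(prompts, finetune_answers, base_answers, judge, seed=0):
    """Tally judge verdicts while hiding which model produced which response."""
    rng = random.Random(seed)
    tally = Counter()
    for prompt, ft, base in zip(prompts, finetune_answers, base_answers):
        # Randomize presentation order so position bias cannot favor one model.
        finetune_first = rng.random() < 0.5
        a, b = (ft, base) if finetune_first else (base, ft)
        verdict = judge(prompt, a, b)  # expected: "A", "B", or "tie"
        if verdict == "tie":
            tally["tie"] += 1
        elif (verdict == "A") == finetune_first:
            tally["finetune"] += 1
        else:
            tally["base"] += 1
    return tally


def stub_judge(prompt, a, b):
    """Placeholder judge: prefers the longer response. Not a real quality signal."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


prompts = ["q1", "q2", "q3"]
finetune_answers = ["a long detailed answer", "x", "same"]
base_answers = ["short", "a longer one", "same"]
result = pairwise_win_rate(prompts, finetune_answers, base_answers, stub_judge)
print(dict(result))
```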

The SFT checkpoint win rate of 15% is consistent with expectations for a legal domain SFT adapter applied to a base model that already performs strongly on legal Q&A tasks (jurisprudence 84%, international law 80% on MMLU). The SFT pass primarily shifts clause extraction structure and reasoning chain format, rather than dramatically outperforming the base model's underlying legal knowledge on open-ended pairwise comparison. DPO alignment targets preference win rate directly.

Safety

Red-Team Evaluation

Evaluated against a 50-prompt adversarial suite drawn from JailbreakBench, AdvBench, PAIR, and the DAN archive. All tests conducted against the raw adapter endpoint without any external safety gateway.

Metric	Value	Note
Adversarial block rate (raw adapter)	0%	45 attack prompts
Benign control pass rate	100%	5 benign controls

The adapter inherits the RLHF safety alignment of the base Llama-3.1-8B-Instruct model, but blocked none of the 45 attack prompts at the raw adapter level (0% adversarial block rate), which is consistent with most LoRA adapters that have not undergone red-team-targeted DPO. A 3-layer safety gateway (regex shields → Meta Prompt Guard 2 → Meta Llama Guard 3) is available via `pylox deploy --with-safety` and substantially increases block rate. Deployers handling sensitive legal content or untrusted inputs should enable the gateway.

Limitations

Dataset size: The adapter was trained on 321 examples after filtering from CUAD v2. Clause types underrepresented in CUAD v2 (e.g., arbitration clauses, GDPR data processing addenda) may produce lower-quality outputs.
Jurisdiction: The adapter has not been tested on contracts governed by non-US law. UK, EU, and civil-law contract conventions may degrade performance.
Length: Trained with max_seq_length=2048. Contracts longer than 2,048 tokens must be chunked before inference.
Language: English-only. Performance on bilingual contracts is not evaluated.
No ground-truth extraction accuracy: Clause extraction quality was evaluated via LLM judge rather than expert human annotation.
Academic benchmarks: The MMLU-Pro, HellaSwag, TruthfulQA, and MMLU law scores reported above were each measured on only 50 samples per subtask and should be read as coarse estimates; the full MMLU suite was not run.
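
For the length limitation above, one common approach is to split a tokenized contract into overlapping windows and query each window separately. The sketch below works on a plain list of token IDs, which would normally come from the base model's tokenizer. The window size of 1,024 and overlap of 128 are arbitrary illustrative values chosen to leave room for the prompt and the answer within the 2,048-token limit, and the function name is hypothetical.

```python
def chunk_token_ids(token_ids, max_tokens=1024, overlap=128):
    """Split token IDs into windows of at most max_tokens, sharing `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        # Stop once a window reaches the end, to avoid a redundant trailing chunk.
        if start + max_tokens >= len(token_ids):
            break
    return chunks


# Stand-in for a tokenized 2,500-token contract.
token_ids = list(range(2500))
chunks = chunk_token_ids(token_ids)
print([len(c) for c in chunks])
```

The overlap reduces the chance that a clause is cut at a window boundary, at the cost of some duplicated work; answers from overlapping windows then need to be merged or deduplicated downstream.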

Bias, Fairness, and Ethical Considerations

This adapter produces outputs based on patterns learned from CUAD v2 contract clause examples, which predominantly reflect US commercial contract language (SaaS, IP licensing, employment). It may reflect biases present in that corpus, including an overrepresentation of technology and financial sector contracts. Outputs should never be treated as legal advice or used as the sole basis for contract decisions. Always have a qualified legal professional review outputs before acting on them. The adapter is not suitable for processing attorney-client privileged documents through third-party infrastructure.

Quickstart

PEFT (direct adapter loading)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "emiliogirard/legal-cuad-llama-3.1-8b-lora-dpo"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

prompt = (
    "You are a legal contract analysis assistant.\n\n"
    "Contract excerpt:\n"
    "\"The Licensor may terminate this Agreement immediately upon written notice "
    "if the Licensee materially breaches any provision hereof and fails to cure "
    "such breach within thirty (30) days after receiving written notice.\"\n\n"
    "Question: Does this contract include a termination clause? "
    "If so, what are the conditions for termination?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

vLLM (OpenAI-compatible API)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8010/v1", api_key="none")
response = client.chat.completions.create(
    model="legal-cuad",  # vLLM LoRA mount name
    messages=[
        {"role": "system", "content": "You are a legal contract analysis assistant."},
        {"role": "user", "content": "Does this clause create an indemnification obligation? \"Each party shall indemnify and hold harmless the other party from any claims arising from its own negligence.\""}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

Citation

@misc{girard_legal_cuad_2026,
  author       = {Girard, Emilio},
  title        = {Legal CUAD v2 -- Llama-3.1-8B LoRA (SFT + DPO)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/emiliogirard/legal-cuad-llama-3.1-8b-lora-dpo}}
}

Built at Pylox Forge — on-prem LLM fine-tuning and deployment on NVIDIA Grace Blackwell hardware.
