Legal CUAD v2 – Llama-3.1-8B LoRA (SFT + DPO)
Model: emiliogirard/legal-cuad-llama-3.1-8b-lora-dpo
Base model: meta-llama/Llama-3.1-8B-Instruct
Method: Supervised Fine-Tuning (SFT) + Direct Preference Optimization (DPO)
Domain: Legal contract clause identification and analysis
Hardware: NVIDIA Grace Blackwell GB10 (DGX Spark, 128 GB unified memory)
Abstract
We present a parameter-efficient LoRA adapter fine-tuned on the CUAD v2 contract understanding dataset for the task of legal clause identification and extraction. Starting from meta-llama/Llama-3.1-8B-Instruct, we apply QLoRA (NF4, rank 32) across all seven linear projection layers via three epochs of supervised fine-tuning, followed by one epoch of DPO alignment on 132 curated preference pairs. The resulting adapter achieves strong domain benchmark performance on MMLU professional law subtasks (jurisprudence 84%, international law 80%) while serving at 41.8 tok/s on a single NVIDIA Grace Blackwell GB10 with NVFP4 quantization and EAGLE-3 speculative decoding, approximately 30× cheaper per token than GPT-4o at equivalent output quality on contract clause analysis tasks.
Model Details
| Property | Value |
|---|---|
| Base model | meta-llama/Llama-3.1-8B-Instruct |
| Adapter type | LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.1 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | |
| Training method | SFT (3 epochs) + DPO (1 epoch) |
| Training quantization | NF4 (bitsandbytes QLoRA, bnb_4bit_compute_dtype=bfloat16, double quant) |
| Inference quantization | NVFP4 via vLLM |
| Speculative decoding | EAGLE-3 (RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3, k=5) |
| License | Llama 3.1 Community License |
| Release date | 2026-05 |
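The trainable-parameter cell above is empty in the card. As a rough cross-check, the count can be estimated from the LoRA rank and target modules in the table. This is a minimal sketch, not a reported figure: the layer shapes (hidden size 4096, MLP size 14336, 32 layers, 8 KV heads of dimension 128) are assumptions taken from the public Llama-3.1-8B configuration, not from this card.

```python
# Estimate LoRA trainable parameters for rank-32 adapters on the seven
# projection layers listed in the Model Details table.
# ASSUMPTION: shapes come from the public Llama-3.1-8B config, not this card.
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # grouped-query attention: 8 KV heads x head_dim 128
NUM_LAYERS = 32
RANK = 32         # LoRA rank from the table

# (in_features, out_features) for each target module
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
# per layer: 2,621,440  total: 83,886,080
```

Under those assumptions the adapter has roughly 84M trainable parameters, about 1% of the 8B base model.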
Intended Use
Primary use cases
This adapter is intended for legal technology applications requiring contract clause identification, extraction, and analysis. Suitable tasks include:
- Locating specific clause types (limitation of liability, termination rights, governing law, IP ownership, indemnification) in commercial contracts
- Extracting the text and logical conditions of identified clauses
- Answering structured questions about contract provisions given an excerpt
- Assisting legal tech developers and compliance teams building contract review pipelines
Target users
Legal tech developers, compliance automation teams, and researchers building contract review pipelines. The adapter is available for evaluation and research use under the Llama 3.1 Community License.
Out of scope
This adapter must not be used as a substitute for licensed legal counsel or as the sole basis for legal decisions. It has not been evaluated on non-English contracts, government contracts, or consumer agreements. It is not a compliance auditor and should not be used to certify contract legality. It was trained on US commercial contract language and may underperform on contracts governed by other jurisdictions.
Training Data
| Property | Value |
|---|---|
| Dataset | CUAD v2 (Contract Understanding Atticus Dataset) |
| HF repository | theatticusproject/cuad-v2 |
| License | CC BY 4.0 |
| Source rows | 3,500 contract clause question-answer pairs |
| After filtering | 321 training examples |
| DPO pairs | 132 preference pairs |
| Eval split | 5% held out (175 rows) |
Preprocessing methodology
Raw CUAD v2 examples undergo the following pipeline before training:
- Near-duplicate removal: MinHash with Jaccard similarity threshold 0.92; removes paraphrased duplicates while preserving clause-type diversity
- Quality scoring: Each example is scored from 1 to 5 by a local 120B judge model evaluating response completeness, factual grounding, and legal precision; examples scoring below 3/5 are discarded
- PII redaction: Named entity recognition pass removes party names, addresses, and identifying details not relevant to clause structure
- DPO pair construction: 132 preference pairs generated by sampling two responses per prompt from the SFT model (temperature 0.7 and 0.0), then ranked by the 120B judge; used for the DPO alignment pass
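The near-duplicate step can be illustrated with a small pure-Python MinHash. This is a sketch of the general technique, not the pipeline's actual code: the card states only the 0.92 Jaccard threshold, so the shingle size, the number of hash functions, and the helper names `shingles`, `minhash`, and `dedupe` are all hypothetical.

```python
import hashlib
import re

NUM_PERM = 128    # number of hash functions (assumption, not from the card)
SHINGLE = 3       # word 3-gram shingles (assumption, not from the card)
THRESHOLD = 0.92  # Jaccard threshold stated in the preprocessing list


def shingles(text: str, k: int = SHINGLE) -> set[str]:
    # Lowercased word k-grams; short texts collapse to a single shingle.
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(text: str, num_perm: int = NUM_PERM) -> list[int]:
    # One salted hash per "permutation"; keep the minimum over all shingles.
    items = [s.encode() for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    # The fraction of matching signature slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedupe(texts: list[str], threshold: float = THRESHOLD) -> list[str]:
    # Keep a text only if it is below the threshold against everything kept.
    kept, signatures = [], []
    for text in texts:
        sig = minhash(text)
        if all(estimated_jaccard(sig, s) < threshold for s in signatures):
            kept.append(text)
            signatures.append(sig)
    return kept


examples = [
    "Either party may terminate this Agreement upon thirty days written notice.",
    "Either party may terminate this Agreement upon thirty days written notice.",
    "This Agreement shall be governed by the laws of the State of Delaware.",
]
print(len(dedupe(examples)))
```

The pairwise comparison here is quadratic in the number of kept examples, which is acceptable for a few thousand rows; a larger corpus would normally add locality-sensitive hashing on top of the signatures.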
Training Procedure
SFT Hyperparameters
| Hyperparameter | Value |
|---|---|
| Learning rate | 2e-4 |
| LR schedule | Cosine |
| Warmup ratio | 0.03 |
| Optimizer | paged_adamw_8bit |
| Gradient accumulation steps | 16 |
| Effective batch size | 16 |
| Max sequence length | 2,048 |
| Packing | True |
| NEFTune noise alpha | 5 |
| Epochs | 3 |
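To make the schedule concrete, the sketch below evaluates a cosine learning-rate curve with linear warmup using the values from the table. It assumes the usual form of such a schedule (linear warmup from 0, then cosine decay to 0); the card does not describe the scheduler beyond "Cosine". The total step count is hypothetical, because with packing the real number of optimizer steps depends on token counts the card does not report. The per-device batch size of 1 is inferred from the table (effective batch 16 with 16 accumulation steps), not stated.

```python
import math

PEAK_LR = 2e-4        # SFT learning rate from the table
WARMUP_RATIO = 0.03   # warmup ratio from the table
GRAD_ACCUM = 16       # gradient accumulation steps from the table
EFFECTIVE_BATCH = 16  # effective batch size from the table

# Inferred, not stated in the card: sequences per forward pass.
per_device_batch = EFFECTIVE_BATCH // GRAD_ACCUM


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(WARMUP_RATIO * total_steps)
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 100  # hypothetical value for illustration only
for step in (0, 1, 3, 50, 100):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```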
DPO Hyperparameters
| Hyperparameter | Value |
|---|---|
| Beta | 0.1 |
| Learning rate | 5e-6 |
| Epochs | 1 |
| Batch size | 1 |
| Gradient accumulation steps | 8 |
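For readers unfamiliar with the role of beta, the standard DPO objective for a single preference pair can be written in a few lines. This is a sketch of the published DPO loss with beta = 0.1 from the table; it is not taken from the training code, and the log-probability values in the example are made up.

```python
import math

BETA = 0.1  # from the DPO hyperparameter table


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = BETA) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy prefers the chosen response over the
    # rejected one, relative to the frozen reference (here, the SFT model).
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable softplus(-z) == -log(sigmoid(z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Made-up sequence log-probabilities for illustration.
print(dpo_loss(-10.0, -20.0, -12.0, -15.0))  # policy favors chosen: below log(2)
print(dpo_loss(0.0, 0.0, 0.0, 0.0))          # no preference yet: exactly log(2)
```

A small beta such as 0.1 keeps the penalty for drifting from the reference model weak, so the loss moves slowly as the margin grows.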
Infrastructure
| Property | Value |
|---|---|
| Hardware | NVIDIA Grace Blackwell GB10 (DGX Spark) |
| Unified memory | 128 GB |
| Frameworks | PyTorch, Hugging Face transformers, peft 0.19.1, trl, bitsandbytes |
Evaluation
Academic Benchmarks
Evaluated via lm-evaluation-harness 0.4.x (local-completions model class) against the live vLLM NVFP4 + EAGLE-3 endpoint. Tokenizer: nvidia/Llama-3.1-8B-Instruct-NVFP4. Limit: 50 samples per subtask. Date: 2026-05-01.
| Task | Metric | Score | Samples |
|---|---|---|---|
| MMLU-Pro (aggregate, 14 subjects) | exact_match | 41.9% | 700 |
| MMLU-Pro – Math | exact_match | 70.0% | 50 |
| MMLU-Pro – Biology | exact_match | 62.0% | 50 |
| MMLU-Pro – Economics | exact_match | 52.0% | 50 |
| MMLU-Pro – Health | exact_match | 44.0% | 50 |
| MMLU-Pro – Psychology | exact_match | 44.0% | 50 |
| MMLU-Pro – Computer Science | exact_match | 44.0% | 50 |
| MMLU-Pro – Other | exact_match | 46.0% | 50 |
| MMLU-Pro – Business | exact_match | 40.0% | 50 |
| MMLU-Pro – Philosophy | exact_match | 40.0% | 50 |
| MMLU-Pro – Engineering | exact_match | 34.0% | 50 |
| MMLU-Pro – Physics | exact_match | 34.0% | 50 |
| MMLU-Pro – History | exact_match | 26.0% | 50 |
| MMLU-Pro – Law | exact_match | 28.0% | 50 |
| MMLU-Pro – Chemistry | exact_match | 22.0% | 50 |
| HellaSwag | acc_norm | 76.0% | 50 |
| TruthfulQA MC1 | acc | 30.0% | 50 |
Domain Benchmarks
Evaluated via lm-evaluation-harness against MMLU professional law subtasks (50 samples each).
| Task | Metric | Score | Samples |
|---|---|---|---|
| MMLU – Professional Law | acc_norm | 50.0% | 50 |
| MMLU – Jurisprudence | acc_norm | 84.0% | 50 |
| MMLU – International Law | acc_norm | 80.0% | 50 |
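Because each subtask is limited to 50 samples, the scores above carry substantial sampling uncertainty. The sketch below estimates an approximate 95% margin with the normal approximation to the binomial. It is an illustration added here, not part of the reported evaluation, and the function name `half_width_95` is hypothetical.

```python
import math


def half_width_95(score: float, n: int) -> float:
    # Normal-approximation 95% margin for a proportion measured on n samples.
    return 1.96 * math.sqrt(score * (1.0 - score) / n)


# Scores and sample counts from the Domain Benchmarks table.
scores = {
    "Professional Law": 0.50,
    "Jurisprudence": 0.84,
    "International Law": 0.80,
}
for task, score in scores.items():
    print(f"{task}: {score:.0%} +/- {half_width_95(score, 50):.1%}")
```

At n = 50 the margin is on the order of 10 to 14 percentage points, so differences between subtasks of a few points should not be over-interpreted.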
Inference Performance
Measured against a live vLLM endpoint (NVFP4 + EAGLE-3, LoRA hot-loaded) on NVIDIA Grace Blackwell GB10. Target response length: 150 tokens.
| Metric | Value |
|---|---|
| Throughput – single user (mean) | 41.8 tok/s |
| Throughput – single user (peak) | 48.9 tok/s |
| Throughput – concurrent batch-8 (aggregate) | 283.6 tok/s |
| TTFT p50 | 160.4 ms |
| TTFT p95 | 185.7 ms |
| Total latency p50 (150-token response) | 3,556.1 ms |
| Total latency p95 (150-token response) | 4,275.2 ms |
Cost Analysis
The self-hosted figure counts electricity only, at $0.05/hr (Montreal hydro). Hardware cost is excluded, on the premise that compute cost approaches $0 once the hardware is amortized.
| Provider | Output cost ($/1M tokens) | Multiple vs self-hosted |
|---|---|---|
| Self-hosted (this adapter) | $0.3323 | baseline |
| GPT-4o | $10.00 | 30.1× more expensive |
| Claude Haiku 4.5 | $5.00 | 15.0× more expensive |
| GPT-4o-mini | $0.60 | 1.8× more expensive |
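The card does not spell out how the self-hosted figure was derived. The sketch below assumes it is the electricity rate multiplied by the time needed to generate one million tokens at the single-user mean throughput of 41.8 tok/s; under that assumption the arithmetic matches the table. The API prices are copied from the table, not looked up.

```python
ELECTRICITY_USD_PER_HOUR = 0.05  # from the cost analysis text
SINGLE_USER_TOK_PER_S = 41.8     # mean throughput, Inference Performance table
BATCH8_TOK_PER_S = 283.6         # aggregate batch-8 throughput, same table


def usd_per_million_tokens(tok_per_s: float,
                           usd_per_hour: float = ELECTRICITY_USD_PER_HOUR) -> float:
    # Hours needed to emit 1M tokens, times the hourly electricity cost.
    hours = 1_000_000 / tok_per_s / 3600
    return hours * usd_per_hour


self_hosted = usd_per_million_tokens(SINGLE_USER_TOK_PER_S)
print(f"self-hosted: ${self_hosted:.4f} per 1M tokens")  # $0.3323

# Output prices in $/1M tokens, as listed in the table.
api_prices = {"GPT-4o": 10.00, "Claude Haiku 4.5": 5.00, "GPT-4o-mini": 0.60}
for name, price in api_prices.items():
    print(f"{name}: {price / self_hosted:.1f}x")

# The same formula at the batch-8 aggregate throughput, for comparison.
print(f"batch-8: ${usd_per_million_tokens(BATCH8_TOK_PER_S):.4f} per 1M tokens")
```

Since the table's baseline matches the single-user rate, the multiples are conservative for batched serving and would grow at the batch-8 aggregate throughput.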
LLM Judge (Pairwise Win Rate)
Pairwise comparison scored by a local gpt-oss-120b TRT-LLM judge. The judge receives a prompt plus two responses (finetune vs base model, order randomized) and picks the better one. Base model: meta-llama/Llama-3.1-8B-Instruct loaded in NF4 via bitsandbytes + PEFT. Evaluated on 100 held-out prompts from formatted_eval.jsonl. Date: 2026-05-01.
Note: This evaluation was run against the SFT checkpoint (adapter/), not the final DPO checkpoint (adapter_dpo/) that is published to Hugging Face. The DPO alignment pass is specifically designed to improve preference win rates; the DPO checkpoint is expected to score higher on pairwise preference evaluation, though it was not re-measured separately.
| Metric | Value |
|---|---|
| Prompts evaluated | 100 |
| Finetune wins | 15 (15%) |
| Base wins | 74 (74%) |
| Ties | 11 (11%) |
The SFT checkpoint win rate of 15% is consistent with expectations for a legal domain SFT adapter applied to a base model that already performs strongly on legal Q&A tasks (jurisprudence 84%, international law 80% on MMLU). The SFT pass primarily shifts clause extraction structure and reasoning chain format, rather than dramatically outperforming the base model's underlying legal knowledge on open-ended pairwise comparison. DPO alignment targets preference win rate directly.
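The counts in the table can be summarized in two ways: win rate over all prompts, and win rate over decisive (non-tie) comparisons. The sketch below also runs a two-sided sign test on the decisive comparisons. The test is an illustration added here, not something the card reports, and the function name `sign_test_p` is hypothetical.

```python
from math import comb

# Counts from the pairwise win-rate table (SFT checkpoint vs base model).
wins, losses, ties = 15, 74, 11

total = wins + losses + ties
win_rate = wins / total                    # ties counted as non-wins
decisive_win_rate = wins / (wins + losses)  # ties excluded


def sign_test_p(wins: int, losses: int) -> float:
    # Two-sided exact sign test: probability of a split at least this
    # lopsided if each decisive comparison were a fair coin flip.
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(f"win rate (all prompts): {win_rate:.1%}")
print(f"win rate (ties excluded): {decisive_win_rate:.1%}")
print(f"sign test p-value: {sign_test_p(wins, losses):.2e}")
```

Excluding ties moves the win rate from 15% to roughly 17%, and the sign test indicates that the judge's preference for the base model on this set is very unlikely to be a chance split.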
Safety
Red-Team Evaluation
Evaluated against a 50-prompt adversarial suite drawn from JailbreakBench, AdvBench, PAIR, and the DAN archive. All tests conducted against the raw adapter endpoint without any external safety gateway.
| Metric | Value | Note |
|---|---|---|
| Adversarial block rate (raw adapter) | 0% | 45 attack prompts |
| Benign control pass rate | 100% | 5 benign controls |
The adapter inherits the safety alignment of the base Llama-3.1-8B-Instruct model via RLHF, but has a 0% adversarial block rate at the raw adapter level, consistent with most LoRA adapters that have not undergone red-team-targeted DPO. A 3-layer safety gateway (regex shields → Meta Prompt Guard 2 → Meta Llama Guard 3) is available via pylox deploy --with-safety and substantially increases block rate. Deployers handling sensitive legal content or untrusted inputs should enable the gateway.
Limitations
- Dataset size: The adapter was trained on 321 examples after filtering from CUAD v2. Clause types underrepresented in CUAD v2 (e.g., arbitration clauses, GDPR data processing addenda) may produce lower-quality outputs.
- Jurisdiction: The adapter has not been tested on contracts governed by non-US law. UK, EU, and civil-law contract conventions may degrade performance.
- Length: Trained with max_seq_length=2,048. Contracts longer than 2,048 tokens must be chunked before inference.
- Language: English-only. Performance on bilingual contracts is not evaluated.
- No expert-annotated ground truth: Clause extraction quality was evaluated via LLM judge rather than expert human annotation.
- Academic benchmarks: General MMLU was not collected in this run. MMLU-Pro, HellaSwag, and TruthfulQA MC1 are reported above on 50 samples per subtask, alongside the domain-specific legal MMLU subtask scores.
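The length limitation means long contracts have to be split before inference. The sketch below shows one way to do it over token IDs with overlapping windows. It is an assumption-laden illustration, not the author's pipeline: the function name `chunk_token_ids`, the 512-token reserve for the prompt template and generated answer, and the 128-token overlap are all hypothetical choices. In practice the token IDs would come from the model tokenizer; plain integers stand in for them here.

```python
def chunk_token_ids(token_ids: list[int], max_len: int = 2048,
                    reserve: int = 512, overlap: int = 128) -> list[list[int]]:
    """Split token IDs into overlapping windows that fit the context budget.

    reserve: tokens kept free for instructions and the generated answer
    overlap: tokens shared by consecutive windows so a clause that straddles
             a boundary still appears whole in at least one window
    """
    window = max_len - reserve
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    if not token_ids:
        return []
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the document
    return chunks


# Stand-in for a tokenized 5,000-token contract.
ids = list(range(5000))
chunks = chunk_token_ids(ids)
print(len(chunks), [len(c) for c in chunks])
```

Answers from overlapping windows may repeat the same clause, so a downstream step would normally merge or deduplicate extractions.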
Bias, Fairness, and Ethical Considerations
This adapter produces outputs based on patterns learned from CUAD v2 contract clause examples, which predominantly reflect US commercial contract language (SaaS, IP licensing, employment). It may reflect biases present in that corpus, including an overrepresentation of technology and financial sector contracts. Outputs should never be treated as legal advice or used as the sole basis for contract decisions. Always have a qualified legal professional review outputs before acting on them. The adapter is not suitable for processing attorney-client privileged documents through third-party infrastructure.
Quickstart
PEFT (direct adapter loading)
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "emiliogirard/legal-cuad-llama-3.1-8b-lora-dpo"

# Load the base model in bfloat16, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

prompt = (
    "You are a legal contract analysis assistant.\n\n"
    "Contract excerpt:\n"
    "\"The Licensor may terminate this Agreement immediately upon written notice "
    "if the Licensee materially breaches any provision hereof and fails to cure "
    "such breach within thirty (30) days after receiving written notice.\"\n\n"
    "Question: Does this contract include a termination clause? "
    "If so, what are the conditions for termination?"
)

# Greedy decoding; the decoded string contains the prompt followed by the answer.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
vLLM (OpenAI-compatible API)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8010/v1", api_key="none")
response = client.chat.completions.create(
    model="legal-cuad",  # vLLM LoRA mount name
    messages=[
        {"role": "system", "content": "You are a legal contract analysis assistant."},
        {"role": "user", "content": "Does this clause create an indemnification obligation? \"Each party shall indemnify and hold harmless the other party from any claims arising from its own negligence.\""},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
Citation
@misc{girard_legal_cuad_2026,
  author = {Girard, Emilio},
  title = {Legal CUAD v2 -- Llama-3.1-8B LoRA (SFT + DPO)},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/emiliogirard/legal-cuad-llama-3.1-8b-lora-dpo}}
}
Built at Pylox Forge: on-prem LLM fine-tuning and deployment on NVIDIA Grace Blackwell hardware.