--- license: apache-2.0 base_model: Qwen/Qwen3.5-9B tags: - legal - entity-extraction - lora - peft - contract-analysis language: - en datasets: - theatticusproject/cuad-qa pipeline_tag: text-generation --- # OpenBrief Nano — Legal Entity Extraction Model A LoRA fine-tuned adapter on top of [Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) for extracting structured entities from legal contract text. Built as part of [OpenBrief](https://github.com/illuminator22/openbrief), an open-source legal document intelligence platform by BrainX Corp. ## What It Does Given a chunk of legal contract text, OpenBrief Nano returns structured JSON with 8 entity types: - **party** — named legal entities, organizations, persons - **date** — agreement, effective, expiration dates (with subtypes) - **deadline** — notice periods, renewal terms, payment windows (with subtypes) - **jurisdiction** — venue, forum, arbitration seat - **obligation** — duties imposed on a party - **monetary_amount** — dollar values, liability caps, liquidated damages (with subtypes) - **governing_law** — applicable law - **termination_right** — rights to terminate (with subtypes) ## Usage ```python from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel import torch, json # Load base model + adapter base_model = AutoModelForCausalLM.from_pretrained( "Qwen/Qwen3.5-9B", torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True, ) model = PeftModel.from_pretrained(base_model, "illuminator22/openbrief-nano-qwen35-entity-extractor") model.eval() tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True) # Build prompt chunk = "Your legal text here..." prompt = ( "<|im_start|>system\n" "You are OpenBrief Nano, a legal entity extraction model. " "Return only valid JSON. Do not include explanations, commentary, or hidden reasoning.<|im_end|>\n" "<|im_start|>user\n" "Extract entities from this legal text:\n\n" f"{chunk}<|im_end|>\n" "<|im_start|>assistant\n" ) inputs = tokenizer(prompt, return_tensors="pt").to(model.device) with torch.no_grad(): outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id) gen_tokens = outputs[0][inputs["input_ids"].shape[1]:] result = json.loads(tokenizer.decode(gen_tokens, skip_special_tokens=True)) print(json.dumps(result, indent=2)) ``` ## Evaluation: OpenBrief Nano vs GPT-5.4 Nano ### Test Setup - 50 chunks from 51 held-out CUAD contracts (contract-level split, no data leakage) - 25 chunks manually reviewed with gold labels for per-type accuracy - Both models evaluated on identical chunks ### Overall Metrics | Metric | OpenBrief Nano | GPT-5.4 Nano (prompted) | |---|---|---| | Valid JSON rate | 100% | 100% | | Total entities extracted (45 comparable chunks) | 407 | 386 | ### Per-Type Extraction Volume (45 comparable chunks) | Entity Type | OpenBrief Nano | GPT-5.4 Nano | Winner | |---|---|---|---| | party | 196 | 146 | **Nano** | | date | 67 | 57 | **Nano** | | deadline | 30 | 13 | **Nano** | | jurisdiction | 1 | 13 | GPT-Nano | | obligation | 94 | 132 | GPT-Nano | | monetary_amount | 6 | 12 | GPT-Nano | | governing_law | 3 | 3 | Tie | | termination_right | 10 | 10 | Tie | ### Per-Type Accuracy (25 gold-reviewed chunks) Based on manual expert review of 25 held-out chunks, comparing each model's output against human-verified gold labels: | Entity Type | Nano Wins | GPT-Nano Wins | Tie | Assessment | |---|---|---|---|---| | party | 12 | 4 | 9 | **Nano stronger** — finds more real parties, occasionally over-extracts | | date | 5 | 3 | 17 | **Nano slightly better** — sometimes over-extracts temporal references | | deadline | 7 | 1 | 17 | **Nano clearly stronger** — catches notice periods and payment windows GPT misses | | jurisdiction | 0 | 5 | 20 | **GPT-Nano stronger** — Nano's biggest weakness | | obligation | 2 | 6 | 17 | **GPT-Nano stronger** — finds more, but sometimes misclassifies rights as duties | | monetary_amount | 3 | 2 | 20 | **Roughly even** — both sparse | | governing_law | 0 | 0 | 25 | **Tied** — both correct when present | | termination_right | 5 | 2 | 18 | **Nano stronger** — catches termination rights GPT misses | ### Key Findings **OpenBrief Nano wins on:** party extraction (12-4), deadline detection (7-1), termination rights (5-2). These are high-frequency, CUAD-anchored entity types where training data was strongest. **GPT-5.4 Nano wins on:** jurisdiction (0-5), obligation volume (2-6). Jurisdiction was fully GPT-assisted during training with limited examples. Obligation extraction finds fewer entities but with better party attribution. **Both models struggle with:** jurisdiction in general (sparse in contracts), monetary amounts (often redacted in CUAD source data). **Qualitative observations:** - Nano produces better party attribution in obligations (uses defined party names) - GPT-Nano occasionally extracts signature blocks as obligations and documents as parties - Nano occasionally over-extracts section titles and generic temporal references as entities - Both models correctly return empty arrays when no entities are present ### Cost Comparison | Model | Cost per chunk | Monthly cost (1000 docs) | Self-hosted | |---|---|---|---| | OpenBrief Nano | **$0.00** | **$0.00** | Yes | | GPT-5.4 Nano | ~$0.002 | ~$40-60 | No | | GPT-5.4 | ~$0.02 | ~$400-600 | No | ## Training Details - **Base model:** Qwen/Qwen3.5-9B - **Method:** LoRA (bf16, not QLoRA — Qwen3.5 architecture has higher quantization loss at 4-bit) - **Training data:** 708 contract chunks from 407 CUAD contracts - **Labeling:** GPT-5.4 with CUAD expert annotations as anchors (hybrid silver labeling) - **LoRA config:** rank=32, alpha=64, dropout=0.05 - **Target modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj - **Epochs:** 3 - **Learning rate:** 2e-4 (cosine schedule) - **Final training loss:** 0.62 - **Hardware:** NVIDIA A100-SXM4-80GB (Google Colab) - **Trainable parameters:** 58M / 9B (0.65%) ### Training Data Pipeline 1. 510 CUAD commercial contracts split at contract level (407 train / 51 val / 51 test) 2. Contracts chunked into 512-token windows with 50-token overlap 3. CUAD expert annotations matched to chunks as anchors for 6 entity types (party, date, governing_law, termination_right, deadline, monetary_amount) 4. GPT-5.4 used for schema conversion, gap-filling, and labeling 2 additional types (obligation, jurisdiction) 5. Post-processing cleanup to remove generic party references and noisy labels 6. Balanced sampling: 558 anchor-rich chunks + 150 no-anchor chunks = 708 training examples ### Label Reliability Tiers | Tier | Entity Types | Label Source | Quality | |---|---|---|---| | Tier 1 (high confidence) | party, date, governing_law | CUAD expert annotations + GPT formatting | High | | Tier 2 (anchor-partial) | deadline, termination_right, monetary_amount | CUAD clause anchors + GPT extraction | Medium-High | | Tier 3 (GPT-assisted) | obligation, jurisdiction | GPT-5.4 only | Medium | ## Limitations - **Jurisdiction extraction is weak** (0 wins vs GPT-Nano in gold review) — limited training signal - **Obligation extraction lags GPT** — fewer examples found, though party attribution is often better - **English only** — trained exclusively on English commercial contracts from CUAD - **Contract-focused** — not trained on court opinions, legislation, or other legal document types - **Silver labels** — training data labeled by GPT-5.4 with CUAD anchors. Model inherits systematic biases from the labeling model - **Occasional over-extraction** — may extract section titles, generic temporal references, or product names as entities - **Recommended max_new_tokens: 1024** — use 1024 or higher for entity-dense chunks to avoid output truncation ## Known Issues and Future Improvements - Add more jurisdiction training examples from contracts with explicit forum selection clauses - Add court case chunks to expand beyond commercial contracts - Post-processing regex to strip think blocks as safety net (model was trained without thinking mode) ## Project OpenBrief Nano is part of [OpenBrief](https://github.com/illuminator22/openbrief), an open-source legal document intelligence platform featuring agentic RAG, persistent memory, multi-provider LLM support, and transparent evaluation metrics. **Author:** Ivan Arshakyan / BrainX Corp ## Citation ``` @misc{openbrief-nano-2026, title={OpenBrief Nano: LoRA Fine-Tuned Legal Entity Extraction on Qwen3.5-9B}, author={Ivan Arshakyan}, year={2026}, publisher={HuggingFace}, url={https://huggingface.co/illuminator22/openbrief-nano-qwen35-entity-extractor} } ```