metadata
license_file: LICENSE.md
library_name: protonx-text-correction
tags:
- text-to-text
language:
- vi
Introduction
ProtonX Legal Text Correction (v1.2-NC)
A specialized Vietnamese text-correction model for high-accuracy OCR post-processing, built primarily to fix noisy PaddleOCR output in enterprise and legal workflows.
Best Use Case (Primary Focus): Fixing PaddleOCR text errors
The model is optimized to clean up real-world OCR mistakes such as:
- missing or incorrect diacritics
- broken word segmentation
- misrecognized legal terms
- punctuation artifacts
- formatting inconsistencies
The model uses a Seq2Seq Transformer architecture and was trained on 70,000 correction pairs, 20,000 of them produced manually by expert Vietnamese annotators, covering:
- official legal documents
- OCR outputs from scanned PDFs
- colloquial → standardized legal text
Strict constraints ensure:
- correction ≠ rewriting
- meaning of legal text must never change
- no hallucination / no added legal terms
- confidence-based correction
- no paraphrasing
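A downstream pipeline can check the "correction ≠ rewriting" constraint after the fact. The sketch below is not part of the model or its library: `strip_diacritics` and `only_diacritics_changed` are hypothetical helper names, and the check assumes a pure diacritic-restoration case in which word segmentation and punctuation are unchanged. It strips Vietnamese diacritics from both strings and compares the remaining word sequences.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks, mapping đ/Đ to d/D."""
    # đ and Đ have no Unicode decomposition, so map them explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark left after decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def only_diacritics_changed(source: str, corrected: str) -> bool:
    """Return True when the two strings differ only in diacritics and letter case."""
    def normalize(s: str) -> list:
        return strip_diacritics(s).lower().split()

    return normalize(source) == normalize(corrected)
```

A `False` result does not by itself mean the correction is wrong, since the model also repairs segmentation, punctuation and misrecognized characters; it only flags outputs that deserve a closer look.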
License
This model is released under the ProtonX Text Correction Model License (v1.2-NC).
See LICENSE.md for full terms, conditions, and usage restrictions.
Highlights
- ROUGE-L: 98.44
- Achieved on the ProtonX Legal Text Correction Validation Dataset, which is planned for a future public release.
Quick Usage with Transformers
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "protonx-models/protonx-legal-tc"

# Load the tokenizer and the seq2seq correction model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Run on GPU when available and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Undiacritized input text to correct
examples = [
    "can cu bo luat lao dong 2019 va cac van ban huong dan thuc hien.",
]

for text in examples:
    # Training used 32 input tokens (see Training Details), so inputs
    # much longer than that fall outside the trained range.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
    ).to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            num_beams=10,
            max_new_tokens=32,  # matches the 32-token output budget
            length_penalty=1.0,
            early_stopping=True,
            repetition_penalty=1.2,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode the highest-scoring beam
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Input: {text}")
    print(f"Output: {result}")
    print("-" * 30)
Benchmark
ProtonX Legal Text Correction Validation Dataset
| Metric | Score |
|---|---|
| ROUGE-L | 98.44 |
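For readers unfamiliar with the metric, ROUGE-L scores a candidate by the longest common subsequence (LCS) it shares with the reference. The sketch below computes an LCS-based F1 over whitespace tokens. This card does not state which ROUGE implementation, tokenization or F-measure weighting produced the reported figure, so the code is an illustration of the metric only; `lcs_length` and `rouge_l_f1` are hypothetical names.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based F1 between two strings, on a 0 to 1 scale."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

The function returns a value between 0 and 1; the table above appears to use a 0 to 100 scale, so multiply by 100 before comparing.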
Training Details
- Model: Seq2Seq Transformer
- Legal-domain augmentation
- Beam search decoding
- Max sequence length: 64 tokens total (32 tokens for input and 32 tokens for output).
- High-precision diacritic + punctuation restoration
Domain Coverage:
- Government decrees
- Resolutions
- Contract clauses
- Administrative procedures
- OCR-normalized scanned documents
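Because Training Details lists a 32-token input limit, longer OCR passages probably need to be split before correction. The following is a minimal sketch of one way to do that, not a method documented by the model's authors: `chunk_by_token_budget` is a hypothetical helper that greedily packs whitespace-separated words into chunks, and `count_tokens` is any caller-supplied function that returns the token count of a string.

```python
from typing import Callable, List


def chunk_by_token_budget(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 32,
) -> List[str]:
    """Greedily split text on whitespace into chunks within a token budget."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With the tokenizer from Quick Usage, a suitable counter could be `lambda s: len(tokenizer(s)["input_ids"])`; that count includes any special tokens the tokenizer adds. A single word that exceeds the budget is kept as its own chunk. Splitting on whitespace also discards line breaks and can cut a clause in two, which may affect correction quality near chunk boundaries.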
Example Outputs
Input:
Cǎn cú Hién pháp nuóc Cōng hòa xā hi chù nghia Viēt Nam;
Output:
Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;
Use Cases
- Legal OCR text normalization
- Standardizing government documents
- Contract proofreading
- Preprocessing for legal RAG systems
- Administrative workflow automation
- Compliance document processing
Limitations
- Does not paraphrase or rewrite legal clauses
- Cannot restore missing semantic content
- Primarily optimized for Vietnamese
- Not designed for informal social media slang
Future Work
- Achieving even higher ROUGE-L performance on legal-domain datasets
- Extending maximum sequence length from 64 to 256 tokens for long-clause legal documents
Acknowledgments
Thanks to: