license_file: LICENSE.md
library_name: protonx-text-correction
tags:
- text-to-text
language:
- vi
Introduction
ProtonX Text Correction (v1.3-NC)
ProtonX Text Correction is a specialized Vietnamese text correction model built for high-accuracy normalization of legal and enterprise text. It is optimized for OCR post-processing (including PaddleOCR outputs) and can also clean broader Vietnamese text through diacritic restoration, segmentation repair, and correction of domain-specific terminology.
It targets real-world OCR mistakes such as:
- missing or incorrect diacritics
- broken word segmentation
- misrecognized legal terms
- punctuation artifacts
- formatting inconsistencies
Built on a Seq2Seq Transformer architecture, the model was trained on 70,000 correction pairs, 20,000 of which were manually labeled by expert Vietnamese annotators. The training data covers:
- official legal documents
- OCR outputs from scanned PDFs
- colloquial → standardized legal text
The model operates under strict constraints:
- correction ≠ rewriting
- the meaning of legal text must never change
- no hallucination and no added legal terms
- confidence-based correction
- no paraphrasing
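A caller can also check the "correction, not rewriting" constraint after generation. The following is a minimal sketch of one possible heuristic, not part of the ProtonX release: it folds both strings to their unaccented base letters and accepts the output only if the two remain nearly identical. The names `strip_diacritics` and `is_conservative_edit` and the 0.9 threshold are assumptions chosen for illustration.

```python
import difflib
import unicodedata


def strip_diacritics(text: str) -> str:
    """Fold Vietnamese text to base letters so only the 'skeleton' is compared."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "đ"/"Đ" have no canonical decomposition, so they are mapped by hand.
    return base.replace("đ", "d").replace("Đ", "D")


def is_conservative_edit(source: str, corrected: str, min_ratio: float = 0.9) -> bool:
    """Return True if `corrected` keeps the unaccented skeleton of `source`.

    `min_ratio` is an illustrative threshold, not a value published by ProtonX.
    """
    a = strip_diacritics(source).lower()
    b = strip_diacritics(corrected).lower()
    return difflib.SequenceMatcher(None, a, b).ratio() >= min_ratio


if __name__ == "__main__":
    src = "can cu bo luat lao dong 2019"
    print(is_conservative_edit(src, "Căn cứ Bộ luật Lao động 2019"))
    print(is_conservative_edit(src, "Theo quy định của pháp luật hiện hành"))
```

Outputs that fail the check can be discarded in favor of the original text or routed to manual review. The threshold would need tuning on real OCR data, since segmentation repair legitimately adds or removes characters.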
License
This model is released under the ProtonX Text Correction Model License (v1.3-NC).
See LICENSE.md for full terms, conditions, and usage restrictions.
Current Version: v1.3
Highlights
- ROUGE-L: Coming soon
- The score will be measured on the ProtonX Legal Text Correction Validation Dataset, which will be published in an upcoming public release.
- Extended maximum sequence length from 32 tokens in v1.2 to 128 tokens in this release.
Quick Usage with Transformers
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "protonx-models/protonx-legal-tc"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Run on GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Unaccented input: the model is expected to restore the diacritics.
examples = [
    "can cu bo luat lao dong 2019 va cac van ban huong dan thuc hien.",
]

for text in examples:
    # Inputs longer than 128 tokens are truncated.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
    ).to(device)

    # Beam search decoding; gradients are not needed at inference time.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            num_beams=10,
            max_new_tokens=128,
            length_penalty=1.0,
            early_stopping=True,
            repetition_penalty=1.2,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Input: {text}")
    print(f"Output: {result}")
    print("-" * 30)
Benchmark
ProtonX Legal Text Correction Validation Dataset
| Metric | Score |
|---|---|
| ROUGE-L | Coming soon |
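For readers unfamiliar with the metric, ROUGE-L scores a candidate against a reference by the length of their longest common subsequence (LCS). The sketch below computes the F1 variant over whitespace-separated tokens. It is an illustration only: the tokenization and the equal weighting of precision and recall are assumptions, and this is not the official ProtonX evaluation script.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "căn cứ bộ luật lao động"
    candidate = "căn cứ bộ luat lao động"  # one token left uncorrected
    print(round(rouge_l_f1(reference, candidate), 4))
```

Because one uncorrected token out of six lowers both precision and recall to 5/6, the metric is sensitive to single missed diacritics in short sentences.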
Training Details
- Model: Seq2Seq Transformer
- Legal-domain augmentation
- Beam search decoding
- Max sequence length: 256 tokens total (128 tokens for input and 128 tokens for output).
- High-precision diacritic + punctuation restoration
Domain Coverage:
- Government decrees
- Resolutions
- Contract clauses
- Administrative procedures
- OCR-normalized scanned documents
Example Outputs
Input:
Cǎn cú Hién pháp nuóc Cōng hòa xā hi chù nghia Viēt Nam;
Output:
Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;
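When reviewing corrections like the one above, it helps to list exactly which words changed. The following is a small illustrative helper built on Python's standard `difflib`. The name `changed_spans` is hypothetical and the word-level comparison is an assumption, not something shipped with the model.

```python
import difflib
from typing import List, Tuple


def changed_spans(source: str, corrected: str) -> List[Tuple[str, str]]:
    """Return (before, after) pairs for every word span that was modified."""
    src, out = source.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, src, out, autojunk=False)
    spans: List[Tuple[str, str]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((" ".join(src[i1:i2]), " ".join(out[j1:j2])))
    return spans


if __name__ == "__main__":
    before = "Cǎn cú Hién pháp nuóc Cōng hòa xā hi chù nghia Viēt Nam;"
    after = "Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;"
    for old, new in changed_spans(before, after):
        print(f"{old!r} -> {new!r}")
```

A reviewer can scan the printed pairs to confirm that every change is a diacritic or spelling repair and that no words were added or dropped.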
Use Cases
- Legal OCR text normalization
- Standardizing government documents
- Contract proofreading
- Preprocessing for legal RAG systems
- Administrative workflow automation
- Compliance document processing
Limitations
- Does not paraphrase or rewrite legal clauses
- Cannot restore missing semantic content
- Primarily optimized for Vietnamese
- Not designed for informal social media slang
Future Work
- Improving ROUGE-L performance on legal-domain datasets
- Extending maximum sequence length from 128 to 1024 tokens for long-clause legal documents
Acknowledgments
Thanks to: