protonx-legal-tc / README.md
ngoc's picture
add colab link
2bbc28a
|
Raw
History Blame
5.45 kB
---
license_file: LICENSE.md
library_name: protonx-text-correction
tags:
- text-to-text
language:
- vi
---
<div align="center">
<p align="center">
<img src="https://storage.googleapis.com/mle-courses-prod/users/61b6fa1ba83a7e37c8309756/private-files/678dadd0-603b-11ef-b0a7-998b84b38d43-ProtonX_logo_horizontally__1_.png" width="260"/>
</p>
<h1 align="center">
High-Accuracy Vietnamese Text Correction v1.3
</h1>
[![GitHub](https://img.shields.io/badge/ProtonX-GitHub-black?logo=github)](https://github.com/protonx-engineering/protonx-text-correction)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-black?logo=huggingface)](https://huggingface.co/protonx-models/protonx-tc)
[![Website](https://img.shields.io/badge/protonx.co-Website-blue)](https://protonx.co)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17m37QYMG4LO6oyMdkTxNtFzQW8uWDd_-?usp=sharing)
</div>
---
## **Introduction**
<img src="https://storage.googleapis.com/mle-courses-prod/users/61b6fa1ba83a7e37c8309756/private-files/1795a9d0-cb4d-11f0-a59b-27096d42dd86-Screen_Shot_2025-11-27_at_11.53.12.png">
### **ProtonX Text Correction (v1.3-NC)**
A specialized Vietnamese text correction model engineered for high-accuracy normalization of legal and enterprise text. Optimized for OCR post-processing (including PaddleOCR outputs), but also capable of cleaning broader Vietnamese text with diacritic restoration, segmentation repair, and correction of domain-specific terminology.
<img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">
The model is optimized to clean up real-world OCR mistakes such as:
* missing or incorrect diacritics
* broken word segmentation
* misrecognized legal terms
* punctuation artifacts
* formatting inconsistencies
Built on a Seq2Seq Transformer architecture, the model is trained on 70,000 correction pairs, including 20,000 pairs manually annotated by expert Vietnamese annotators, covering:
* official legal documents
* OCR outputs from scanned PDFs
* colloquial → standardized legal text
Strict constraints ensure:
* **Correction ≠ rewriting**
* meaning of legal text must never change
* no hallucination / no added legal terms
* confidence-based correction
* no paraphrasing
---
## **LICENSE**
This model is released under the ProtonX Text Correction Model License (v1.3-NC).
See [LICENSE.md](./LICENSE.md) for full terms, conditions, and usage restrictions.
## **Current Version**: v1.3
## **Highlights**
1. **ROUGE-L: Coming soon**
- Achieved on the ProtonX Legal Correction Validation Dataset. The evaluation dataset will be released in an upcoming public release.
- Extended maximum sequence length from 32 tokens in v1.2 to 128 tokens in this release.
---
## **Quick Usage with Transformers**
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_path = "protonx-models/protonx-legal-tc"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
examples = [
"can cu bo luat lao dong 2019 va cac van ban huong dan thuc hien.",
]
for text in examples:
inputs = tokenizer(
text,
return_tensors="pt",
truncation=True,
max_length=128
).to(device)
with torch.no_grad():
outputs = model.generate(
**inputs,
num_beams=10,
max_new_tokens=128,
length_penalty=1.0,
early_stopping=True,
repetition_penalty=1.2,
no_repeat_ngram_size=2,
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {text}")
print(f"Output: {result}")
print("-" * 30)
```
---
## **Benchmark**
### **ProtonX Legal Text Correction Validation Dataset**
| Metric | Score |
| ------------- | --------- |
| **ROUGE-L** | **Coming soon** |
---
## **Training Details**
* Model: Seq2Seq Transformer
* Legal-domain augmentation
* Beam search decoding
* Max sequence length: 256 tokens total (128 tokens for input and 128 tokens for output).
* High-precision diacritic + punctuation restoration
### Domain Coverage:
* Government decrees
* Resolutions
* Contract clauses
* Administrative procedures
* OCR-normalized scanned documents
---
## **Example Outputs**
**Input:**
```
Cǎn cú Hién pháp nuóc Cōng hòa xā hi chù nghia Viēt Nam;
```
**Output:**
```
Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;
```
---
## **Use Cases**
* Legal OCR text normalization
* Standardizing government documents
* Contract proofreading
* Preprocessing for legal RAG systems
* Administrative workflow automation
* Compliance document processing
---
## **Limitations**
* Does not paraphrase or rewrite legal clauses
* Cannot restore missing semantic content
* Primarily optimized for Vietnamese
* Not designed for informal social media slang
---
## **Future Work**
* Achieving even higher ROUGE-L performance on legal-domain datasets
* Extending maximum sequence length from 128 to 1024 tokens for long-clause legal documents
---
## **Acknowledgments**
Thanks to:
* [vit5-base](https://huggingface.co/VietAI/vit5-base)