---
license_file: LICENSE.md
library_name: protonx-text-correction
tags:
- text-to-text
language:
- vi
---

<div align="center">

<p align="center">
    <img src="https://storage.googleapis.com/mle-courses-prod/users/61b6fa1ba83a7e37c8309756/private-files/678dadd0-603b-11ef-b0a7-998b84b38d43-ProtonX_logo_horizontally__1_.png" width="260"/>
</p>

<h1 align="center">
High-Accuracy Vietnamese Legal Document Correction
</h1>

[![GitHub](https://img.shields.io/badge/ProtonX-GitHub-black?logo=github)](https://github.com/protonx-engineering/protonx-text-correction)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-black?logo=huggingface)](https://huggingface.co/protonx-models/protonx-tc)
[![Website](https://img.shields.io/badge/protonx.co-Website-blue)](https://protonx.co)

</div>

---

## **Introduction**

### **ProtonX Legal Text Correction (v1.2-NC)**

A **specialized Vietnamese correction model** engineered for **high-accuracy OCR post-processing**, especially **to fix noisy PaddleOCR outputs** in enterprise and legal workflows.

#### **Best Use Case (Primary Focus)**: **Fixing PaddleOCR text errors** 

<img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">

The model is optimized to clean up real-world OCR mistakes such as:

* missing or incorrect diacritics
* broken word segmentation
* misrecognized legal terms
* punctuation artifacts
* formatting inconsistencies
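
To make the first error type concrete, here is a minimal sketch that strips diacritics from clean Vietnamese text, which produces input similar to the unaccented example in the usage section. The helper name `strip_diacritics` is hypothetical and is not part of this model or its library; it only simulates one kind of noise and does not reproduce real PaddleOCR errors.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics to simulate one kind of OCR noise.

    Hypothetical helper for building test inputs; not part of the model.
    """
    # "đ" and "Đ" have no canonical decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep the base letters.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("Căn cứ Bộ luật Lao động 2019"))
```

Feeding the stripped text to the model and comparing the result with the original sentence is one simple way to spot-check diacritic restoration.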

Built on a Seq2Seq Transformer architecture, the model is trained on 70,000 correction pairs, including 20,000 pairs manually annotated by expert Vietnamese annotators, covering:

* official legal documents
* OCR outputs from scanned PDFs
* colloquial → standardized legal text

Strict constraints apply:

* **Correction ≠ rewriting**
* The meaning of the legal text must never change
* No hallucination and no added legal terms
* Confidence-based correction
* No paraphrasing
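
These constraints describe how the model is meant to behave; the card does not say how they are enforced. As an illustration only, a downstream pipeline could add its own guard that rejects outputs drifting too far from the input. The sketch below compares diacritic-folded text with `difflib`; the function names and the `0.8` threshold are assumptions, not part of ProtonX.

```python
import difflib
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip diacritics so that accent restoration counts as no change."""
    text = text.lower().replace("đ", "d")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_conservative_edit(source: str, corrected: str, min_ratio: float = 0.8) -> bool:
    """Return True when `corrected` stays close to `source` after folding.

    `min_ratio` is an arbitrary illustrative threshold; tune it on real data.
    """
    ratio = difflib.SequenceMatcher(None, fold(source), fold(corrected)).ratio()
    return ratio >= min_ratio


if __name__ == "__main__":
    noisy = "can cu bo luat lao dong 2019"
    print(is_conservative_edit(noisy, "căn cứ bộ luật lao động 2019"))
    print(is_conservative_edit(noisy, "căn cứ bộ luật lao động 2019 và các nghị định liên quan khác"))
```

Folding before comparison means pure diacritic restoration scores as identical, while inserted words lower the ratio. A guard like this cannot detect a meaning change that keeps the characters similar, so it complements human review rather than replacing it.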

---

## **LICENSE**

This model is released under the ProtonX Text Correction Model License (v1.2-NC).

See [LICENSE.md](./LICENSE.md) for full terms, conditions, and usage restrictions.

## **Highlights**


* **ROUGE-L: 98.44** on the ProtonX Legal Text Correction Validation Dataset. The evaluation dataset is planned for an upcoming public release.




---

## **Quick Usage with Transformers**

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "protonx-models/protonx-legal-tc"

# Load the tokenizer and the Seq2Seq correction model.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Run on GPU when available; eval() disables dropout for inference.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Noisy input: the diacritics are missing.
examples = [
    "can cu bo luat lao dong 2019 va cac van ban huong dan thuc hien.",
]

for text in examples:
    # Tokenize; inputs longer than max_length tokens are truncated.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128
    ).to(device)

    # Beam search decoding; max_new_tokens=32 matches the 32-token
    # output budget listed under Training Details.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            num_beams=10,
            max_new_tokens=32,
            length_penalty=1.0,
            early_stopping=True,
            repetition_penalty=1.2,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode the best beam back to text.
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"Input:  {text}")
    print(f"Output: {result}")
    print("-" * 30)
```

---

## **Benchmark**

### **ProtonX Legal Text Correction Validation Dataset**

| Metric        | Score     |
| ------------- | --------- |
| **ROUGE-L**   | **98.44** |
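
For readers unfamiliar with the metric, ROUGE-L scores a candidate against a reference using their longest common subsequence (LCS). The sketch below computes an F1 variant over whitespace-separated tokens and returns a value in the range 0 to 1. The card does not state which tokenization, ROUGE-L variant, or scale produced the reported score, so those choices are assumptions and the numbers are not directly comparable.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over whitespace tokens (an assumed tokenization)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;"
    candidate = "Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;"
    print(rouge_l_f1(reference, candidate))
```

Because correction outputs are usually almost identical to their references, ROUGE-L tends to be high on this task even for imperfect systems; an exact-match or character error rate alongside it gives a fuller picture.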

---


## **Training Details**

* Model: Seq2Seq Transformer
* Legal-domain augmentation
* Beam search decoding
* Max sequence length: 64 tokens total (32 tokens for input and 32 tokens for output).
* High-precision diacritic + punctuation restoration
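
Given the 32-token input budget, longer passages need to be split before correction. The sketch below greedily packs words into chunks that stay within a budget. The function name `chunk_text` and the default whitespace-based token count are assumptions for illustration; real subword counts are usually higher, so in practice you would pass a counter built from the model's tokenizer, for example `lambda s: len(tokenizer(s)["input_ids"])`.

```python
from typing import Callable, List


def chunk_text(
    text: str,
    max_tokens: int = 32,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily split `text` at word boundaries into chunks within `max_tokens`.

    The default counter counts whitespace-separated words, which is only a
    rough stand-in for the model's subword tokenizer.
    """
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            # A single over-budget word still becomes its own chunk.
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    long_text = " ".join(f"w{i}" for i in range(70))
    for chunk in chunk_text(long_text, max_tokens=32):
        print(len(chunk.split()), chunk[:40])
```

Splitting at arbitrary word boundaries can separate a word from the context the model needs to correct it; cutting at sentence or clause punctuation first, and only then applying the budget, may work better for legal text.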

### **Domain Coverage**

* Government decrees
* Resolutions
* Contract clauses
* Administrative procedures
* OCR-normalized scanned documents

---

## **Example Outputs**


**Input:**

```
Cǎn cú Hién pháp nuóc Cōng hòa xā hi chù nghia Viēt Nam;
```

**Output:**

```
Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;
```

---

## **Use Cases**

* Legal OCR text normalization
* Standardizing government documents
* Contract proofreading
* Preprocessing for legal RAG systems
* Administrative workflow automation
* Compliance document processing

---

## **Limitations**

* Does not paraphrase or rewrite legal clauses
* Cannot restore missing semantic content
* Primarily optimized for Vietnamese
* Not designed for informal social media slang

---

## **Future Work**

* Achieving even higher ROUGE-L performance on legal-domain datasets
* Extending the maximum sequence length from 64 to 256 tokens for long-clause legal documents

---

## **Acknowledgments**

Thanks to:

* [vit5-base](https://huggingface.co/VietAI/vit5-base)