Update readme
README.md
CHANGED
@@ -27,23 +27,21 @@ High-Accuracy Vietnamese Legal Document Correction

 ## **Introduction**

-**ProtonX Legal Text Correction (v1.2-NC)**
+### **ProtonX Legal Text Correction (v1.2-NC)**

-* legal
-* government and administrative documents
-* contracts, decrees, circulars
-* OCR post-processing
-* enterprise compliance systems
-* archival digitization workflows
+A **specialized Vietnamese correction model** engineered for **high-accuracy OCR post-processing**, especially **to fix noisy PaddleOCR outputs** in enterprise and legal workflows.

-
+#### **Best Use Case (Primary Focus)**: **Fixing PaddleOCR text errors**

-
-
-
-
-*
-*
+<img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">
+
+The model is optimized to clean up real-world OCR mistakes such as:
+
+* missing or incorrect diacritics
+* broken word segmentation
+* misrecognized legal terms
+* punctuation artifacts
+* formatting inconsistencies

 Built on a Seq2Seq Transformer architecture, the model is trained on 70,000 correction pairs, including 20,000 pairs manually annotated by expert Vietnamese annotators, covering:

@@ -67,10 +65,8 @@ Strict constraints ensure:
 1. **ROUGE-L: 98.44**
 - Achieved on the ProtonX Legal Correction Validation Dataset. The evaluation dataset will be released in an upcoming public release.

-2. Fixing and normalizing PaddleOCR output to ensure high-quality downstream correction


-<img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">

 ---

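
The README reports a ROUGE-L score but does not say how it was computed. As a rough illustration only, the sketch below shows one common way to score a corrected sentence against a reference: the F1 form of ROUGE-L over whitespace-separated tokens, based on the longest common subsequence (LCS). The function names `lcs_length` and `rouge_l_f1`, the whitespace tokenization, the choice of F1 and the sample sentences are all assumptions made for this example, not ProtonX's evaluation code. The result is on a 0 to 1 scale, whereas the README figure appears to be on a 0 to 100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming, keeping only the previous row of the table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumption: whitespace tokens)."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of candidate tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one diacritic is still wrong in the candidate.
reference = "Căn cứ Luật Tổ chức Chính phủ"
candidate = "Căn cứ Luật Tổ chức Chinh phủ"
score = rouge_l_f1(reference, candidate)
```

Here six of the seven tokens match in order, so precision and recall are both 6/7. A corpus-level figure would typically average this score over all evaluation pairs, though the aggregation used for the README number is not stated.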