protonx-models/protonx-legal-tc

protonx-text-correction

ngoc committed on Nov 21, 2025

Commit 5261e13 · 1 Parent(s): 5206995

Update readme

Files changed (1)

README.md +12 -16

@@ -27,23 +27,21 @@ High-Accuracy Vietnamese Legal Document Correction
 ## **Introduction**
-**ProtonX Legal Text Correction (v1.2-NC)** is a **specialized Vietnamese correction model** optimized for:
-* legal texts
-* government and administrative documents
-* contracts, decrees, circulars
-* OCR post-processing
-* enterprise compliance systems
-* archival digitization workflows
-It corrects:
-* spelling
-* punctuation
-* diacritics
-* grammar
-* legal terminology normalization
-* formal writing style issues
 Built on a Seq2Seq Transformer architecture, the model is trained on 70,000 correction pairs, including 20,000 pairs manually annotated by expert Vietnamese annotators, covering:
@@ -67,10 +65,8 @@ Strict constraints ensure:
 1. **ROUGE-L: 98.44**
 - Achieved on the ProtonX Legal Correction Validation Dataset. The evaluation dataset will be released in an upcoming public release.
-2. Fixing and normalizing PaddleOCR output to ensure high-quality downstream correction
-<img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">
 ---

 ## **Introduction**
+### **ProtonX Legal Text Correction (v1.2-NC)**
+A **specialized Vietnamese correction model** engineered for **high-accuracy OCR post-processing**, especially **to fix noisy PaddleOCR outputs** in enterprise and legal workflows.
+#### **Best Use Case (Primary Focus)**: **Fixing PaddleOCR text errors**
+<img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">
+The model is optimized to clean up real-world OCR mistakes such as:
+* missing or incorrect diacritics
+* broken word segmentation
+* misrecognized legal terms
+* punctuation artifacts
+* formatting inconsistencies
 Built on a Seq2Seq Transformer architecture, the model is trained on 70,000 correction pairs, including 20,000 pairs manually annotated by expert Vietnamese annotators, covering:
 1. **ROUGE-L: 98.44**
 - Achieved on the ProtonX Legal Correction Validation Dataset. The evaluation dataset will be released in an upcoming public release.
 ---
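
The README reports ROUGE-L but does not say how it was computed. As a rough illustration only, the sketch below scores a corrected sentence against its reference using the longest common subsequence (LCS) of tokens. Whitespace tokenization, the F1 variant, and the function names `lcs_length` and `rouge_l_f1` are assumptions made for this example, not details taken from the ProtonX evaluation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Row-by-row dynamic programming; prev[j] holds the LCS length of the
    # tokens of `a` seen so far against the first j tokens of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, prediction):
    """ROUGE-L F1 on whitespace tokens (assumed tokenization)."""
    ref_tokens = reference.split()
    pred_tokens = prediction.split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up strings for illustration: one token differs by a diacritic.
    reference = "Căn cứ Luật Doanh nghiệp"
    prediction = "Căn cứ Luật Doanh nghiêp"
    print(round(rouge_l_f1(reference, prediction) * 100, 2))
```

With these two made-up strings, four of five tokens align, so the score comes out at roughly 80 on the 0 to 100 scale the README uses. A published figure such as 98.44 would normally be an average over a full validation set, and may rely on a different tokenizer or on the recall-only variant.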