ngoc committed on
Commit
5261e13
·
1 Parent(s): 5206995

Update readme

  1. README.md +12 -16
README.md CHANGED
@@ -27,23 +27,21 @@ High-Accuracy Vietnamese Legal Document Correction
 
  ## **Introduction**
 
- **ProtonX Legal Text Correction (v1.2-NC)** is a **specialized Vietnamese correction model** optimized for:
+ ### **ProtonX Legal Text Correction (v1.2-NC)**
 
- * legal texts
- * government and administrative documents
- * contracts, decrees, circulars
- * OCR post-processing
- * enterprise compliance systems
- * archival digitization workflows
+ A **specialized Vietnamese correction model** engineered for **high-accuracy OCR post-processing**, especially **to fix noisy PaddleOCR outputs** in enterprise and legal workflows.
 
- It corrects:
+ #### **Best Use Case (Primary Focus)**: **Fixing PaddleOCR text errors**
 
- * spelling
- * punctuation
- * diacritics
- * grammar
- * legal terminology normalization
- * formal writing style issues
+ <img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">
+
+ The model is optimized to clean up real-world OCR mistakes such as:
+
+ * missing or incorrect diacritics
+ * broken word segmentation
+ * misrecognized legal terms
+ * punctuation artifacts
+ * formatting inconsistencies
 
  Built on a Seq2Seq Transformer architecture, the model is trained on 70,000 correction pairs, including 20,000 pairs manually annotated by expert Vietnamese annotators, covering:
 
@@ -67,10 +65,8 @@ Strict constraints ensure:
  1. **ROUGE-L: 98.44**
  - Achieved on the ProtonX Legal Correction Validation Dataset. The evaluation dataset will be released in an upcoming public release.
 
- 2. Fixing and normalizing PaddleOCR output to ensure high-quality downstream correction
 
 
- <img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">
 
  ---
 
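
Reviewer note on the ROUGE-L figure quoted in the README: the README does not say which ROUGE-L variant or tokenization produced 98.44, so the following is only a minimal sketch of how such a score could be computed. It assumes whitespace tokenization and the balanced F1 form of ROUGE-L based on the longest common subsequence (LCS). The function names `lcs_length` and `rouge_l_f1` are illustrative and are not part of any ProtonX tooling.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    # Rolling one-row dynamic programme: prev[j] holds the LCS length of the
    # tokens of `a` processed so far against the first j tokens of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between a reference and a candidate string.

    Assumption: tokens are whitespace-separated; the README does not
    specify the tokenization used for the reported score.
    """
    ref = reference.split()
    cand = candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = "Căn cứ Luật Doanh nghiệp năm 2020"
    corrected = "Căn cứ Luật Doanh nghiêp năm 2020"  # one diacritic still wrong
    print(f"ROUGE-L F1: {rouge_l_f1(gold, corrected) * 100:.2f}")
```

Under these assumptions, a score on the README's 0 to 100 scale would be the mean of `rouge_l_f1(...) * 100` over all validation pairs. Because ROUGE-L rewards long in-order token overlap, it is naturally high for correction tasks where most of the input is already correct.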