| --- |
| license_file: LICENSE.md |
| library_name: protonx-text-correction |
| tags: |
| - text-to-text |
| language: |
| - vi |
| --- |
| |
| <div align="center"> |
|
|
| <p align="center"> |
| <img src="https://storage.googleapis.com/mle-courses-prod/users/61b6fa1ba83a7e37c8309756/private-files/678dadd0-603b-11ef-b0a7-998b84b38d43-ProtonX_logo_horizontally__1_.png" width="260"/> |
| </p> |
| |
| <h1 align="center"> |
| High-Accuracy Vietnamese Text Correction v1.3 |
| </h1> |
|
|
| [GitHub](https://github.com/protonx-engineering/protonx-text-correction) |
| [Hugging Face](https://huggingface.co/protonx-models/protonx-tc) |
| [Website](https://protonx.co) |
| [Open in Colab](https://colab.research.google.com/drive/17m37QYMG4LO6oyMdkTxNtFzQW8uWDd_-?usp=sharing) |
|
|
|
|
| </div> |
|
|
| --- |
|
|
| ## **Introduction** |
|
|
| <img src="https://storage.googleapis.com/mle-courses-prod/users/61b6fa1ba83a7e37c8309756/private-files/1795a9d0-cb4d-11f0-a59b-27096d42dd86-Screen_Shot_2025-11-27_at_11.53.12.png"> |
|
|
| ### **ProtonX Text Correction (v1.3-NC)** |
|
|
| ProtonX Text Correction is a specialized Vietnamese text correction model built for high-accuracy normalization of legal and enterprise text. It is optimized for OCR post-processing (including PaddleOCR outputs) and can also clean broader Vietnamese text by restoring diacritics, repairing word segmentation, and correcting domain-specific terminology. |
|
|
|
|
| <img src="https://protonx.co/assets/img/paddle-ocr-protonx.png"> |
|
|
| The model is optimized to clean up real-world OCR mistakes such as: |
|
|
| * missing or incorrect diacritics |
| * broken word segmentation |
| * misrecognized legal terms |
| * punctuation artifacts |
| * formatting inconsistencies |
|
|
| The model uses a Seq2Seq Transformer architecture and is trained on 70,000 correction pairs, 20,000 of them manually annotated by expert Vietnamese annotators. The training data covers: |
|
|
| * official legal documents |
| * OCR outputs from scanned PDFs |
| * colloquial-to-standardized legal text pairs |
|
|
| The model is trained under strict constraints: |
|
|
| * **Correction ≠ rewriting**: the model does not paraphrase |
| * The meaning of legal text must never change |
| * No hallucination and no added legal terms |
| * Confidence-based correction |
|
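One way to check the "correction ≠ rewriting" constraint downstream is to compare the model output with its input after folding away case and diacritics. The sketch below is an illustration only and is not part of the ProtonX model: the helper names `fold` and `accept_correction` and the 0.85 threshold are assumptions chosen for this example.

```python
import unicodedata
from difflib import SequenceMatcher


def fold(text: str) -> str:
    """Lowercase and strip diacritics so only base letters remain."""
    # "đ" has no Unicode decomposition, so map it to "d" explicitly.
    text = text.lower().replace("đ", "d")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accept_correction(source: str, candidate: str, min_ratio: float = 0.85) -> str:
    """Return candidate if it stays close to source, otherwise keep source.

    min_ratio is a hypothetical threshold, not a value from the model card.
    """
    ratio = SequenceMatcher(None, fold(source), fold(candidate)).ratio()
    return candidate if ratio >= min_ratio else source
```

A diacritic-only fix folds to the same string as its input and passes the check, while a paraphrase falls below the threshold and the original text is kept.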
|
| --- |
|
|
| ## **LICENSE** |
|
|
| This model is released under the ProtonX Text Correction Model License (v1.3-NC). |
|
|
| See [LICENSE.md](./LICENSE.md) for full terms, conditions, and usage restrictions. |
|
|
| ## **Current Version**: v1.3 |
|
|
|
|
| ## **Highlights** |
|
|
|
|
| 1. **ROUGE-L: coming soon.** The score will be reported on the ProtonX Legal Text Correction Validation Dataset, which will be published in an upcoming public release. |
| 2. **Longer inputs.** The maximum sequence length is extended from 32 tokens in v1.2 to 128 tokens in this release. |
|
|
|
|
| --- |
|
|
| ## **Quick Usage with Transformers** |
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "protonx-models/protonx-legal-tc"

# Load the tokenizer and the Seq2Seq correction model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Run on GPU when available, otherwise on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

examples = [
    "can cu bo luat lao dong 2019 va cac van ban huong dan thuc hien.",
]

for text in examples:
    # Inputs longer than 128 tokens are truncated
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
    ).to(device)

    # Beam search decoding without gradient tracking
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            num_beams=10,
            max_new_tokens=128,
            length_penalty=1.0,
            early_stopping=True,
            repetition_penalty=1.2,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Take the best beam and drop special tokens
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"Input:  {text}")
    print(f"Output: {result}")
    print("-" * 30)
```
|
|
| --- |
|
|
| ## **Benchmark** |
|
|
| ### **ProtonX Legal Text Correction Validation Dataset** |
|
|
| | Metric | Score | |
| | ------------- | --------- | |
| | **ROUGE-L** | **Coming soon** | |
|
|
| --- |
|
|
|
|
| ## **Training Details** |
|
|
| * Model: Seq2Seq Transformer |
| * Legal-domain augmentation |
| * Beam search decoding |
| * Max sequence length: 256 tokens total (128 tokens for input and 128 tokens for output). |
| * High-precision diacritic + punctuation restoration |
|
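Because the input side is limited to 128 tokens, longer documents need to be split before correction. The sketch below greedily packs sentence-like segments into chunks under a token budget. The function name `split_for_model` is hypothetical, and the default whitespace counter is only a stand-in for the model tokenizer.

```python
import re
from typing import Callable, List


def split_for_model(
    text: str,
    max_tokens: int = 128,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack sentence-like segments into chunks of at most max_tokens.

    A single segment longer than max_tokens is kept whole, so the tokenizer's
    own truncation would still apply to it.
    """
    # Split on whitespace that follows '.', ';' or ':', keeping the punctuation.
    segments = [s for s in re.split(r"(?<=[.;:])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for segment in segments:
        joined = f"{current} {segment}".strip()
        if current and count_tokens(joined) > max_tokens:
            # Adding this segment would exceed the budget: close the chunk.
            chunks.append(current)
            current = segment
        else:
            current = joined
    if current:
        chunks.append(current)
    return chunks
```

For real token counts, pass a counter built on the loaded tokenizer, for example `count_tokens=lambda s: len(tokenizer(s)["input_ids"])`, then correct each chunk separately and join the results.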
|
| ### **Domain Coverage** |
|
|
| * Government decrees |
| * Resolutions |
| * Contract clauses |
| * Administrative procedures |
| * OCR-normalized scanned documents |
|
|
| --- |
|
|
| ## **Example Outputs** |
|
|
|
|
| **Input:** |
|
|
| ``` |
| Cǎn cú Hién pháp nuóc Cōng hòa xā hi chù nghia Viēt Nam; |
| ``` |
|
|
| **Output:** |
|
|
| ``` |
| Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam; |
| ``` |
|
|
| --- |
|
|
| ## **Use Cases** |
|
|
| * Legal OCR text normalization |
| * Standardizing government documents |
| * Contract proofreading |
| * Preprocessing for legal RAG systems |
| * Administrative workflow automation |
| * Compliance document processing |
|
|
| --- |
|
|
| ## **Limitations** |
|
|
| * Does not paraphrase or rewrite legal clauses |
| * Cannot restore missing semantic content |
| * Primarily optimized for Vietnamese |
| * Not designed for informal social media slang |
|
|
| --- |
|
|
| ## **Future Work** |
|
|
| * Improving ROUGE-L performance on legal-domain datasets |
| * Extending the maximum sequence length from 128 to 1024 tokens for long-clause legal documents |
|
| --- |
|
|
| ## **Acknowledgments** |
|
|
| Thanks to: |
|
|
| * [vit5-base](https://huggingface.co/VietAI/vit5-base) |
|
|