---
license_file: LICENSE.md
library_name: protonx-text-correction
tags:
- text-to-text
language:
- vi
---

# High-Accuracy Vietnamese Text Correction v1.3

[![GitHub](https://img.shields.io/badge/ProtonX-GitHub-black?logo=github)](https://github.com/protonx-engineering/protonx-text-correction) [![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-black?logo=huggingface)](https://huggingface.co/protonx-models/protonx-tc) [![Website](https://img.shields.io/badge/protonx.co-Website-blue)](https://protonx.co) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17m37QYMG4LO6oyMdkTxNtFzQW8uWDd_-?usp=sharing)
---

## **Introduction**

### **ProtonX Text Correction (v1.3-NC)**

A specialized Vietnamese text correction model engineered for high-accuracy normalization of legal and enterprise text. It is optimized for OCR post-processing (including PaddleOCR outputs), but it can also clean broader Vietnamese text through diacritic restoration, segmentation repair, and correction of domain-specific terminology.

The model targets real-world OCR mistakes such as:

* missing or incorrect diacritics
* broken word segmentation
* misrecognized legal terms
* punctuation artifacts
* formatting inconsistencies

Built on a Seq2Seq Transformer architecture, the model is trained on 70,000 correction pairs, including 20,000 pairs manually annotated by expert Vietnamese annotators, covering:

* official legal documents
* OCR outputs from scanned PDFs
* colloquial → standardized legal text

Strict constraints ensure:

* **Correction ≠ rewriting**
* meaning of legal text must never change
* no hallucination / no added legal terms
* confidence-based correction
* no paraphrasing

---

## **LICENSE**

This model is released under the ProtonX Text Correction Model License (v1.3-NC). See [LICENSE.md](./LICENSE.md) for full terms, conditions, and usage restrictions.

## **Current Version**: v1.3

## **Highlights**

1. **ROUGE-L: Coming soon**
   - The score will be reported on the ProtonX Legal Correction Validation Dataset. The evaluation dataset will be released in an upcoming public release.
2. **Maximum sequence length extended** from 32 tokens in v1.2 to 128 tokens in this release.
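Because this release accepts at most 128 input tokens, longer passages have to be split before correction. The following is a minimal sketch of one way to do that and is not part of the ProtonX package: `split_into_chunks` and `count_tokens` are hypothetical names, and the punctuation-based sentence split is an assumption that will also break after numbered headings such as "Điều 1.". With the real model, `count_tokens` could be something like `lambda s: len(tokenizer(s)["input_ids"])`; the whitespace counter below only keeps the example self-contained.

```python
import re
from typing import Callable, List


def split_into_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 128,
) -> List[str]:
    """Greedily pack sentences into chunks of up to `max_tokens` tokens.

    A single sentence that is already longer than `max_tokens` is kept
    whole, so it would still be truncated by the tokenizer downstream.
    """
    # Assumption: sentences end with '.', ';', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.;!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def whitespace_count(s: str) -> int:
    """Stand-in token counter (hypothetical); swap in the model tokenizer."""
    return len(s.split())


if __name__ == "__main__":
    sample = "can cu bo luat lao dong 2019; va cac van ban huong dan thuc hien."
    for chunk in split_into_chunks(sample, whitespace_count, max_tokens=8):
        print(chunk)
```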
---

## **Quick Usage with Transformers**

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hugging Face model ID
model_path = "protonx-models/protonx-legal-tc"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Run on GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

examples = [
    "can cu bo luat lao dong 2019 va cac van ban huong dan thuc hien.",
]

for text in examples:
    # Inputs longer than 128 tokens are truncated
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128
    ).to(device)

    # Beam search decoding without gradient tracking
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            num_beams=10,
            max_new_tokens=128,
            length_penalty=1.0,
            early_stopping=True,
            repetition_penalty=1.2,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    result = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"Input: {text}")
    print(f"Output: {result}")
    print("-" * 30)
```

---

## **Benchmark**

### **ProtonX Legal Text Correction Validation Dataset**

| Metric | Score |
| ------------- | --------- |
| **ROUGE-L** | **Coming soon** |

---

## **Training Details**

* Model: Seq2Seq Transformer
* Legal-domain augmentation
* Beam search decoding
* Max sequence length: 256 tokens total (128 tokens for input and 128 tokens for output).
* High-precision diacritic + punctuation restoration

### Domain Coverage:

* Government decrees
* Resolutions
* Contract clauses
* Administrative procedures
* OCR-normalized scanned documents

---

## **Example Outputs**

**Input:**

```
Cǎn cú Hién pháp nuóc Cōng hòa xā hi chù nghia Viēt Nam;
```

**Output:**

```
Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;
```

---

## **Use Cases**

* Legal OCR text normalization
* Standardizing government documents
* Contract proofreading
* Preprocessing for legal RAG systems
* Administrative workflow automation
* Compliance document processing

---

## **Limitations**

* Does not paraphrase or rewrite legal clauses
* Cannot restore missing semantic content
* Primarily optimized for Vietnamese
* Not designed for informal social media slang

---

## **Future Work**

* Achieving even higher ROUGE-L performance on legal-domain datasets
* Extending maximum sequence length from 128 to 1024 tokens for long-clause legal documents

---

## **Acknowledgments**

Thanks to:

* [vit5-base](https://huggingface.co/VietAI/vit5-base)