license_file: LICENSE.md
library_name: protonx-text-correction
tags:
  - text-to-text
language:
  - vi

High-Accuracy Vietnamese Text Correction v1.3

Introduction

ProtonX Text Correction (v1.3-NC)

ProtonX Text Correction is a specialized Vietnamese text correction model built for high-accuracy normalization of legal and enterprise text. It is optimized for OCR post-processing (including PaddleOCR outputs) and can also clean broader Vietnamese text through diacritic restoration, segmentation repair, and correction of domain-specific terminology.

The model is optimized to clean up real-world OCR mistakes such as:

missing or incorrect diacritics
broken word segmentation
misrecognized legal terms
punctuation artifacts
formatting inconsistencies

Built on a Seq2Seq Transformer architecture, the model is trained on 70,000 correction pairs, including 20,000 pairs manually annotated by expert Vietnamese annotators, covering:

official legal documents
OCR outputs from scanned PDFs
colloquial → standardized legal text

The model operates under strict constraints:

Correction ≠ rewriting
meaning of legal text must never change
no hallucination / no added legal terms
confidence-based correction
no paraphrasing
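
These constraints can also be spot-checked downstream. The following is a minimal sketch, not part of the model or its training procedure: it strips diacritics, case, spacing, and punctuation from both strings and compares the remaining letter skeletons. The function names and the 0.9 threshold are illustrative assumptions. Because a corrected OCR line may legitimately restore a dropped letter, the check uses a similarity ratio rather than strict equality.

```python
import unicodedata
from difflib import SequenceMatcher


def skeleton(text: str) -> str:
    """Reduce text to lowercase base letters and digits only."""
    # đ/Đ do not decompose under NFD, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch.lower()
        for ch in decomposed
        if not unicodedata.combining(ch) and ch.isalnum()
    )


def looks_like_pure_correction(
    source: str, corrected: str, threshold: float = 0.9
) -> bool:
    """Heuristic: True when the two skeletons are nearly identical.

    The threshold is an assumed value, not one published for this model.
    """
    ratio = SequenceMatcher(None, skeleton(source), skeleton(corrected)).ratio()
    return ratio >= threshold
```

An output that fails this check is not necessarily wrong, but it is a reasonable candidate for manual review before it enters a legal workflow.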

LICENSE

This model is released under the ProtonX Text Correction Model License (v1.3-NC).

See LICENSE.md for full terms, conditions, and usage restrictions.

Current Version: v1.3

Highlights

ROUGE-L: Coming soon

The score will be reported on the ProtonX Legal Text Correction Validation Dataset, which will be published in an upcoming release.
Extended maximum sequence length from 32 tokens in v1.2 to 128 tokens in this release.

Quick Usage with Transformers

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "protonx-models/protonx-legal-tc"

# Load the tokenizer and the Seq2Seq model from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Use a GPU when available and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

examples = [
    "can cu bo luat lao dong 2019 va cac van ban huong dan thuc hien.",
]

for text in examples:
    # Inputs longer than 128 tokens are truncated
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128
    ).to(device)

    # Beam search decoding, without gradient tracking
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            num_beams=10,
            max_new_tokens=128,
            length_penalty=1.0,
            early_stopping=True,
            repetition_penalty=1.2,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode the best beam back to text
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"Input:  {text}")
    print(f"Output: {result}")
    print("-" * 30)
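
The snippet above truncates inputs at 128 tokens, so longer text would be cut off. One possible workaround, sketched here under assumptions, is to split long text at sentence-like boundaries and correct each chunk separately. `split_into_chunks` and its `count_tokens` parameter are hypothetical helpers, not part of this model's package or of Transformers. The splitting rule (break after `.`, `;`, `:`, `?`, or `!` followed by whitespace) is also an assumption and may need tuning for real legal documents.

```python
import re
from typing import Callable, List


def split_into_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 128,
) -> List[str]:
    """Greedily pack sentence-like pieces into chunks within a token budget.

    A single piece that alone exceeds max_tokens is still emitted as its own
    chunk; it would be truncated by the tokenizer downstream.
    """
    # Split on whitespace that follows sentence-like punctuation,
    # keeping the punctuation attached to the preceding piece.
    pieces = [p for p in re.split(r"(?<=[.;:?!])\s+", text.strip()) if p]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the tokenizer loaded above, a token counter could be passed as `lambda s: len(tokenizer(s)["input_ids"])`. Each returned chunk can then go through the same `tokenizer` and `model.generate` calls, and the corrected chunks can be joined with spaces.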

Benchmark

ProtonX Legal Text Correction Validation Dataset

| Metric  | Score       |
|---------|-------------|
| ROUGE-L | Coming soon |
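
Until official numbers are published, ROUGE-L can be estimated on your own correction pairs. The sketch below computes a sentence-level ROUGE-L F1 score from the longest common subsequence of whitespace-separated tokens. The tokenization and the exact ROUGE variant used for the official benchmark are not stated here, so this is an assumed, simplified formulation and its scores may not match the reported ones.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Sentence-level ROUGE-L F1 over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Here `reference` is the human-corrected text and `candidate` is the model output. Averaging `rouge_l_f1` over a validation set gives a single comparable number.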

Training Details

Model: Seq2Seq Transformer
Legal-domain augmentation
Beam search decoding
Max sequence length: 256 tokens total (128 tokens for input and 128 tokens for output).
High-precision diacritic + punctuation restoration

Domain Coverage:

Government decrees
Resolutions
Contract clauses
Administrative procedures
OCR-normalized scanned documents

Example Outputs

Input:

Cǎn cú Hién pháp nuóc Cōng hòa xā hi chù nghia Viēt Nam;

Output:

Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;
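
To review what the model changed in a pair like this, a word-level diff can help. The following is a small sketch using Python's standard `difflib`. `changed_spans` is a hypothetical helper name, and splitting on whitespace is an assumption that suits Vietnamese syllable-per-token text but ignores changes in spacing.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def changed_spans(source: str, corrected: str) -> List[Tuple[str, str]]:
    """Return (before, after) pairs for every word span that differs."""
    src_words, out_words = source.split(), corrected.split()
    matcher = SequenceMatcher(None, src_words, out_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            before = " ".join(src_words[i1:i2])
            after = " ".join(out_words[j1:j2])
            changes.append((before, after))
    return changes


for before, after in changed_spans(
    "Điều 1 pham vi điều chỉnh", "Điều 1 phạm vi điều chỉnh"
):
    print(f"{before!r} -> {after!r}")
```

An empty `before` marks an insertion and an empty `after` marks a deletion, either of which may deserve a closer look in legal text.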

Use Cases

Legal OCR text normalization
Standardizing government documents
Contract proofreading
Preprocessing for legal RAG systems
Administrative workflow automation
Compliance document processing

Limitations

Does not paraphrase or rewrite legal clauses
Cannot restore missing semantic content
Primarily optimized for Vietnamese
Not designed for informal social media slang

Future Work

Improving ROUGE-L performance on legal-domain datasets
Extending maximum sequence length from 128 to 1024 tokens for long-clause legal documents

Acknowledgments

Thanks to:

vit5-base