metadata
license_file: LICENSE.md
library_name: protonx-text-correction
tags:
  - text-to-text
language:
  - vi

High-Accuracy Vietnamese Text Correction v1.3


Introduction

ProtonX Text Correction (v1.3-NC)

A specialized Vietnamese text correction model engineered for high-accuracy normalization of legal and enterprise text. It is optimized for OCR post-processing (including PaddleOCR outputs) and can also clean broader Vietnamese text through diacritic restoration, segmentation repair, and correction of domain-specific terminology.

It targets real-world OCR mistakes such as:

  • missing or incorrect diacritics
  • broken word segmentation
  • misrecognized legal terms
  • punctuation artifacts
  • formatting inconsistencies

Built on a Seq2Seq Transformer architecture, the model is trained on 70,000 correction pairs, including 20,000 pairs manually annotated by expert Vietnamese annotators, covering:

  • official legal documents
  • OCR outputs from scanned PDFs
  • colloquial → standardized legal text

The model is designed under strict constraints:

  • Correction, not rewriting
  • The meaning of legal text must never change
  • No hallucination and no added legal terms
  • Confidence-based correction
  • No paraphrasing
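
The source does not describe how these constraints are enforced inside the model. As a downstream sanity check, a caller could reject any output that drifts too far from its input. The sketch below is one possible approach, not part of the ProtonX library: the names `strip_diacritics` and `accept_correction` and the `0.8` threshold are assumptions chosen for illustration. It compares the two strings after removing diacritics, so a pure diacritic restoration always passes.

```python
import difflib
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics so strings can be compared by base letters."""
    # "đ"/"Đ" have no Unicode decomposition, so map them by hand first.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (category Mn) left over after decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def accept_correction(original: str, corrected: str, min_ratio: float = 0.8) -> str:
    """Return `corrected` if it stays close to `original`, else keep `original`.

    `min_ratio` is a hypothetical threshold; tune it on your own data.
    """
    a = strip_diacritics(original).lower()
    b = strip_diacritics(corrected).lower()
    ratio = difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
    return corrected if ratio >= min_ratio else original
```

A check like this only catches large rewrites. It cannot detect a small edit that changes legal meaning, so it complements human review rather than replacing it.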

License

This model is released under the ProtonX Text Correction Model License (v1.3-NC).

See LICENSE.md for full terms, conditions, and usage restrictions.

Current Version: v1.3

Highlights

  • ROUGE-L: coming soon. The score will be reported on the ProtonX Legal Correction Validation Dataset, which will be published in an upcoming public release.
  • Maximum sequence length extended from 32 tokens in v1.2 to 128 tokens in this release.

Quick Usage with Transformers

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "protonx-models/protonx-legal-tc"

# Load the tokenizer and the Seq2Seq model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Run on GPU when available and switch to inference mode.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Input without diacritics, as typically produced by OCR or plain typing.
examples = [
    "can cu bo luat lao dong 2019 va cac van ban huong dan thuc hien.",
]

for text in examples:
    # Inputs longer than 128 tokens are truncated.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128
    ).to(device)

    # Beam search decoding; gradients are not needed for inference.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            num_beams=10,
            max_new_tokens=128,
            length_penalty=1.0,
            early_stopping=True,
            repetition_penalty=1.2,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # outputs[0] is the highest-scoring beam.
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"Input:  {text}")
    print(f"Output: {result}")
    print("-" * 30)

Benchmark

ProtonX Legal Text Correction Validation Dataset

| Metric | Score |
| --- | --- |
| ROUGE-L | Coming soon |
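
Until the official score is published, it may help to see what ROUGE-L measures. The sketch below computes a sentence-level ROUGE-L F1 from the longest common subsequence (LCS) of two token lists. It assumes whitespace tokenization and an unweighted F1; the source does not state which tokenization or ROUGE implementation the ProtonX evaluation uses, so scores from this sketch may not match the official ones. The function names are illustrative.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Sentence-level ROUGE-L F1 over whitespace tokens (an assumption)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

Here `reference` would be the human-corrected text and `candidate` the model output for the same input.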

Training Details

  • Model: Seq2Seq Transformer
  • Legal-domain augmentation
  • Beam search decoding
  • Max sequence length: 256 tokens total (128 tokens for input and 128 tokens for output).
  • High-precision diacritic + punctuation restoration
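
Because input is limited to 128 tokens, longer passages need to be split before correction. The following is a minimal sketch of one way to do that, assuming that splitting at sentence-ending punctuation (including `;`, which often closes legal clauses) is acceptable for your documents. The helper `chunk_text` and its `count_tokens` parameter are hypothetical and not part of the ProtonX library.

```python
import re
from typing import Callable, List


def chunk_text(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 128,
) -> List[str]:
    """Greedily pack sentences into chunks of at most `max_tokens` tokens.

    A single sentence longer than `max_tokens` is still emitted as its own
    chunk; the tokenizer would truncate it at inference time.
    """
    # Split after ., ;, ? or ! when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.;?!])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the tokenizer from the usage section, `count_tokens` could be `lambda s: len(tokenizer(s)["input_ids"])`, so that the budget is measured in the model's own tokens rather than in words.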

Domain Coverage:

  • Government decrees
  • Resolutions
  • Contract clauses
  • Administrative procedures
  • OCR-normalized scanned documents

Example Outputs

Input:

Cǎn cú Hién pháp nuóc Cōng hòa xā hi chù nghia Viēt Nam;

Output:

Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;
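
When reviewing corrections on legal text, it can be useful to list exactly which words changed. The sketch below is one way to do this with Python's standard `difflib`, comparing the input and output word by word. The function name `changed_words` is an illustrative assumption and not part of the ProtonX library.

```python
import difflib
from typing import List, Tuple


def changed_words(source: str, corrected: str) -> List[Tuple[str, str]]:
    """List (before, after) word spans that differ between two texts."""
    src, out = source.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, src, out, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Inserted or deleted spans appear with an empty string on one side.
            changes.append((" ".join(src[i1:i2]), " ".join(out[j1:j2])))
    return changes
```

Adjacent changed words are grouped into a single span, so a heavily corrected line may come back as one `(before, after)` pair rather than one pair per word.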

Use Cases

  • Legal OCR text normalization
  • Standardizing government documents
  • Contract proofreading
  • Preprocessing for legal RAG systems
  • Administrative workflow automation
  • Compliance document processing

Limitations

  • Does not paraphrase or rewrite legal clauses
  • Cannot restore missing semantic content
  • Primarily optimized for Vietnamese
  • Not designed for informal social media slang

Future Work

  • Improving ROUGE-L performance on legal-domain datasets
  • Extending maximum sequence length from 128 to 1024 tokens for long-clause legal documents

Acknowledgments

Thanks to: