BartPho-Syllable - Vietnamese Diacritic Restoration (LoRA)

Model Details

Model Description

This model is a fine-tuned version of vinai/bartpho-syllable using LoRA (Low-Rank Adaptation). It is designed specifically for Vietnamese diacritic restoration.

Unlike general error-correction systems, this model focuses on a single task: restoring the diacritics missing from Vietnamese text typed without them (e.g., "trang phuc" → "trang phục").

The model does not attempt to correct teencode, slang, spelling mistakes, or grammatical errors beyond diacritic restoration.

The model was trained on a subset of a publicly available dataset, consisting of approximately 2,500,000 sentence pairs split into training, validation, and test sets.

  • Developed by: Thanh-Dan Bui
  • Model type: Seq2Seq (Encoder-Decoder) with LoRA Adapter
  • Language(s): Vietnamese
  • License: MIT
  • Finetuned from model: vinai/bartpho-syllable

Uses

Direct Use

The model is designed exclusively for Vietnamese diacritic restoration. It takes Vietnamese text written without diacritics as input and outputs the same text with the diacritics restored.

Example:

  • Input: "toi dang xu ly mot bai toan them dau cho tieng Viet"
  • Output: "tôi đang xử lý một bài toán thêm dấu cho tiếng Việt"

Out-of-Scope Use

  • General spelling correction beyond diacritics.
  • Normalization of teencode, slang, or informal expressions.
  • Grammar correction or contextual rewriting.
  • Translation from or to other languages.
  • Open-ended text generation.

Bias, Risks, and Limitations

  • Context Length: The model is optimized for sentence-level diacritic restoration (maximum length ~256 tokens). Long paragraphs should be split before processing.
  • Lexical Ambiguity: Words written without diacritics may correspond to multiple valid Vietnamese forms (e.g., "ban" → "bàn", "bạn", "bán"). The model selects the most likely form based on local context, which may occasionally result in incorrect diacritics.
  • Proper Nouns: Foreign names, abbreviations, or uncommon proper nouns may be incorrectly altered if they resemble Vietnamese syllables.
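The context-length limitation above implies that long input should be split before it reaches the model. The following is a minimal sketch of one way to do that, assuming sentences end in ".", "!" or "?" followed by whitespace. The helper names `split_sentences` and `restore_long_text` are hypothetical, and the `restore` argument stands in for whatever function calls the model; none of this is part of the released code.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' when followed by whitespace (naive heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def restore_long_text(text: str, restore: Callable[[str], str]) -> str:
    """Apply `restore` to each sentence and join the results with single spaces.

    `restore` is a placeholder for the model call, e.g. a wrapper around the pipeline.
    """
    return " ".join(restore(sentence) for sentence in split_sentences(text))


if __name__ == "__main__":
    paragraph = "Toi di hoc. Ban co khoe khong? Rat vui!"
    print(split_sentences(paragraph))
```

A punctuation-based splitter like this will mis-split abbreviations that contain a period, so a dedicated sentence segmenter may be a better choice for real text.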

How to Get Started with the Model

You can use this model with the transformers library.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Hugging Face Hub repository that hosts this model
path = "yammdd/vietnamese-diacritic-restoration"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSeq2SeqLM.from_pretrained(path)

# Wrap the model and tokenizer in a text-to-text generation pipeline
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

# Input: Vietnamese text typed without diacritics
text = "hom nay toi rat vui khi hoc xu ly ngon ngu tu nhien"
out = pipe(text, max_new_tokens=256)

print(out[0]["generated_text"])
# Output: hôm nay tôi rất vui khi học xử lý ngôn ngữ tự nhiên

Training Details

Training Data

  • Source: Aggregated Vietnamese text corpus.
  • Task: Vietnamese diacritic restoration.
  • Size: Approximately 2,500,000 sentence pairs (split into Train/Validation/Test sets).
  • Data Format:
    • no_diacritics: Vietnamese text with all diacritics removed.
    • with_diacritics: The same text with correct Vietnamese diacritics restored.
  • Sequence Length: Maximum input and output length of 128 tokens.
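The card does not describe how the `no_diacritics` column was produced. One common approach, sketched here as an assumption rather than the author's actual preprocessing, is to strip diacritics from clean text with Unicode normalization. The function names `strip_diacritics` and `make_pair` are illustrative.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, e.g. 'Việt' becomes 'Viet'."""
    # 'đ' and 'Đ' have no Unicode decomposition, so map them explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (category Mn) and recompose what is left.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def make_pair(sentence: str) -> dict:
    """Build one training example using the two field names from the data format."""
    return {
        "no_diacritics": strip_diacritics(sentence),
        "with_diacritics": sentence,
    }


if __name__ == "__main__":
    print(make_pair("tôi đang xử lý một bài toán thêm dấu cho tiếng Việt"))
```

Because the input side can be generated automatically, any clean Vietnamese corpus can in principle serve as supervision for this task.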

Training Procedure

  • Base Model: vinai/bartpho-syllable
  • Technique: Parameter-Efficient Fine-Tuning (PEFT) using LoRA (Low-Rank Adaptation).
  • LoRA Configuration:
    • Target Modules: q_proj, v_proj, out_proj, fc1, fc2 (covering both attention and feed-forward layers).
    • Rank (r): 32
    • Alpha: 64
    • Dropout: 0.1
  • Precision: FP16 (Mixed Precision) for optimized memory usage and speed.
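To give a feel for what this LoRA configuration means in terms of trainable parameters, the sketch below counts the adapter weights for the listed target modules. The architecture dimensions (hidden size 1024, feed-forward size 4096, 12 encoder and 12 decoder layers) are assumptions about the base model, not values stated in this card. The count also assumes that `q_proj`, `v_proj` and `out_proj` are matched in both the self-attention and cross-attention blocks of the decoder.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed base-model dimensions (not stated in the card).
D_MODEL, FFN_DIM = 1024, 4096
ENC_LAYERS, DEC_LAYERS = 12, 12

# LoRA settings taken from the configuration above.
R, ALPHA = 32, 64

attn = lora_params(D_MODEL, D_MODEL, R)  # one of q_proj, v_proj, out_proj
ffn = lora_params(D_MODEL, FFN_DIM, R)   # fc1 or fc2 (same count either direction)

# Encoder layer: self-attention (q, v, out) plus fc1 and fc2.
enc_layer = 3 * attn + 2 * ffn
# Decoder layer: self-attention and cross-attention (q, v, out each) plus fc1 and fc2.
dec_layer = 6 * attn + 2 * ffn

total = ENC_LAYERS * enc_layer + DEC_LAYERS * dec_layer
scaling = ALPHA / R  # LoRA scales the low-rank update by alpha / r

print(f"trainable LoRA parameters: {total:,}")
print(f"update scaling factor: {scaling}")
```

Under these assumptions the adapter has roughly 14.9 million trainable parameters, and the low-rank update is scaled by a factor of 2.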

Training Hyperparameters

  • Optimizer: AdamW with weight decay of 0.01.
  • Batch Size: 16 per device (Total effective batch size depends on GPU count, typically 32 on 2x T4).
  • Learning Rate: 1e-3.
  • Training Epochs: 1.
  • Evaluation Strategy: Every 2,000 steps.
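  • Label Padding: DataCollatorForSeq2Seq pads labels with label_pad_token_id=-100 so that padded positions are ignored by the cross-entropy loss. This is loss masking rather than label smoothing.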
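To illustrate what `label_pad_token_id=-100` does in practice, here is a minimal pure-Python sketch that pads label sequences with -100 and averages the loss only over real tokens. It mimics the behaviour of the collator together with the default `ignore_index` of PyTorch's cross-entropy loss. The helper names `pad_labels` and `masked_mean_loss` and the per-token loss values are made up for illustration.

```python
from typing import List

IGNORE_INDEX = -100  # value used for padded label positions


def pad_labels(batch: List[List[int]], pad_value: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad every label sequence in the batch to the longest length."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


def masked_mean_loss(token_losses: List[List[float]], labels: List[List[int]]) -> float:
    """Average per-token losses, skipping positions whose label is IGNORE_INDEX."""
    kept = [
        loss
        for loss_row, label_row in zip(token_losses, labels)
        for loss, label in zip(loss_row, label_row)
        if label != IGNORE_INDEX
    ]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    labels = pad_labels([[5, 6, 7], [8]])
    # Hypothetical per-token losses; the values at padded positions are never used.
    losses = [[1.0, 2.0, 3.0], [4.0, 9.0, 9.0]]
    print(labels)
    print(masked_mean_loss(losses, labels))
```

The large loss values at the padded positions have no effect on the result, which is the point of the -100 convention.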

Speeds, Sizes, Times

  • Hardware: 2x NVIDIA T4 GPUs (Kaggle environment).
  • Checkpoint Size: The adapter weights are lightweight (only several megabytes), significantly smaller than the full BARTpho base model.
  • Training Dynamics: Managed via the Hugging Face Seq2SeqTrainer with predict_with_generate enabled for validation metrics.

Evaluation

Testing Data, Factors & Metrics

The model was evaluated on a held-out test set of 5,000 samples, covering a diverse range of Vietnamese sentence structures and lengths.

Metrics

  • BLEU Score: Measures the n-gram overlap between the predicted and target text.
  • Word Error Rate (WER): Measures the ratio of errors (substitutions, deletions, insertions) at the word level.
  • Character Error Rate (CER): Measures the ratio of errors at the character level, which makes it sensitive to diacritic placement.
  • Exact Match Accuracy: The percentage of sentences where every single character matches the ground truth.
  • Word Accuracy: The percentage of individual words correctly predicted (excluding length mismatches).
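The card does not say which implementation produced these metrics. The sketch below shows standard definitions, assuming whitespace tokenization and Levenshtein distance normalized by reference length for WER and CER. For word accuracy it assumes that "excluding length mismatches" means skipping sentence pairs whose word counts differ. All function names are illustrative.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, deletions, insertions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def exact_match(reference: str, hypothesis: str) -> bool:
    """True only if every character of the prediction matches the reference."""
    return reference == hypothesis


def word_accuracy(references: List[str], hypotheses: List[str]) -> float:
    """Share of correctly predicted words over pairs with equal word counts."""
    correct = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        if len(ref_words) != len(hyp_words):
            continue  # assumed handling of length mismatches: skip the pair
        correct += sum(r == h for r, h in zip(ref_words, hyp_words))
        total += len(ref_words)
    return correct / total if total else 0.0


if __name__ == "__main__":
    ref, hyp = "tôi đang học", "tôi dang học"
    print(wer(ref, hyp), cer(ref, hyp), exact_match(ref, hyp))
```

A single missing diacritic costs one full word in WER but only one character in CER, and it is enough to make exact match fail for the whole sentence.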

Results

1. Overall Performance

| Metric | Score | Note |
|---|---|---|
| BLEU | 92.41 | High n-gram overlap with the reference text |
| Word Accuracy | 97.06% | Robust word-level restoration |
| Exact Match | 69.16% | Entire sentence perfectly restored |
| WER | 0.0475 | ~4.75% error rate per word |
| CER | 0.0232 | ~2.32% error rate per character |

2. Accuracy by Sentence Length

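Accuracy drops as sentences get longer. The figures below are consistent with sentence-level exact match (weighted by sample count, they average to the overall 69.16%), so a single wrong diacritic anywhere in a sentence counts the whole sentence as incorrect:

| Category | Length (words) | Accuracy (exact match) | Sample Count |
|---|---|---|---|
| Short | < 10 | 74.95% | 1,469 |
| Medium | 10 - 30 | 71.36% | 2,999 |
| Long | > 30 | 40.79% | 532 |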
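As a quick consistency check on the numbers above, the sample-weighted mean of the three per-length accuracies can be compared against the overall exact-match score. This is a small arithmetic sketch that uses only figures reported in this card.

```python
# (category, sample count, accuracy in percent) as reported in the table above
buckets = [
    ("Short", 1469, 74.95),
    ("Medium", 2999, 71.36),
    ("Long", 532, 40.79),
]

total_samples = sum(count for _, count, _ in buckets)
weighted_accuracy = sum(count * acc for _, count, acc in buckets) / total_samples

print(f"total samples: {total_samples}")
print(f"weighted accuracy: {weighted_accuracy:.2f}%")
```

The counts add up to the 5,000 test samples and the weighted mean rounds to 69.16%, which matches the reported exact-match score and suggests that the per-length accuracy is measured at the sentence level.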

Environmental Impact

  • Hardware Type: 2x NVIDIA Tesla T4 GPUs.
  • Cloud Provider: Kaggle.
  • Training Duration: ~40 hours.
  • Carbon Emitted: Not reported. It can be estimated from the total GPU hours (about 80 GPU-hours if both GPUs ran for the full ~40 hours) and the carbon intensity of the hosting region.
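The estimate mentioned above can be sketched as a short calculation. It assumes both GPUs draw their full 70 W board power for the entire run, which gives an upper bound for the GPUs alone and ignores CPU, memory and data-centre overhead. The carbon intensity passed in the example is a placeholder, not a measured value for the hosting region, and the function names are illustrative.

```python
def gpu_energy_kwh(num_gpus: int, hours: float, watts_per_gpu: float) -> float:
    """Energy used by the GPUs in kWh, assuming constant power draw."""
    return num_gpus * hours * watts_per_gpu / 1000.0


def co2_kg(energy_kwh: float, kg_co2_per_kwh: float) -> float:
    """Emissions in kg CO2-equivalent for a given grid carbon intensity."""
    return energy_kwh * kg_co2_per_kwh


if __name__ == "__main__":
    # 2 GPUs and ~40 hours come from this card; 70 W is the T4 board power.
    energy = gpu_energy_kwh(num_gpus=2, hours=40, watts_per_gpu=70)
    print(f"GPU energy (upper bound): {energy:.1f} kWh")

    # PLACEHOLDER carbon intensity; replace with the real value for the region.
    placeholder_intensity = 0.4
    print(f"emissions at placeholder intensity: {co2_kg(energy, placeholder_intensity):.2f} kg CO2e")
```

With these inputs the GPU energy comes to about 5.6 kWh. The emissions figure depends entirely on the carbon intensity supplied.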

Framework Versions

  • PEFT: 0.18.0
  • Transformers: 4.57.3
  • PyTorch: 2.9.0
  • Datasets: 4.0.0

Note

  • This model is intended for educational and research purposes only.