BartPho-Syllable - Vietnamese Diacritic Restoration (LoRA)
Model Details
Model Description
This model is a fine-tuned version of vinai/bartpho-syllable using LoRA (Low-Rank Adaptation). It is designed specifically for Vietnamese diacritic restoration.
Unlike general error-correction systems, this model focuses on a single task: restoring missing Vietnamese diacritics in text written without tone marks (e.g., "trang phuc" → "trang phục").
The model does not attempt to correct teencode, slang, spelling mistakes, or grammatical errors beyond diacritic restoration.
The model was trained on a subset of a publicly available dataset, consisting of approximately 2,500,000 sentences split into training, validation, and test sets.
- Developed by: Thanh-Dan Bui
- Model type: Seq2Seq (Encoder-Decoder) with LoRA Adapter
- Language(s): Vietnamese
- License: MIT
- Finetuned from model: vinai/bartpho-syllable
Uses
Direct Use
The model is designed exclusively for Vietnamese diacritic restoration. It takes Vietnamese text written without diacritics as input and outputs the same text with correct Vietnamese tone marks restored.
Example:
- Input: "toi dang xu ly mot bai toan them dau cho tieng Viet"
- Output: "tôi đang xử lý một bài toán thêm dấu cho tiếng Việt"
Out-of-Scope Use
- General spelling correction beyond diacritics.
- Normalization of teencode, slang, or informal expressions.
- Grammar correction or contextual rewriting.
- Translation from or to other languages.
- Open-ended text generation.
Bias, Risks, and Limitations
- Context Length: The model is optimized for sentence-level diacritic restoration (maximum length ~256 tokens). Long paragraphs should be split before processing.
- Lexical Ambiguity: Words written without diacritics may correspond to multiple valid Vietnamese forms (e.g., "ban" → "bàn", "bạn", "bán"). The model selects the most likely form based on local context, which may occasionally result in incorrect diacritics.
- Proper Nouns: Foreign names, abbreviations, or uncommon proper nouns may be incorrectly altered if they resemble Vietnamese syllables.
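Since the model works at sentence level, long paragraphs need to be split before they are passed to it. The snippet below is a minimal sketch of one way to do that with a regular expression; the helper name `split_sentences` and the choice of `.`, `!` and `?` as sentence boundaries are assumptions for illustration, not part of the model's own preprocessing.

```python
import re

# Split wherever sentence-ending punctuation is followed by whitespace.
# The lookbehind keeps the punctuation attached to its sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split a paragraph into sentences on '.', '!' or '?' followed by whitespace."""
    parts = _SENTENCE_END.split(text.strip())
    return [part for part in parts if part]


paragraph = "hom nay troi dep. toi di hoc! ban co khoe khong?"
for sentence in split_sentences(paragraph):
    print(sentence)
```

Each resulting sentence can then be restored separately and the outputs joined back together. Abbreviations and decimal numbers would need extra handling that this sketch does not attempt.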
How to Get Started with the Model
You can use this model with the transformers library.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
path = "yammdd/vietnamese-diacritic-restoration"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSeq2SeqLM.from_pretrained(path)
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
text = "hom nay toi rat vui khi hoc xu ly ngon ngu tu nhien"
out = pipe(text, max_new_tokens=256)
print(out[0]["generated_text"])
# Output: hôm nay tôi rất vui khi học xử lý ngôn ngữ tự nhiên
Training Details
Training Data
- Source: Aggregated Vietnamese text corpus.
- Task: Vietnamese diacritic restoration.
- Size: Approximately 2,500,000 sentence pairs (split into Train/Validation/Test sets).
- Data Format:
  - no_diacritics: Vietnamese text with all diacritics removed.
  - with_diacritics: The same text with correct Vietnamese diacritics restored.
- Sequence Length: Maximum input and output length of 128 tokens.
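Pairs in the format above can be derived from any diacritized Vietnamese sentence by stripping its diacritics. The following is a minimal sketch using only the standard library; the helper names `strip_diacritics` and `make_pair` are hypothetical, and this is not necessarily how the training data was actually produced.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, e.g. 'tiếng Việt' -> 'tieng Viet'."""
    # 'đ' and 'Đ' have no canonical decomposition, so map them explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD separates base letters from their combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(sentence: str) -> dict:
    """Build one training record with the two fields described above."""
    return {
        "no_diacritics": strip_diacritics(sentence),
        "with_diacritics": sentence,
    }


pair = make_pair("tôi đang xử lý một bài toán thêm dấu cho tiếng Việt")
print(pair["no_diacritics"])
```

The same function is also useful at evaluation time for generating inputs from a diacritized reference text.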
Training Procedure
- Base Model: vinai/bartpho-syllable
- Technique: Parameter-Efficient Fine-Tuning (PEFT) using LoRA (Low-Rank Adaptation).
- LoRA Configuration:
  - Target Modules: q_proj, v_proj, out_proj, fc1, fc2 (covering both attention and feed-forward layers).
  - Rank (r): 32
  - Alpha: 64
  - Dropout: 0.1
- Precision: FP16 (Mixed Precision) for optimized memory usage and speed.
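To give a feel for what the rank and alpha values imply, the sketch below computes the LoRA scaling factor and the number of extra trainable parameters a single adapted linear layer receives. It assumes the standard LoRA scaling of alpha / r (not a rank-stabilized variant), and the layer dimensions `HIDDEN` and `FFN` are hypothetical placeholders rather than values taken from this model card.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scaling applied to the low-rank update: alpha / r."""
    return alpha / rank


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer.

    The update is B @ A, with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


RANK, ALPHA = 32, 64          # values from the configuration above
HIDDEN, FFN = 1024, 4096      # hypothetical layer sizes, for illustration only

print(lora_scaling(ALPHA, RANK))                # 2.0
print(lora_extra_params(HIDDEN, HIDDEN, RANK))  # 65536 (attention projection)
print(lora_extra_params(HIDDEN, FFN, RANK))     # 163840 (feed-forward layer)
```

Summing these counts over every targeted module in every layer gives the total adapter size under the assumed dimensions.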
Training Hyperparameters
- Optimizer: AdamW with weight decay of 0.01.
- Batch Size: 16 per device (Total effective batch size depends on GPU count, typically 32 on 2x T4).
- Learning Rate: 1e-3.
- Training Epochs: 1.
- Evaluation Strategy: Every 2,000 steps.
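- Label Padding: DataCollatorForSeq2Seq with label_pad_token_id=-100 pads labels with -100 so that padded positions are ignored by the loss. (This masks padding; it does not apply label smoothing.)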
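The effect of padding labels with -100 can be illustrated without any deep learning framework. The pure-Python sketch below pads a batch of label sequences and averages per-token losses only over real tokens; the helper names are hypothetical, and actual training relies on DataCollatorForSeq2Seq together with PyTorch's cross-entropy loss, whose default ignore_index is -100.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss


def pad_labels(batch, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


def masked_mean(token_losses, labels, ignore_index=IGNORE_INDEX):
    """Average per-token losses, skipping positions labelled ignore_index."""
    kept = [loss for loss, label in zip(token_losses, labels) if label != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0


labels = pad_labels([[5, 8, 2], [7, 2]])
print(labels)  # [[5, 8, 2], [7, 2, -100]]

# The third loss value belongs to a padded position and is ignored.
print(masked_mean([0.5, 1.5, 9.0], labels[1]))  # 1.0
```

Without the mask, the padded position would contribute to the average and distort the gradient for short sequences.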
Speeds, Sizes, Times
- Hardware: 2x NVIDIA T4 GPUs (Kaggle environment).
- Checkpoint Size: The adapter weights are lightweight (only several megabytes), significantly smaller than the full BARTpho base model.
- Training Dynamics: Managed via the Hugging Face Seq2SeqTrainer with predict_with_generate enabled for validation metrics.
Evaluation
Testing Data, Factors & Metrics
The model was evaluated on a held-out test set of 5,000 samples, covering a diverse range of Vietnamese sentence structures and lengths.
Metrics
- BLEU Score: Measures the n-gram overlap between the predicted and target text.
- Word Error Rate (WER): Measures the ratio of errors (substitutions, deletions, insertions) at the word level.
- Character Error Rate (CER): Measures accuracy at the character level, which is critical for verifying diacritic placement.
- Exact Match Accuracy: The percentage of sentences where every single character matches the ground truth.
- Word Accuracy: The percentage of individual words correctly predicted (excluding length mismatches).
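The error-rate metrics above can be computed with a plain Levenshtein distance. The following is a minimal sketch, not the evaluation script used for this model: it assumes corpus-level WER and CER (total edits divided by total reference length) and interprets "excluding length mismatches" as skipping sentence pairs whose word counts differ. BLEU is left out because it is normally computed with a dedicated library.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def evaluate(refs, hyps):
    """Return WER, CER, exact match and word accuracy for parallel sentence lists."""
    # Normalize so composed and decomposed diacritics compare equal.
    refs = [unicodedata.normalize("NFC", r) for r in refs]
    hyps = [unicodedata.normalize("NFC", h) for h in hyps]

    word_errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    char_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    exact = sum(r == h for r, h in zip(refs, hyps))

    correct = total = 0
    for r, h in zip(refs, hyps):
        ref_words, hyp_words = r.split(), h.split()
        if len(ref_words) != len(hyp_words):
            continue  # assumed handling of length mismatches
        correct += sum(a == b for a, b in zip(ref_words, hyp_words))
        total += len(ref_words)

    return {
        "wer": word_errors / sum(len(r.split()) for r in refs),
        "cer": char_errors / sum(len(r) for r in refs),
        "exact_match": exact / len(refs),
        "word_accuracy": correct / total if total else 0.0,
    }


scores = evaluate(["tôi đang học"], ["tôi dang học"])
print(scores)
```

In this toy example one of three words is wrong, so WER is 1/3 and word accuracy is 2/3, while CER is much lower because only a single character differs.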
Results
1. Overall Performance
| Metric | Score | Note |
|---|---|---|
| BLEU | 92.41 | High n-gram overlap with the reference text |
| Word Accuracy | 97.06% | Robust word-level correction |
| Exact Match | 69.16% | Entire sentence perfectly restored |
| WER | 0.0475 | ~4.75% error rate per word |
| CER | 0.0232 | ~2.32% error rate per character |
2. Accuracy by Sentence Length
Exact-match accuracy drops as the input gets longer:
| Category | Length (words) | Exact Match Accuracy | Sample Count |
|---|---|---|---|
| Short | < 10 | 74.95% | 1,469 |
| Medium | 10 - 30 | 71.36% | 2,999 |
| Long | > 30 | 40.79% | 532 |
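The per-length figures appear to be sentence-level exact-match rates: weighting each bucket by its sample count reproduces the overall Exact Match score reported earlier. The short check below uses only the numbers from the two tables.

```python
# (category, accuracy in percent, sample count) taken from the table above
buckets = [
    ("Short", 74.95, 1469),
    ("Medium", 71.36, 2999),
    ("Long", 40.79, 532),
]

total_samples = sum(count for _, _, count in buckets)
weighted_accuracy = sum(acc * count for _, acc, count in buckets) / total_samples

print(total_samples)                 # 5000
print(round(weighted_accuracy, 2))   # 69.16
```

The bucket sizes sum to the 5,000-sample test set, and the weighted average matches the 69.16% Exact Match figure.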
Environmental Impact
- Hardware Type: 2 x NVIDIA Tesla T4 GPUs.
- Cloud Provider: Kaggle.
- Training Duration: ~40 hours.
- Carbon Emitted: Not reported. It can be estimated from the total GPU hours and the carbon intensity of the hosting region.
Framework Versions
- PEFT: 0.18.0
- Transformers: 4.57.3
- PyTorch: 2.9.0
- Datasets: 4.0.0
Note
- This model is intended for educational and research purposes only.