---
language:
- vi
license: apache-2.0
tags:
- t5
- vietnamese
- text2text-generation
- fill-mask
- denoising
datasets:
- VTSNLP/vietnamese_curated_dataset
pipeline_tag: fill-mask
library_name: transformers
---

# T5-Small Vietnamese

A T5-small model adapted to Vietnamese through continual pretraining with the ViT5 tokenizer.

## Model Description

This model combines:
- **Architecture**: [google-t5/t5-small](https://huggingface.co/google-t5/t5-small) (~60M parameters)
- **Tokenizer**: [VietAI/vit5-base](https://huggingface.co/VietAI/vit5-base) tokenizer (Vietnamese-optimized)
- **Pretraining**: Span corruption denoising objective on Vietnamese text

The model was created by:
1. Loading the T5-small architecture
2. Replacing the tokenizer with ViT5's Vietnamese tokenizer
3. Resizing the embedding layer to match the new vocabulary
4. Pretraining on a Vietnamese corpus
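
Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration, not the actual script used to build this model: the function names `adapt_model_to_tokenizer` and `build_vietnamese_t5` are hypothetical, and the sketch assumes the standard `resize_token_embeddings` method from `transformers`. Step 4 (pretraining) is not shown.

```python
def adapt_model_to_tokenizer(model, tokenizer):
    """Resize the model's token embeddings to the tokenizer's vocabulary size.

    `model` is expected to expose `resize_token_embeddings` (as Hugging Face
    models do) and `tokenizer` to support `len()`.
    """
    model.resize_token_embeddings(len(tokenizer))  # modifies the model in place
    return model


def build_vietnamese_t5(
    architecture="google-t5/t5-small",   # step 1: source of the architecture
    tokenizer_name="VietAI/vit5-base",   # step 2: source of the tokenizer
):
    """Hypothetical helper: downloads both checkpoints from the Hub."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained(architecture)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model = adapt_model_to_tokenizer(model, tokenizer)  # step 3
    return model, tokenizer
```

Embedding rows that are added or reassigned by the resize do not carry Vietnamese-specific knowledge, which is why the pretraining step follows.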

## Training Details

### Training Data
- **Dataset**: [VTSNLP/vietnamese_curated_dataset](https://huggingface.co/datasets/VTSNLP/vietnamese_curated_dataset)
- **Samples**: 10,000,000 of the 12,169,131 text samples in the dataset (30 GB)
- **Max Length**: 4,056 tokens

### Pretraining Objective
- **Method**: Span Corruption (T5-style denoising)
- **Noise Density**: 15%
- **Mean Span Length**: 3.0 tokens
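
To make the objective concrete, here is a simplified, pure-Python sketch of span corruption using the two settings listed above. It follows the general T5 scheme (mask roughly 15% of the tokens in spans averaging 3 tokens, replace each span with a sentinel), but it is not the training code used for this model; `random_segmentation` and `span_corrupt` are illustrative names, and the input is assumed to be an already tokenized list.

```python
import random


def random_segmentation(num_items, num_segments, rng):
    """Split num_items into num_segments non-empty lengths, chosen at random."""
    cuts = sorted(rng.sample(range(1, num_items), num_segments - 1))
    bounds = [0] + cuts + [num_items]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def span_corrupt(tokens, noise_density=0.15, mean_span_length=3.0, seed=0):
    """Return (inputs, targets) for one T5-style denoising example."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least 2 tokens")
    rng = random.Random(seed)

    # How many tokens to mask, and how many spans to group them into.
    num_noise = min(max(round(n * noise_density), 1), n - 1)
    num_keep = n - num_noise
    num_spans = max(round(num_noise / mean_span_length), 1)
    num_spans = min(num_spans, num_keep)

    noise_lens = random_segmentation(num_noise, num_spans, rng)
    keep_lens = random_segmentation(num_keep, num_spans, rng)

    inputs, targets, pos = [], [], 0
    # Alternate kept and masked segments; each masked span gets one sentinel.
    for i, (keep, noise) in enumerate(zip(keep_lens, noise_lens)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[pos:pos + keep])
        inputs.append(sentinel)
        pos += keep
        targets.append(sentinel)
        targets.extend(tokens[pos:pos + noise])
        pos += noise
    return inputs, targets


example = [f"tok{i}" for i in range(40)]
corrupted, target = span_corrupt(example)
print(" ".join(corrupted))
print(" ".join(target))
```

With 40 tokens, this masks 6 tokens in 2 spans. Because kept and masked segments alternate starting with a kept one, the sequence always ends in a masked span in this sketch.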

## Usage

### Basic Usage (Fill-mask / Denoising)

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("nbdaaa/t5-small-vietnamese")
tokenizer = AutoTokenizer.from_pretrained("nbdaaa/t5-small-vietnamese")

text = "Bến Tre là <extra_id_0> của Việt Nam."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

**Expected output:**

```
<extra_id_0> một trong những tỉnh </s>
```
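
The decoded output contains only the predicted spans, each preceded by its sentinel token. A small helper can merge those spans back into the masked input. The sketch below is one possible way to do that with plain string handling; `parse_sentinel_output` and `fill_sentinels` are hypothetical names, not part of `transformers`.

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")


def parse_sentinel_output(decoded):
    """Map each sentinel index to the text generated after it."""
    decoded = decoded.replace("<pad>", "").replace("</s>", "")
    # With one capture group, split() yields [prefix, id0, text0, id1, text1, ...].
    parts = SENTINEL.split(decoded)
    return {int(idx): text.strip() for idx, text in zip(parts[1::2], parts[2::2])}


def fill_sentinels(masked_text, decoded):
    """Replace each sentinel in the input with the span the model predicted."""
    spans = parse_sentinel_output(decoded)
    # Sentinels the model did not predict are left in place.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), masked_text)


masked = "Bến Tre là <extra_id_0> của Việt Nam."
decoded = "<extra_id_0> một trong những tỉnh </s>"
print(fill_sentinels(masked, decoded))
# Bến Tre là một trong những tỉnh của Việt Nam.
```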

---

### Additional Examples

```python
test_cases = [
    "Hà Nội là <extra_id_0> của Việt Nam.",
    "Phở là món <extra_id_0> nổi tiếng của Việt Nam.",
    "Tôi <extra_id_0> học.",
    "Tiếng Việt là ngôn ngữ <extra_id_0> của người Việt.",
    "Con mèo đang <extra_id_0> trên ghế.",
    "Việt Nam là một <extra_id_0> nằm ở <extra_id_1> Á."
]

for text in test_cases:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=50,
        num_beams=4,
        early_stopping=True
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

---

## Zero-shot Downstream Task Examples

Although the model is **not fine-tuned for specific downstream tasks**, it can perform several tasks in a **zero-shot** manner by leveraging T5’s text-to-text formulation.

### Zero-shot Named Entity Recognition (NER)

```python
text = (
    "Ông Phạm Nhật Vượng là chủ tịch của tập đoàn Vingroup. "
    "Tên ông là <extra_id_0>, thực thể tổ chức là <extra_id_1>."
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

**Expected output:**

```
<extra_id_0> ông Phạm Nhật Vượng <extra_id_1> Vingroup </s>
```

---

### Zero-shot Contextual Question Answering (QA)

```python
text = (
    "Bối cảnh: Chiến thắng Điện Biên Phủ năm 1954 là một mốc son chói lọi "
    "trong lịch sử dân tộc Việt Nam. Dưới sự chỉ huy của Đại tướng Võ Nguyên Giáp, "
    "quân và dân ta đã đập tan tập đoàn cứ điểm mạnh nhất Đông Dương của thực dân Pháp "
    "sau 56 ngày đêm chiến đấu gian khổ. "
    "Câu hỏi: Ai là người chỉ huy quân đội Việt Nam trong chiến dịch Điện Biên Phủ? "
    "Trả lời: <extra_id_0>"
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

**Expected output:**

```
<extra_id_0> Đại tướng Võ Nguyên Giáp </s>
```

## Intended Uses

This model can be used as:
1. **Base model for fine-tuning** on Vietnamese NLP tasks:
   - Text summarization
   - Question answering
   - Text classification
   - Named Entity Recognition
   - Machine translation

2. **Fill-in-the-blank** style text completion

3. **Vietnamese language understanding** tasks

## Fine-tuning Example

```python
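from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    T5ForConditionalGeneration,
    Trainer,
    TrainingArguments,
)

model = T5ForConditionalGeneration.from_pretrained("nbdaaa/t5-small-vietnamese")
tokenizer = AutoTokenizer.from_pretrained("nbdaaa/t5-small-vietnamese")

# Fine-tune on your downstream task
training_args = TrainingArguments(
    output_dir="./my-finetuned-model",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=3,
    # ... other arguments
)

trainer = Trainer(
    model=model,
    args=training_args,
    # `your_dataset` is a placeholder: supply your own tokenized dataset
    # whose examples contain `input_ids`, `attention_mask`, and `labels`.
    train_dataset=your_dataset,
    # Pads inputs and labels to a common length within each batch.
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    # ...
)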

trainer.train()
```

## Model Architecture

```
T5ForConditionalGeneration(
  (shared): Embedding(36334, 512)  # Resized for ViT5 tokenizer
  (encoder): T5Stack(
    (embed_tokens): Embedding(36334, 512)
    (block): ModuleList(6 layers)
    (final_layer_norm): T5LayerNorm()
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(36334, 512)
    (block): ModuleList(6 layers)
    (final_layer_norm): T5LayerNorm()
  )
  (lm_head): Linear(512, 36334)
)
```
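
As a rough sanity check on the listing above, the size of the resized embedding matrix can be worked out directly. In the Hugging Face T5 implementation, `shared`, `encoder.embed_tokens`, and `decoder.embed_tokens` refer to the same weight, so the matrix is counted once. The original vocabulary size of 32,128 comes from the `google-t5/t5-small` configuration and is not stated in this card; whether `lm_head` shares this weight depends on the checkpoint's `tie_word_embeddings` setting, which is also not stated here.

```python
HIDDEN_SIZE = 512        # embedding width, from the listing above
NEW_VOCAB = 36_334       # embedding rows, from the listing above
ORIGINAL_VOCAB = 32_128  # assumption: vocab_size of google-t5/t5-small


def embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


new_params = embedding_params(NEW_VOCAB)
old_params = embedding_params(ORIGINAL_VOCAB)

print(f"resized embedding matrix:  {new_params:,} parameters")
print(f"original embedding matrix: {old_params:,} parameters")
print(f"added by resizing:         {new_params - old_params:,} parameters")
```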

## Citation

If you use this model, please cite:

```bibtex
@misc{t5-small-vietnamese,
  author = {nbdaaa},
  title = {T5-Small Vietnamese: A Vietnamese-adapted T5 model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/nbdaaa/t5-small-vietnamese}
}
```

## Acknowledgments

- [Google T5](https://github.com/google-research/text-to-text-transfer-transformer) for the original T5 architecture
- [VietAI](https://github.com/vietai/ViT5) for the ViT5 Vietnamese tokenizer
- [VTSNLP](https://huggingface.co/VTSNLP) for the Vietnamese dataset