---
language: en
license: apache-2.0
tags:
- grammar-correction
- gec
- english
- flan-t5
- coedit
datasets:
- grammarly/coedit
base_model: google/flan-t5-small
pipeline_tag: text2text-generation
---

# FlanT5-Small Grammar Correction

A fine-tune of [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) on the [grammarly/coedit](https://huggingface.co/datasets/grammarly/coedit) dataset for **English grammatical error correction (GEC)**.

## Training Details

- **Base model:** google/flan-t5-small (77M params)
- **Dataset:** [grammarly/coedit](https://huggingface.co/datasets/grammarly/coedit) (GEC subset, 2000 training examples)
- **Training recipe:** Based on the [CoEdIT paper](https://arxiv.org/abs/2305.09857) (EMNLP 2023)
- **Epochs:** 3
- **Learning rate:** 3e-4
- **Final training loss:** 0.27

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("xhimanshuz/flan-t5-small-grammar-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("xhimanshuz/flan-t5-small-grammar-correction")

text = "Fix the grammar: I goes to school yesterday and learn many thing."
inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: I went to school yesterday and learned many things.
```

## Supported Instructions

Use instruction prefixes from the CoEdIT format:

- `"Fix the grammar: <text>"`
- `"Fix grammatical errors in this sentence: <text>"`
- `"Improve the grammaticality: <text>"`
- `"Remove all grammatical errors from this text: <text>"`
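The prefixes above can be applied programmatically. The following is a minimal sketch, not part of the original training code: `INSTRUCTIONS`, `build_prompt`, and `correct_batch` are hypothetical helper names, and `model` and `tokenizer` are assumed to be the objects loaded in the Usage section.

```python
# Instruction prefixes listed in this model card.
INSTRUCTIONS = (
    "Fix the grammar",
    "Fix grammatical errors in this sentence",
    "Improve the grammaticality",
    "Remove all grammatical errors from this text",
)


def build_prompt(text: str, instruction: str = INSTRUCTIONS[0]) -> str:
    """Prepend one of the supported instruction prefixes to the input text."""
    if instruction not in INSTRUCTIONS:
        raise ValueError(f"Unsupported instruction: {instruction!r}")
    return f"{instruction}: {text.strip()}"


def correct_batch(texts, model, tokenizer, instruction=INSTRUCTIONS[0]):
    """Correct several sentences in one generate() call.

    `model` and `tokenizer` are assumed to be loaded as in the Usage section.
    """
    prompts = [build_prompt(t, instruction) for t in texts]
    # Pad to the longest prompt so the batch forms a rectangular tensor.
    inputs = tokenizer(
        prompts, return_tensors="pt", padding=True, truncation=True, max_length=128
    )
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Rejecting unknown prefixes is a design choice of this sketch; the model may still respond to other phrasings, but only the four listed here are documented.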

## Example Results

| Input | Output |
|-------|--------|
| I goes to school yesterday and learn many thing. | I went to school yesterday and learned many things. |
| She don't know what are she doing. | She doesn't know what she is doing. |
| The informations was very helpfull for our researchs. | The information was very helpful for our research. |
| He have went to the market and buyed some apple. | He has gone to the market and bought some apple. |
| The childs was playing in park when it start raining. | The children were playing in the park when it started raining. |

## Training Loss Curve

| Step | Loss | Epoch |
|------|------|-------|
| 1 | 0.669 | 0.00 |
| 100 | 0.484 | 0.40 |
| 250 | 0.448 | 1.00 |
| 500 | 0.325 | 2.00 |
| 750 | 0.292 | 3.00 |
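The step and epoch columns imply a batch size that the card does not state directly. The arithmetic below is a rough check, assuming each logged step is one optimizer update and every one of the 2000 training examples is seen once per epoch; the resulting batch size is an inference, not a documented setting.

```python
# Figures taken from the Training Details section and the table above.
train_examples = 2000
total_steps = 750
epochs = 3

steps_per_epoch = total_steps // epochs                    # 250
effective_batch_size = train_examples // steps_per_epoch   # 8 (inferred)

# Cross-check against the table: step 100 should fall at epoch 0.40.
epoch_at_step_100 = 100 / steps_per_epoch

print(steps_per_epoch, effective_batch_size, epoch_at_step_100)
```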

## Scaling Up

This model was trained on a 2000-example subset on CPU as a demonstration. For better performance:

1. **More data:** Train on the full 19K GEC examples from `grammarly/coedit`, or all 69K examples (including simplification, paraphrasing, etc.)
2. **Larger model:** Use `google/flan-t5-base` (250M) or `google/flan-t5-large` (770M)
3. **GPU training:** Use A10G or A100 GPUs for faster training with larger batch sizes
4. **More epochs:** Train for 5 epochs with early stopping (CoEdIT paper recipe)
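As a starting point for item 1, the GEC rows could be selected along these lines. This is a sketch under stated assumptions: it assumes each row of `grammarly/coedit` carries a `task` field whose value is `"gec"` for grammar examples, and `select_gec` and `load_gec_subset` are hypothetical helper names rather than the code used to train this model.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def select_gec(rows: Iterable[dict], limit: Optional[int] = None) -> Iterator[dict]:
    """Yield rows whose `task` field is "gec", stopping after `limit` rows.

    The field name and value are assumptions about the dataset schema.
    """
    gec_rows = (row for row in rows if row.get("task") == "gec")
    return islice(gec_rows, limit)  # limit=None keeps every GEC row


def load_gec_subset(limit: Optional[int] = None) -> list:
    """Download the train split and keep only GEC rows (needs `datasets` and network)."""
    from datasets import load_dataset  # lazy import so select_gec works offline

    train = load_dataset("grammarly/coedit", split="train")
    return list(select_gec(train, limit))
```

Calling `load_gec_subset(2000)` would give a subset of the size used here, and `load_gec_subset()` would keep all GEC rows.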

## Citation

```bibtex
@inproceedings{raheja2023coedit,
  title={CoEdIT: Text Editing by Task-Specific Instruction Tuning},
  author={Raheja, Vipul and Kumar, Dhruv and Koo, Ryan and Kang, Dongyeop},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
  year={2023}
}
```