---
language: en
license: apache-2.0
tags:
- grammar-correction
- gec
- english
- flan-t5
- coedit
datasets:
- grammarly/coedit
base_model: google/flan-t5-small
pipeline_tag: text-generation
---

# FlanT5-Small Grammar Correction

This model is [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) fine-tuned on the [grammarly/coedit](https://huggingface.co/datasets/grammarly/coedit) dataset for **English Grammar Error Correction (GEC)**.

## Training Details

- **Base model:** google/flan-t5-small (77M params)
- **Dataset:** [grammarly/coedit](https://huggingface.co/datasets/grammarly/coedit) (GEC subset, 2000 training examples)
- **Training recipe:** Based on the [CoEdIT paper](https://arxiv.org/abs/2305.09857) (EMNLP 2023)
- **Epochs:** 3
- **Learning rate:** 3e-4
- **Final training loss:** 0.27

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("xhimanshuz/flan-t5-small-grammar-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("xhimanshuz/flan-t5-small-grammar-correction")

# Prepend one of the supported instruction prefixes to the sentence.
text = "Fix the grammar: I goes to school yesterday and learn many thing."
inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: I went to school yesterday and learned many things.
```

## Supported Instructions

Prefix the input with one of these instructions from the CoEdIT format:

- `"Fix the grammar: "`
- `"Fix grammatical errors in this sentence: "`
- `"Improve the grammaticality: "`
- `"Remove all grammatical errors from this text: "`

## Example Results

| Input | Output |
|-------|--------|
| I goes to school yesterday and learn many thing. | I went to school yesterday and learned many things. |
| She don't know what are she doing. | She doesn't know what she is doing. |
| The informations was very helpfull for our researchs. | The information was very helpful for our research. |
| He have went to the market and buyed some apple. | He has gone to the market and bought some apple. |
| The childs was playing in park when it start raining. | The children were playing in the park when it started raining. |

## Training Loss Curve

| Step | Loss | Epoch |
|------|------|-------|
| 1 | 0.669 | 0.00 |
| 100 | 0.484 | 0.40 |
| 250 | 0.448 | 1.00 |
| 500 | 0.325 | 2.00 |
| 750 | 0.292 | 3.00 |

## Scaling Up

This model was trained on a 2000-example subset on CPU as a demonstration. For better performance:

1. **More data:** Train on the full 19K GEC examples from `grammarly/coedit`, or on all 69K examples (including simplification, paraphrasing, etc.)
2. **Larger model:** Use `google/flan-t5-base` (250M) or `google/flan-t5-large` (780M)
3. **GPU training:** Use A10G or A100 GPUs for faster training with larger batch sizes
4. **More epochs:** Train for 5 epochs with early stopping (CoEdIT paper recipe)

## Citation

```bibtex
@inproceedings{raheja2023coedit,
  title={CoEdIT: Text Editing by Task-Specific Instruction Tuning},
  author={Raheja, Vipul and Kumar, Dhruv and Koo, Ryan and Kang, Dongyeop},
  booktitle={EMNLP 2023},
  year={2023}
}
```
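
## Batch Correction Sketch

The single-sentence call in the Usage section can be extended to several sentences at once. The block below is a minimal sketch under stated assumptions, not part of the original training or evaluation code: `build_prompts` and `correct_batch` are hypothetical helper names, and the padding and truncation settings simply mirror the values used in the Usage example.

```python
from typing import Iterable, List

# Instruction prefixes listed under "Supported Instructions".
PREFIXES = (
    "Fix the grammar: ",
    "Fix grammatical errors in this sentence: ",
    "Improve the grammaticality: ",
    "Remove all grammatical errors from this text: ",
)

MODEL_ID = "xhimanshuz/flan-t5-small-grammar-correction"


def build_prompts(sentences: Iterable[str], prefix: str = PREFIXES[0]) -> List[str]:
    """Prepend one supported prefix to each non-empty sentence (hypothetical helper)."""
    if prefix not in PREFIXES:
        raise ValueError(f"unsupported prefix: {prefix!r}")
    return [prefix + s.strip() for s in sentences if s.strip()]


def correct_batch(sentences: Iterable[str],
                  prefix: str = PREFIXES[0],
                  max_new_tokens: int = 128) -> List[str]:
    """Correct several sentences with one generate() call (hypothetical helper)."""
    prompts = build_prompts(sentences, prefix)
    if not prompts:
        return []

    # Imported lazily so build_prompts() works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

    # Pad to the longest prompt so the batch forms a rectangular tensor.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                       truncation=True, max_length=128)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `correct_batch(["She don't know what are she doing.", "He have went to the market."])` should return one corrected string per input sentence. For repeated use, loading the tokenizer and model once outside the function would avoid reloading the checkpoint on every call.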