manavdhamecha77 committed
Commit 63927e4 · verified · 1 Parent(s): a0f57a8

Upload README.md

Files changed (1)
  1. README.md +121 -0
README.md ADDED
@@ -0,0 +1,121 @@
+ ---
+ license: mit
+ language:
+ - hi
+ metrics:
+ - google_bleu
+ base_model:
+ - google/mt5-small
+ pipeline_tag: text2text-generation
+ library_name: transformers
+ tags:
+ - grammatical-error-correction
+ - indic-nlp
+ - hindi
+ - gec
+ ---
+
+ # mt5-small-indic-gec-hindi
+
+ A Grammatical Error Correction (GEC) model for **Hindi**, fine-tuned from the multilingual [mT5-small](https://huggingface.co/google/mt5-small) checkpoint. Developed as part of the BHASHA 2025 Shared Task 1: IndicGEC.
+
+ - **Developed by:** Manav Dhamecha, Gaurav Damor, Sunil Choudhary, Pruthwik Mishra
+ - **License:** MIT
+ - **Base model:** [google/mt5-small](https://huggingface.co/google/mt5-small)
+ - **Paper:** [Team Horizon at BHASHA Task 1](https://aclanthology.org/2025.bhasha-1.14/)
+ - **Repository:** [manavdhamecha77/IndicGEC2025](https://github.com/manavdhamecha77/IndicGEC2025)
+
+ ---
+
+ ## What it does
+
+ Given a grammatically incorrect Hindi sentence, the model outputs a corrected version. It targets spelling errors, grammatical errors (tense, person, number, gender, case), punctuation errors, missing or extra words, and semantic errors.
+
+ **GLEU score on the Hindi test set: 80.44**
+
+ ---
+
+ ## Quick Start
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+ import torch
+
+ # Hub repository ID; change it if the checkpoint is hosted under a different name.
+ model_name = "manavdhamecha77/mt5-small-indic-gec-hindi"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+
+ # Run on GPU when one is available, otherwise fall back to CPU.
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+
+ # Example inputs, each containing a grammatical error.
+ sentences = [
+     "मैं स्कूल जाती है।",
+     "राम ने ने खाना खाया।",
+     "वे किताबें पढ़ता है।",
+ ]
+
+ # Prepend the "correct this: " prefix used during fine-tuning.
+ inputs = ["correct this: " + s for s in sentences]
+
+ encoded = tokenizer(
+     inputs,
+     return_tensors="pt",
+     padding=True,
+     truncation=True,
+     max_length=128,
+ ).to(device)
+
+ # Decode with beam search (4 beams).
+ outputs = model.generate(**encoded, max_length=128, num_beams=4)
+ corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+
+ for orig, corr in zip(sentences, corrected):
+     print(f"Input: {orig}")
+     print(f"Corrected: {corr}\n")
+ ```
+
+ ---
+
+ ## Training Details
+
+ The model was fine-tuned with a sequence-to-sequence objective on parallel noisy–clean sentence pairs. The training data was expanded from ~599 annotated pairs to ~10k pairs with a synthetic error injection pipeline that introduces realistic errors across 10 linguistic categories.
+
+ | Parameter | Value |
+ |---|---|
+ | Optimizer | AdamW |
+ | Learning Rate | 5e-5 |
+ | Batch Size | 16–32 |
+ | Epochs | 10–15 |
+ | Max Sequence Length | 128 |
+ | Early Stopping | Based on GLEU (dev set) |
+
+ Input format: `"correct this: <incorrect sentence>"`
+
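To make the idea of synthetic error injection concrete, here is a minimal sketch of how noisy–clean pairs could be generated from clean sentences. It is not the authors' pipeline: the three operations (`duplicate_word`, `drop_word`, `swap_adjacent`) and the helpers `make_pair` and `augment` are hypothetical stand-ins for the 10 linguistic categories mentioned above, which this card does not spell out. Only the `"correct this: "` prefix comes from the card itself.

```python
import random

PREFIX = "correct this: "  # input prefix stated in this model card


def duplicate_word(tokens, rng):
    """Repeat one randomly chosen token (an 'extra word' error)."""
    i = rng.randrange(len(tokens))
    return tokens[:i + 1] + tokens[i:]


def drop_word(tokens, rng):
    """Delete one randomly chosen token (a 'missing word' error)."""
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]


def swap_adjacent(tokens, rng):
    """Swap two neighbouring tokens (a word-order error)."""
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


# Hypothetical operation set; a real pipeline would cover more error types.
OPERATIONS = [duplicate_word, drop_word, swap_adjacent]


def make_pair(clean, rng):
    """Return one (model_input, target) pair built from a clean sentence."""
    tokens = clean.split()
    if len(tokens) < 2:
        # Too short to corrupt safely; keep the sentence unchanged.
        return PREFIX + clean, clean
    operation = rng.choice(OPERATIONS)
    noisy = " ".join(operation(tokens, rng))
    return PREFIX + noisy, clean


def augment(clean_sentences, copies, seed=0):
    """Create `copies` corrupted variants of every clean sentence."""
    rng = random.Random(seed)
    return [make_pair(s, rng) for s in clean_sentences for _ in range(copies)]


if __name__ == "__main__":
    for source, target in augment(["राम ने खाना खाया।"], copies=3, seed=1):
        print(source, "->", target)
```

Seeding the generator keeps the augmented set reproducible across runs, which makes it easier to compare models trained on the same synthetic data.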
+ ---
+
+ ## Evaluation
+
+ | Language | Model | GLEU |
+ |---|---|---|
+ | Hindi | mT5-small | **80.44** |
+
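For readers unfamiliar with the metric, the sketch below computes a sentence-level score in the style of `google_bleu`, the metric named in this card's metadata: the minimum of n-gram precision and recall over 1- to 4-grams. It assumes whitespace tokenization and a single reference, and it returns a value in [0, 1] rather than on the 0–100 scale used in the table. Whether the shared task's official scorer matches this formulation exactly is an assumption; the helper names `ngram_counts` and `sentence_gleu` are illustrative.

```python
from collections import Counter


def ngram_counts(tokens, min_n=1, max_n=4):
    """Count every n-gram of order min_n..max_n in a token list."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(reference, hypothesis, min_n=1, max_n=4):
    """Return min(precision, recall) over the n-grams of both sentences."""
    ref = ngram_counts(reference.split(), min_n, max_n)
    hyp = ngram_counts(hypothesis.split(), min_n, max_n)
    overlap = sum((ref & hyp).values())  # clipped n-gram matches
    # Dividing by the larger total yields the smaller of precision and recall.
    denominator = max(sum(ref.values()), sum(hyp.values()))
    return overlap / denominator if denominator else 0.0


if __name__ == "__main__":
    reference = "राम ने खाना खाया।"
    hypothesis = "राम ने ने खाना खाया।"
    print(f"GLEU: {sentence_gleu(reference, hypothesis):.4f}")
```

A corpus-level score is usually obtained by summing the match and total counts over all sentences before dividing, not by averaging per-sentence scores.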
+ ---
+
+ ## Limitations
+
+ - Performance may degrade on heavily code-mixed text, informal slang, or dialectal text.
+ - The model was trained primarily on formal written Hindi and may not generalize to all domains.
+ - Evaluation relies on an automatic metric (GLEU) only; no human evaluation was conducted.
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{dhamecha2025horizon,
+   title = {Team Horizon at {BHASHA} Task 1: Multilingual {IndicGEC} with Transformer-based Grammatical Error Correction Models},
+   author = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
+   booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
+   year = {2025},
+   url = {https://aclanthology.org/2025.bhasha-1.14/}
+ }
+ ```