metadata
license: mit
language:
  - hi
metrics:
  - google_bleu
base_model:
  - google/mt5-small
pipeline_tag: text2text-generation
library_name: transformers
tags:
  - grammatical-error-correction
  - indic-nlp
  - hindi
  - gec

mt5-small-indic-gec-hindi

A Grammatical Error Correction (GEC) model for Hindi, fine-tuned from the multilingual mT5-small. It was developed as part of the BHASHA 2025 Shared Task 1: IndicGEC.


What it does

Given a grammatically incorrect Hindi sentence, the model outputs a corrected version. It targets errors in spelling, grammar (tense, person, number, gender, case), punctuation, missing or extra words, and semantics.

GLEU Score on Hindi test set: 80.44


Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "manavdhamecha77/GEC-mT5-Small-Hindi"  # Hugging Face Hub repo ID

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentences = [
    "मैं स्कूल जाती है।",
    "राम ने ने खाना खाया।",
    "वे किताबें पढ़ता है।",
]

# Prepend the task prefix used during fine-tuning (see "Input format" below).
inputs = ["correct this: " + s for s in sentences]

encoded = tokenizer(
    inputs,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128
).to(device)

outputs = model.generate(**encoded, max_length=128, num_beams=4)
corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for orig, corr in zip(sentences, corrected):
    print(f"Input:     {orig}")
    print(f"Corrected: {corr}\n")

Training Details

The model was fine-tuned using a sequence-to-sequence objective on parallel noisy–clean sentence pairs. Training data was expanded from ~599 annotated pairs to ~10k pairs using a synthetic error injection pipeline that introduces realistic errors across 10 linguistic categories.
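
To illustrate the idea of synthetic error injection, here is a minimal sketch that corrupts a clean sentence with one of three simple operations (extra word, missing word, dropped final punctuation). The function names, the choice of operations, and the seeded random generator are assumptions made for illustration; the pipeline described above covers 10 linguistic categories and is not reproduced here.

```python
import random

def duplicate_word(tokens, rng):
    """Repeat one randomly chosen token (models an 'extra word' error)."""
    i = rng.randrange(len(tokens))
    return tokens[:i + 1] + tokens[i:]

def drop_word(tokens, rng):
    """Delete one randomly chosen token (models a 'missing word' error)."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def drop_final_punctuation(tokens, rng):
    """Strip sentence-final punctuation (models a punctuation error).

    `rng` is unused; it is kept so all operations share one signature.
    """
    out = list(tokens)
    out[-1] = out[-1].rstrip("।?!.")
    return [t for t in out if t]

# Hypothetical operation set, far smaller than the 10 categories in the text.
OPERATIONS = [duplicate_word, drop_word, drop_final_punctuation]

def inject_error(clean, rng):
    """Return a (noisy, clean) training pair built from a clean sentence."""
    tokens = clean.split()
    if not tokens:
        return clean, clean
    op = rng.choice(OPERATIONS)
    return " ".join(op(tokens, rng)), clean

rng = random.Random(0)  # fixed seed so the example is repeatable
clean_sentences = ["राम ने खाना खाया।", "मैं स्कूल जाता हूँ।"]
pairs = [inject_error(s, rng) for s in clean_sentences]
for noisy, clean in pairs:
    print(f"{noisy}\t{clean}")
```

Each call yields one noisy–clean pair, so applying several operations per clean sentence is one plausible way a small annotated set could be expanded.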

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 5e-5 |
| Batch Size | 16–32 |
| Epochs | 10–15 |
| Max Sequence Length | 128 |
| Early Stopping | Based on GLEU (dev set) |
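
Early stopping on dev-set GLEU is listed above without further detail. The following is a minimal sketch of one common way to implement it, assuming a patience-based rule. The `EarlyStopper` class, `patience=3`, and the per-epoch scores are hypothetical and are not taken from the actual training run.

```python
class EarlyStopper:
    """Stop when the dev metric has not improved for `patience` evaluations."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, score):
        """Record a dev score; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy dev scores, one per epoch, invented purely to exercise the logic.
toy_dev_scores = [0.50, 0.60, 0.70, 0.68, 0.69, 0.67, 0.66]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, score in enumerate(toy_dev_scores, start=1):
    if stopper.update(epoch, score):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

In a real run, the checkpoint saved at `best_epoch` would be the one kept for evaluation.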

Input format: "correct this: <incorrect sentence>"


Evaluation

| Language | Model | GLEU |
|---|---|---|
| Hindi | mT5-small | 80.44 |
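
For readers unfamiliar with the metric, the sketch below computes a sentence-level GLEU in the Google BLEU sense (the `google_bleu` metric named in the model card metadata): the minimum of n-gram precision and recall over n-grams of order 1 to 4. Whitespace tokenization, the function names, and the example pair are assumptions for illustration. The shared task's official scorer may tokenize and aggregate differently, so this sketch is not expected to reproduce the reported 80.44.

```python
from collections import Counter

def ngram_counts(tokens, min_n=1, max_n=4):
    """Count all n-grams of order min_n..max_n in a token list."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def sentence_gleu(reference, hypothesis, min_n=1, max_n=4):
    """Google-BLEU-style score: min(n-gram precision, n-gram recall)."""
    ref = ngram_counts(reference.split(), min_n, max_n)
    hyp = ngram_counts(hypothesis.split(), min_n, max_n)
    total_ref = sum(ref.values())
    total_hyp = sum(hyp.values())
    if total_ref == 0 or total_hyp == 0:
        return 0.0
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    matches = sum((ref & hyp).values())
    return min(matches / total_hyp, matches / total_ref)

reference = "राम ने खाना खाया।"
uncorrected = "राम ने ने खाना खाया।"  # input sentence from the Quick Start

print(f"uncorrected vs reference: {sentence_gleu(reference, uncorrected):.4f}")
print(f"perfect correction:       {sentence_gleu(reference, reference):.4f}")
```

Because the score takes the minimum of precision and recall, both spurious and missing words in the model output are penalized.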

Limitations

  • Performance may degrade on heavy code-mixing, informal slang, or dialectal text.
  • Trained primarily on formal written Hindi; may not generalize to all domains.
  • Evaluation uses an automatic metric (GLEU) only; no human evaluation was conducted.

Citation

@inproceedings{dhamecha2025horizon,
  title     = {Team Horizon at {BHASHA} Task 1: Multilingual {IndicGEC} with Transformer-based Grammatical Error Correction Models},
  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.bhasha-1.14/}
}