+---
+license: mit
+language:
+- hi
+metrics:
+- exact_match
+base_model:
+- ai4bharat/IndicBERTv2-MLM-only
+pipeline_tag: token-classification
+library_name: transformers
+tags:
+- word-grouping
+- indic-nlp
+- hindi
+- token-classification
+- local-word-groups
+- bio-tagging
+---
+# WG-IndicBERT
+A token classification model fine-tuned from [IndicBERT v2](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-only) for **Indic Word Grouping** (Local Word Group identification). It was developed for BHASHA 2025 Shared Task 2: IndicWG.
+- **Developed by:** Manav Dhamecha, Gaurav Damor, Sunil Choudhary, Pruthwik Mishra
+- **License:** MIT
+- **Base model:** [ai4bharat/IndicBERTv2-MLM-only](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-only)
+- **Paper:** [Team Horizon at BHASHA Task 2](https://aclanthology.org/2025.bhasha-1.18/)
+- **Repository:** [manavdhamecha77/IndicGEC2025](https://github.com/manavdhamecha77/IndicGEC2025)
+---
+## What it does
+Given an input sentence in Hindi, the model identifies **Local Word Groups (LWGs)** — semantically cohesive sequences of words that convey a single complete meaning (e.g., noun compounds, postpositional groups, verb groups with auxiliaries, light verb constructions).
+The task is modeled as **BIO token classification** with three labels: `B` (beginning of a group), `I` (inside a group), `O` (outside / delimiter). The output is reconstructed into grouped sentences using `__` as the group boundary separator.
+**Exact Match Accuracy: 52.73% on the official test set**
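The BIO scheme described above can be decoded into groups with a few lines of Python. This is a minimal sketch, not the authors' reconstruction code: the helper names `bio_to_groups` and `render_groups` are hypothetical, the tags in the example are hand-written rather than model output, and two decoding choices are assumptions (an `O` word becomes a group of its own, and a stray `I` with no open group starts a new one). The rendering follows the statement that `__` marks group boundaries and assumes a space between words inside a group; check the official task data for the exact output format.

```python
from typing import List


def bio_to_groups(words: List[str], tags: List[str]) -> List[List[str]]:
    """Turn word-level B/I/O tags into a list of word groups."""
    groups: List[List[str]] = []
    open_group = False  # True while the last group can still accept "I" words
    for word, tag in zip(words, tags):
        if tag == "I" and open_group:
            groups[-1].append(word)
        else:
            # "B", "O", or a stray "I" (assumption: treat it like "B")
            groups.append([word])
            open_group = tag in ("B", "I")
    return groups


def render_groups(groups: List[List[str]], word_sep: str, group_sep: str) -> str:
    """Join words inside a group with word_sep and groups with group_sep."""
    return group_sep.join(word_sep.join(group) for group in groups)


# Hand-written tags for illustration only (not produced by the model).
words = ["राम", "ने", "बाजार", "से", "सब्जियां", "खरीदीं", "।"]
tags = ["B", "I", "B", "I", "O", "O", "O"]

groups = bio_to_groups(words, tags)
grouped_sentence = render_groups(groups, word_sep=" ", group_sep="__")
print(grouped_sentence)
```

Keeping both separators as explicit arguments makes it easy to switch conventions if the official format differs from the assumption made here.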
+---
+## Quick Start
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+import torch
+# Load the fine-tuned checkpoint and its tokenizer
+model_name = "manavdhamecha77/WG-IndicBERT"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.to(device)
+model.eval()
+sentence = "राम ने बाजार से सब्जियां खरीदीं।"
+inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128).to(device)
+# Run inference without tracking gradients
+with torch.no_grad():
+    outputs = model(**inputs)
+# One predicted label id per subword token
+predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
+tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+# Label map used during training
+id2label = {0: "B", 1: "I", 2: "O"}
+# Print subword-level predictions, skipping special tokens such as [CLS] and [SEP].
+# Only the first subword of each word was labeled during training, so
+# predictions on continuation subwords should be treated with caution.
+for token, pred in zip(tokens, predictions):
+    if token not in tokenizer.all_special_tokens:
+        print(f"{token:20s} {id2label[pred]}")
+```
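Because only the first subword of each word carries a label during training, it can be more useful to read one prediction per word than one per subword. The sketch below shows one way to do that; `word_level_labels` is a hypothetical helper, and the `word_ids` and `token_preds` values are made-up stand-ins for what a tokenizer and the model would return.

```python
from typing import Dict, List, Optional


def word_level_labels(
    word_ids: List[Optional[int]],
    token_preds: List[int],
    id2label: Dict[int, str],
) -> List[str]:
    """Keep the prediction of the first subword of every word."""
    labels: List[str] = []
    seen = set()
    for word_id, pred in zip(word_ids, token_preds):
        # None marks special tokens; repeated ids are continuation subwords.
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        labels.append(id2label[pred])
    return labels


# Illustrative values: [CLS], word 0, word 1 (two subwords), word 2, [SEP]
example_word_ids = [None, 0, 1, 1, 2, None]
example_preds = [2, 0, 1, 2, 2, 2]
example_labels = word_level_labels(example_word_ids, example_preds, {0: "B", 1: "I", 2: "O"})
print(example_labels)
```

With the Quick Start variables, `inputs.word_ids(0)` can supply `word_ids` and `predictions` can supply `token_preds`; this relies on a fast tokenizer being loaded.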
+---
+## Training Details
+The model is fine-tuned using `AutoModelForTokenClassification` with a class-weighted cross-entropy loss to address the dominant `O`-label imbalance. Labels are aligned to subword tokens using the tokenizer's `word_ids()` helper; only the first subword of each word is labeled, with subsequent subwords set to `-100`.
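The alignment and weighting steps described above can be sketched as follows. This is an illustration under assumptions, not the authors' training code: `align_labels` and `inverse_frequency_weights` are hypothetical helpers, and inverse-frequency weighting is only one common scheme, since the exact weights used for this model are not stated here.

```python
from collections import Counter
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Label the first subword of each word; mask everything else."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)  # special token or continuation subword
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


def inverse_frequency_weights(label_seqs: List[List[int]], num_labels: int = 3) -> List[float]:
    """Assumed scheme: weight_c = total / (num_labels * count_c)."""
    counts = Counter(
        label for seq in label_seqs for label in seq if label != IGNORE_INDEX
    )
    total = sum(counts.values())
    return [
        total / (num_labels * counts[c]) if counts[c] else 0.0
        for c in range(num_labels)
    ]


# Made-up word ids as a fast tokenizer's word_ids() would return them.
aligned = align_labels([None, 0, 1, 1, 2, None], [0, 1, 2])
print(aligned)

# Toy label sequences in which O (id 2) dominates.
weights = inverse_frequency_weights([[0, 2, 2, 2, -100], [1, 2, 2, 2]])
print(weights)
```

The resulting list could then be passed to `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights), ignore_index=-100)` so that rarer labels contribute more to the loss.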
+| Parameter | Value |
+|---|---|
+| Optimizer | AdamW |
+| Learning Rate | 3×10⁻⁵ |
+| Batch Size | 8 (train/eval) |
+| Epochs | 20 |
+| Weight Decay | 0.01 |
+| Label Map | B:0, I:1, O:2 |
+| Hardware | H100 GPU (94 GB) |
+**Training data:** Official BHASHA/IndicWG Hindi dataset (550 train / 100 dev / 226 test sentences), supplemented with 5K augmented sentences from a rule-based LWG finder applied to IndicCorp.
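The hyperparameters in the table map directly onto keyword arguments of the `transformers` `TrainingArguments` class. The sketch below assumes the `Trainer` API was used, which is not confirmed here; the output directory name in the comment is hypothetical.

```python
# Values copied from the hyperparameter table.
training_kwargs = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "num_train_epochs": 20,
    "weight_decay": 0.01,
}

# Label map from the table, plus its inverse for decoding predictions.
label2id = {"B": 0, "I": 1, "O": 2}
id2label = {idx: label for label, idx in label2id.items()}

# Assumed usage (requires transformers; output_dir is a hypothetical path):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="wg-indicbert-out", **training_kwargs)
print(training_kwargs, id2label)
```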
+---
+## Evaluation
+| Model | Dev EM (%) | Test EM (%) |
+|---|---|---|
+| MuRIL | 46.58 | 58.18 |
+| XLM-Roberta | 39.00 | 53.36 |
+| **IndicBERT v2 (this model)** | **35.40** | **52.73** |
+Evaluation metric: **Exact Match Accuracy** — a prediction is correct only if the entire reconstructed grouped sentence matches the gold output exactly.
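As a rough illustration of the metric, the function below scores a list of predicted grouped sentences against gold ones by strict string equality. The name `exact_match_accuracy` is hypothetical, the example strings are placeholders, and any normalization applied by the official scorer (whitespace or punctuation handling, for instance) is unknown and not reproduced.

```python
from typing import List


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Percentage of sentences whose whole grouped output equals the gold string."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(pred == gold for pred, gold in zip(predictions, references))
    return 100.0 * hits / len(references)


# Placeholder strings: one grouping error makes the whole sentence count as wrong.
preds = ["a b__c", "d e__f g"]
golds = ["a b__c", "d e__f__g"]
score = exact_match_accuracy(preds, golds)
print(f"{score:.2f}")
```

A single misplaced boundary costs the full sentence, which is why exact match is much stricter than token-level accuracy.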
+---
+## Limitations
+- Trained and evaluated on Hindi only; generalization to other Indic languages is not guaranteed.
+- Performance degrades on longer sentences (>40 words).
+- The base IndicBERT v2 checkpoint was pretrained with the MLM objective only, which may partly explain why this model trails MuRIL by 5.45 points in test exact match (52.73% vs. 58.18%).
+- Gold annotations contain inconsistencies in multiword expressions and light-verb constructions, which caps achievable exact-match accuracy.
+---
+## Citation
+```bibtex
+@inproceedings{dhamecha2025horizonwg,
+  title     = {Team Horizon at {BHASHA} Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping},
+  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
+  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
+  year      = {2025},
+  url       = {https://aclanthology.org/2025.bhasha-1.18/}
+}
+```