metadata
license: mit
language:
  - hi
metrics:
  - exact_match
base_model:
  - ai4bharat/IndicBERTv2-MLM-only
pipeline_tag: token-classification
library_name: transformers
tags:
  - word-grouping
  - indic-nlp
  - hindi
  - token-classification
  - local-word-groups
  - bio-tagging

WG-IndicBERT

A token classification model fine-tuned from IndicBERT v2 for Indic Word Grouping (Local Word Group identification). Developed as part of the BHASHA 2025 Shared Task 2: IndicWG.


What it does

Given an input sentence in Hindi, the model identifies Local Word Groups (LWGs) — semantically cohesive sequences of words that convey a single complete meaning (e.g., noun compounds, postpositional groups, verb groups with auxiliaries, light verb constructions).

The task is modeled as BIO token classification with three labels: B (beginning of a group), I (inside a group), O (outside / delimiter). The output is reconstructed into grouped sentences using __ as the group boundary separator.
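
The reconstruction step described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the helper names `bio_to_groups` and `render_groups` are hypothetical, an `O` word is assumed to stand alone, and a stray `I` with no open group is assumed to start a new group. The exact output layout of the shared task is not spelled out here, so both separators are explicit arguments rather than defaults.

```python
from typing import List


def bio_to_groups(words: List[str], tags: List[str]) -> List[List[str]]:
    """Collect words into groups from per-word B/I/O tags."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    groups: List[List[str]] = []
    open_group = False  # True while the last group may still accept "I" words
    for word, tag in zip(words, tags):
        if tag not in ("B", "I", "O"):
            raise ValueError(f"unknown tag: {tag!r}")
        if tag == "I" and open_group:
            groups[-1].append(word)
        elif tag == "O":
            # Assumption: an "O" word forms its own unit and closes any group.
            groups.append([word])
            open_group = False
        else:
            # "B", or a stray "I" with no open group (assumption: treat as "B").
            groups.append([word])
            open_group = True
    return groups


def render_groups(groups: List[List[str]], word_sep: str, group_sep: str) -> str:
    """Join words inside a group with word_sep and groups with group_sep."""
    return group_sep.join(word_sep.join(group) for group in groups)


# Illustrative tags only; these are not actual model predictions.
words = ["राम", "ने", "बाजार", "से", "सब्जियां", "खरीदीं", "।"]
tags = ["B", "I", "B", "I", "B", "B", "O"]
groups = bio_to_groups(words, tags)
# One possible layout (assumption): spaces inside a group, "__" between groups.
print(render_groups(groups, word_sep=" ", group_sep="__"))
```

If the official format places the separators differently, only the two arguments to `render_groups` need to change.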

Exact Match Accuracy: 52.73% on the official test set


Quick Start

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "manavdhamecha77/WG-IndicBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentence = "राम ने बाजार से सब्जियां खरीदीं।"

inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128).to(device)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

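# Label map used during training (see Training Details).
id2label = {0: "B", 1: "I", 2: "O"}

# Print one prediction per subword token, skipping special tokens.
# Only the first subword of each word was labeled during training,
# so predictions on continuation subwords may be unreliable.
special_tokens = set(tokenizer.all_special_tokens)
for token, pred in zip(tokens, predictions):
    if token not in special_tokens:
        print(f"{token:20s} {id2label[pred]}")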
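
Because only the first subword of each word carries a label during training, it is usually more useful to read one tag per word than one per subword. The sketch below shows one way to do that; `word_level_tags` is a hypothetical helper, and it assumes a word-index list of the kind a fast tokenizer returns from `word_ids()` (with `None` for special tokens).

```python
from typing import Dict, List, Optional


def word_level_tags(
    word_ids: List[Optional[int]],
    pred_ids: List[int],
    id2label: Dict[int, str],
) -> List[str]:
    """Keep the prediction of the first subword of each word."""
    tags: List[str] = []
    previous = None
    for word_id, pred in zip(word_ids, pred_ids):
        if word_id is None:
            continue  # special token such as [CLS] or [SEP]
        if word_id != previous:
            tags.append(id2label[pred])  # first subword of a new word
        previous = word_id
    return tags


# Toy input: three words, the first and third split into two subwords.
example_word_ids = [None, 0, 0, 1, 2, 2, None]
example_preds = [2, 0, 1, 1, 2, 0, 2]
print(word_level_tags(example_word_ids, example_preds, {0: "B", 1: "I", 2: "O"}))
```

With the Quick Start variables, the call would look like `word_level_tags(inputs.word_ids(), predictions, id2label)`, assuming the loaded tokenizer is a fast tokenizer. Note that the tokenizer's notion of a word may split off punctuation, so the tags may not line up with a plain whitespace split of the sentence.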

Training Details

The model is fine-tuned using AutoModelForTokenClassification with a class-weighted cross-entropy loss to address the dominant O-label imbalance. Labels are aligned to subword tokens using the tokenizer's word_ids() helper; only the first subword of each word is labeled, with subsequent subwords set to -100.
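
The alignment rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the training script itself: `align_labels` is a hypothetical name, the input is a word-index list as returned by `word_ids()`, and special tokens are also set to -100, which is common practice but not stated in this card.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    ignore_index: int = -100,
) -> List[int]:
    """Label the first subword of each word; mask everything else."""
    aligned: List[int] = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)  # special token (assumption: masked)
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword keeps the label
        else:
            aligned.append(ignore_index)  # continuation subword is masked
        previous = word_id
    return aligned


# Three words labeled B, I, O (ids 0, 1, 2); words 0 and 2 have two subwords.
print(align_labels([None, 0, 0, 1, 2, 2, None], [0, 1, 2]))
```

Positions set to -100 are skipped by PyTorch's cross-entropy loss by default, so they contribute nothing to the gradient.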

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3×10⁻⁵ |
| Batch Size | 8 (train/eval) |
| Epochs | 20 |
| Weight Decay | 0.01 |
| Label Map | B:0, I:1, O:2 |
| Hardware | H100 GPU (94 GB) |

Training data: Official BHASHA/IndicWG Hindi dataset (550 train / 100 dev / 226 test sentences), supplemented with 5K augmented sentences from a rule-based LWG finder applied to IndicCorp.
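
This card does not say how the class weights for the weighted cross-entropy loss were computed. One common choice is inverse-frequency weighting over the training labels, sketched below; the function name and the weighting formula are assumptions, not the authors' confirmed method.

```python
from collections import Counter
from typing import List


def inverse_frequency_weights(
    label_ids: List[int],
    num_labels: int = 3,
    ignore_index: int = -100,
) -> List[float]:
    """Weight each class by total / (num_labels * class_count)."""
    counts = Counter(l for l in label_ids if l != ignore_index)
    total = sum(counts.values())
    return [
        total / (num_labels * counts[c]) if counts[c] else 0.0
        for c in range(num_labels)
    ]


# Toy label sequence dominated by O (id 2); -100 marks masked positions.
print(inverse_frequency_weights([2, 2, 2, 2, 0, 1, -100]))
```

The resulting list could then be turned into a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that rarer labels contribute more to the loss.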


Evaluation

| Model | Dev EM (%) | Test EM (%) |
|---|---|---|
| MuRIL | 46.58 | 58.18 |
| XLM-Roberta | 39.00 | 53.36 |
| IndicBERT v2 (this model) | 35.40 | 52.73 |

Evaluation metric: Exact Match Accuracy — a prediction is correct only if the entire reconstructed grouped sentence matches the gold output exactly.
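
A minimal sketch of the metric as defined above, assuming predictions and gold outputs are plain strings and that leading or trailing whitespace should be ignored. The function name is hypothetical, and the official scorer may normalize text differently.

```python
from typing import List


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Percentage of sentences whose prediction equals the gold string."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # Assumption: only surrounding whitespace is stripped before comparing.
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)


# One of the two toy sentences matches exactly.
print(exact_match_accuracy(["a b__c", "x y"], ["a b__c", "x__y"]))
```

A single misplaced boundary makes the whole sentence count as wrong, which is why exact-match scores on this task sit well below typical per-token accuracy.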


Limitations

  • Trained and evaluated on Hindi only; generalization to other Indic languages is not guaranteed.
  • Performance degrades on longer sentences (>40 words).
  • The base checkpoint (ai4bharat/IndicBERTv2-MLM-only) was pretrained with a masked language modeling objective only and had no sequence-labeling supervision before this fine-tuning, which may partly explain why it trails MuRIL on this task (52.73% vs. 58.18% test EM).
  • Gold annotations contain inconsistencies in multiword expressions and light-verb constructions, which caps achievable exact-match accuracy.

Citation

@inproceedings{dhamecha2025horizonwg,
  title     = {Team Horizon at {BHASHA} Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping},
  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.bhasha-1.18/}
}