---
license: mit
language:
- hi
metrics:
- exact_match
base_model:
- google/muril-base-cased
pipeline_tag: token-classification
library_name: transformers
tags:
- word-grouping
- indic-nlp
- hindi
- token-classification
- local-word-groups
- bio-tagging
---

# WG-GoogleMuril

A token classification model fine-tuned from [MuRIL](https://huggingface.co/google/muril-base-cased) for **Indic Word Grouping** (Local Word Group identification). Developed as part of the BHASHA 2025 Shared Task 2: IndicWG.

**🏆 Ranked 1st among all participating teams at BHASHA 2025.**

- **Developed by:** Manav Dhamecha, Gaurav Damor, Sunil Choudhary, Pruthwik Mishra
- **License:** MIT
- **Base model:** [google/muril-base-cased](https://huggingface.co/google/muril-base-cased)
- **Paper:** [Team Horizon at BHASHA Task 2](https://aclanthology.org/2025.bhasha-1.18/)
- **Repository:** [manavdhamecha77/IndicWG2025](https://github.com/manavdhamecha77/IndicWG2025)
- **GitHub.io:** [Indic Word Grouping](https://manavdhamecha77.github.io/wg/)


---

## What it does

Given an input sentence in Hindi, the model identifies **Local Word Groups (LWGs)** — semantically cohesive sequences of words that convey a single complete meaning (e.g., noun compounds, postpositional groups, verb groups with auxiliaries, light verb constructions).

The task is modeled as **BIO token classification** with three labels: `B` (beginning of a group), `I` (inside a group), `O` (outside / delimiter). The output is reconstructed into grouped sentences using `__` as the group boundary separator.
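The reconstruction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `group_words` and `render` are hypothetical helper names, the labels in the example are made up rather than model output, and the rendering convention (words inside a group joined by `__`, units separated by a space) is an assumption that should be checked against the task data.

```python
from typing import List


def group_words(words: List[str], labels: List[str]) -> List[List[str]]:
    """Collect words into units from word-level B/I/O labels."""
    groups: List[List[str]] = []
    in_group = False  # True while the last unit was opened by "B" (or a stray "I")
    for word, label in zip(words, labels):
        if label == "I" and in_group:
            groups[-1].append(word)      # continue the current group
        else:
            groups.append([word])        # "B", "O", or a stray "I" opens a new unit
            in_group = label != "O"      # an "O" word cannot be continued
    return groups


def render(groups: List[List[str]], within: str = "__", between: str = " ") -> str:
    """Join words inside each unit with `within` and units with `between` (assumed format)."""
    return between.join(within.join(group) for group in groups)


words = ["राम", "ने", "बाजार", "से", "सब्जियां", "खरीदीं"]
labels = ["B", "I", "B", "I", "B", "B"]  # hypothetical labels for illustration only
print(render(group_words(words, labels)))
```

Keeping the separators as parameters makes it easy to switch conventions if the official format differs from the one assumed here.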

**Exact Match Accuracy: 58.18% on the official test set (post-challenge refined model)**

---

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "manavdhamecha77/WG-GoogleMuril"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentence = "राम ने बाजार से सब्जियां खरीदीं।"

inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128).to(device)

with torch.no_grad():
    outputs = model(**inputs)

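predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
word_ids = inputs.word_ids(0)  # None for special tokens; needs a fast tokenizer

# Label map used during training
id2label = {0: "B", 1: "I", 2: "O"}

# Only the first subword of each word was labeled during training, so read the
# prediction at that position and skip continuation subwords and special tokens.
previous_word = None
for token, pred, word_id in zip(tokens, predictions, word_ids):
    if word_id is not None and word_id != previous_word:
        print(f"{token:20s} {id2label[pred]}")
    previous_word = word_id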
```

---

## Training Details

The model is fine-tuned using `AutoModelForTokenClassification` with a class-weighted cross-entropy loss to counter the label imbalance, in which `O` dominates (inverse-frequency weights upweight the `B` and `I` labels). Labels are aligned to subword tokens using the tokenizer's `word_ids()` helper: only the first subword of each word carries the word's label, and the remaining subwords are set to `-100` so that the loss ignores them.

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3×10⁻⁵ |
| Batch Size | 8 (train/eval) |
| Epochs | 20 |
| Weight Decay | 0.01 |
| Label Map | B:0, I:1, O:2 |
| Hardware | H100 GPU (94GB) |

**Training data:** Official BHASHA/IndicWG Hindi dataset (550 train / 100 dev / 226 test sentences), supplemented with 5K augmented sentences from a rule-based LWG finder applied to IndicCorp.

---

## Evaluation

| Model | Dev EM (%) | Test EM (%) |
|---|---|---|
| **MuRIL (this model)** | **46.58** | **58.18** |
| XLM-Roberta | 39.00 | 53.36 |
| IndicBERT v2 | 35.40 | 52.73 |

Evaluation metric: **Exact Match Accuracy** — a prediction is correct only if the entire reconstructed grouped sentence matches the gold output exactly.
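The metric can be computed with a few lines of Python. The sketch below is not the official scorer: `exact_match_accuracy` is a hypothetical helper, the example strings are placeholders, and stripping surrounding whitespace before comparison is an assumption.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of sentences whose predicted grouping equals the gold string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)


predicted = ["a__b c", "d e__f"]   # placeholder grouped sentences
gold = ["a__b c", "d__e f"]
print(exact_match_accuracy(predicted, gold))  # 50.0
```

Because a single misplaced boundary makes the whole sentence count as wrong, this metric is much stricter than token-level accuracy.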

---

## Limitations

- Trained and evaluated on Hindi only; generalization to other Indic languages is not guaranteed.
- Performance degrades on longer sentences: exact match drops to roughly 20% on sentences of more than 40 words.
- Sensitive to subword tokenization boundaries, which can cause off-by-one grouping errors.
- Gold annotations contain inconsistencies in multiword expressions and light-verb constructions, which caps achievable exact-match accuracy.

---

## Citation

```bibtex
@inproceedings{dhamecha2025horizonwg,
  title     = {Team Horizon at {BHASHA} Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping},
  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.bhasha-1.18/}
}
```