---
license: mit
language:
- hi
metrics:
- exact_match
base_model:
- google/muril-base-cased
pipeline_tag: token-classification
library_name: transformers
tags:
- word-grouping
- indic-nlp
- hindi
- token-classification
- local-word-groups
- bio-tagging
---
# WG-GoogleMuril
A token classification model fine-tuned from [MuRIL](https://huggingface.co/google/muril-base-cased) for **Indic Word Grouping** (Local Word Group identification). Developed as part of the BHASHA 2025 Shared Task 2: IndicWG.
**🏆 Ranked 1st among all participating teams at BHASHA 2025.**
- **Developed by:** Manav Dhamecha, Gaurav Damor, Sunil Choudhary, Pruthwik Mishra
- **License:** MIT
- **Base model:** [google/muril-base-cased](https://huggingface.co/google/muril-base-cased)
- **Paper:** [Team Horizon at BHASHA Task 2](https://aclanthology.org/2025.bhasha-1.18/)
- **Repository:** [manavdhamecha77/IndicWG2025](https://github.com/manavdhamecha77/IndicWG2025)
- **Project page:** [Indic Word Grouping](https://manavdhamecha77.github.io/wg/)
---
## What it does
Given an input sentence in Hindi, the model identifies **Local Word Groups (LWGs)** — semantically cohesive sequences of words that convey a single complete meaning (e.g., noun compounds, postpositional groups, verb groups with auxiliaries, light verb constructions).
The task is modeled as **BIO token classification** with three labels: `B` (beginning of a group), `I` (inside a group), `O` (outside / delimiter). The output is reconstructed into grouped sentences using `__` as the group boundary separator.
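
As a rough illustration of the reconstruction step, the sketch below turns word-level BIO tags into a grouped sentence. It assumes that words inside one group are joined with `__`, that groups are separated by spaces, and that an `O` word stands alone. `group_words` is a hypothetical helper, not the code used in the repository, so the exact output format should be checked against the shared-task data.

```python
def group_words(words, tags, joiner="__"):
    """Rebuild a grouped sentence from word-level BIO tags.

    Assumption: words in the same group are joined by `joiner`,
    and groups are separated by single spaces.
    """
    groups = []
    previous_tag = None
    for word, tag in zip(words, tags):
        # "I" extends the current group only if a group is already open
        if tag == "I" and previous_tag in ("B", "I"):
            groups[-1].append(word)
        else:
            # "B", "O", or a stray "I" starts a new unit
            groups.append([word])
        previous_tag = tag
    return " ".join(joiner.join(group) for group in groups)


# Hypothetical tags for illustration; these are not actual model output
words = ["राम", "ने", "बाजार", "से", "सब्जियां", "खरीदीं"]
tags = ["B", "I", "B", "I", "B", "B"]
print(group_words(words, tags))  # राम__ने बाजार__से सब्जियां खरीदीं
```
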
**Exact Match Accuracy: 58.18% on the official test set (post-challenge refined model)**
---
## Quick Start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "manavdhamecha77/WG-GoogleMuril"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Run on GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentence = "राम ने बाजार से सब्जियां खरीदीं।"
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128).to(device)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Highest-scoring label id for each subword token
predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Label map used during training
id2label = {0: "B", 1: "I", 2: "O"}

# Print one label per subword, skipping special tokens
for token, pred in zip(tokens, predictions):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token:20s} {id2label[pred]}")
```
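
The Quick Start prints one label per subword token. Because training labels only the first subword of each word, one reasonable way to obtain word-level tags is to keep the prediction at each word's first subword. The helper below is a hypothetical sketch of that idea: `word_ids` is the list a fast tokenizer returns from `inputs.word_ids(0)`, with `None` for special tokens. Note that for a raw string the tokenizer may treat punctuation as a separate word, so passing pre-split words with `is_split_into_words=True` may be needed to match whitespace-delimited words.

```python
def word_level_tags(word_ids, predictions, id2label):
    """Keep the predicted label of the first subword of every word."""
    tags = []
    previous = None
    for word_id, pred in zip(word_ids, predictions):
        # Skip special tokens (None) and continuation subwords
        if word_id is not None and word_id != previous:
            tags.append(id2label[pred])
        previous = word_id
    return tags


# Toy input: [CLS], word 0, word 1 (two subwords), word 2, [SEP]
id2label = {0: "B", 1: "I", 2: "O"}
print(word_level_tags([None, 0, 1, 1, 2, None], [2, 0, 1, 2, 0, 2], id2label))
# ['B', 'I', 'B']
```
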
---
## Training Details
The model is fine-tuned using `AutoModelForTokenClassification` with a class-weighted cross-entropy loss to address the dominant `O`-label imbalance (inverse-frequency weights upweight `B` and `I` labels). Labels are aligned to subword tokens using the tokenizer's `word_ids()` helper; only the first subword of each word is labeled, with subsequent subwords set to `-100`.
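
The first-subword alignment described above can be sketched as follows. This is a minimal illustration rather than the training code from the repository; `align_labels` is a hypothetical name, and `word_ids` stands for the output of the tokenizer's `word_ids()` call on one example.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Expand word-level label ids to subword level.

    The first subword of each word keeps the word's label; special
    tokens and continuation subwords receive `ignore_index` so the
    loss skips them.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)           # [CLS], [SEP], padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first subword of a word
        else:
            aligned.append(ignore_index)           # later subwords of the same word
        previous = word_id
    return aligned


# Three words with labels B, I, O; words 0 and 2 split into two subwords each
print(align_labels([None, 0, 0, 1, 2, 2, None], [0, 1, 2]))
# [-100, 0, -100, 1, 2, -100, -100]
```
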
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3×10⁻⁵ |
| Batch Size | 8 (train/eval) |
| Epochs | 20 |
| Weight Decay | 0.01 |
| Label Map | B:0, I:1, O:2 |
| Hardware | H100 GPU (94GB) |
**Training data:** Official BHASHA/IndicWG Hindi dataset (550 train / 100 dev / 226 test sentences), supplemented with 5K augmented sentences from a rule-based LWG finder applied to IndicCorp.
---
## Evaluation
| Model | Dev EM (%) | Test EM (%) |
|---|---|---|
| **MuRIL (this model)** | **46.58** | **58.18** |
| XLM-Roberta | 39.00 | 53.36 |
| IndicBERT v2 | 35.40 | 52.73 |
Evaluation metric: **Exact Match Accuracy** — a prediction is correct only if the entire reconstructed grouped sentence matches the gold output exactly.
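
A minimal sketch of the metric, assuming predictions and gold outputs are lists of grouped-sentence strings compared after trimming surrounding whitespace. `exact_match_accuracy` is a hypothetical helper and may differ from the official scorer in its normalization.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of sentences whose full grouped output equals the gold output."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        pred.strip() == gold.strip()
        for pred, gold in zip(predictions, references)
    )
    return 100.0 * matches / len(references)


# One wrong boundary anywhere in a sentence makes the whole sentence count as wrong
print(exact_match_accuracy(["a__b c", "d e"], ["a__b c", "d__e"]))  # 50.0
```
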
---
## Limitations
- Trained and evaluated on Hindi only; generalization to other Indic languages is not guaranteed.
- Performance degrades on longer sentences: exact match falls to roughly 20% on sentences with more than 40 words.
- Sensitive to subword tokenization boundaries, which can cause off-by-one grouping errors.
- Gold annotations contain inconsistencies in multiword expressions and light-verb constructions, which caps achievable exact-match accuracy.
---
## Citation
```bibtex
@inproceedings{dhamecha2025horizonwg,
title = {Team Horizon at {BHASHA} Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping},
author = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
year = {2025},
url = {https://aclanthology.org/2025.bhasha-1.18/}
}
```