---
license: mit
language:
- hi
metrics:
- exact_match
base_model:
- google/muril-base-cased
pipeline_tag: token-classification
library_name: transformers
tags:
- word-grouping
- indic-nlp
- hindi
- token-classification
- local-word-groups
- bio-tagging
---

# WG-GoogleMuril

A token classification model fine-tuned from [MuRIL](https://huggingface.co/google/muril-base-cased) for **Indic Word Grouping** (Local Word Group identification). Developed as part of the BHASHA 2025 Shared Task 2: IndicWG.

**🏆 Ranked 1st among all participating teams at BHASHA 2025.**

- **Developed by:** Manav Dhamecha, Gaurav Damor, Sunil Choudhary, Pruthwik Mishra
- **License:** MIT
- **Base model:** [google/muril-base-cased](https://huggingface.co/google/muril-base-cased)
- **Paper:** [Team Horizon at BHASHA Task 2](https://aclanthology.org/2025.bhasha-1.18/)
- **Repository:** [manavdhamecha77/IndicWG2025](https://github.com/manavdhamecha77/IndicWG2025)
- **GitHub.io:** [Indic Word Grouping](https://manavdhamecha77.github.io/wg/)

---

## What it does

Given an input sentence in Hindi, the model identifies **Local Word Groups (LWGs)**: semantically cohesive sequences of words that convey a single complete meaning (e.g., noun compounds, postpositional groups, verb groups with auxiliaries, light verb constructions).

The task is modeled as **BIO token classification** with three labels: `B` (beginning of a group), `I` (inside a group), `O` (outside / delimiter). The output is reconstructed into grouped sentences using `__` as the group boundary separator.

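
The reconstruction step can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `group_words` and `render_groups` are hypothetical, and the rendering convention (words inside a group joined by `__`, groups separated by a space, an `O` word kept as its own unit) is an assumption about the output format. The labels in the usage example are hand-written for illustration, not model predictions.

```python
from typing import List


def group_words(words: List[str], labels: List[str]) -> List[List[str]]:
    """Collect words into groups from word-level B/I/O labels.

    Assumptions (not confirmed by the model card):
    - "B" opens a new group, "I" extends the currently open group.
    - "O" is kept as a standalone unit and closes any open group.
    - An "I" with no open group is treated as if it were "B".
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")

    groups: List[List[str]] = []
    in_group = False
    for word, label in zip(words, labels):
        if label == "I" and in_group:
            groups[-1].append(word)  # extend the open group
        else:
            groups.append([word])  # start a new unit
            in_group = label in ("B", "I")  # "O" leaves no group open
    return groups


def render_groups(groups: List[List[str]],
                  within_sep: str = "__",
                  between_sep: str = " ") -> str:
    """Render groups as one string (separator convention is assumed)."""
    return between_sep.join(within_sep.join(group) for group in groups)


if __name__ == "__main__":
    words = ["राम", "ने", "बाजार", "से", "सब्जियां", "खरीदीं"]
    labels = ["B", "I", "B", "I", "B", "B"]  # illustrative labels only
    print(render_groups(group_words(words, labels)))
```

Returning the groups as a list of lists keeps the grouping logic independent of the separator convention, so only `render_groups` needs to change if the official output format differs.
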
**Exact Match Accuracy: 58.18% on the official test set (post-challenge refined model)**

---

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "manavdhamecha77/WG-GoogleMuril"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentence = "राम ने बाजार से सब्जियां खरीदीं।"
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128).to(device)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for every subword token
predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Label map: {0: "B", 1: "I", 2: "O"}
id2label = {0: "B", 1: "I", 2: "O"}

# Print one label per subword, skipping special tokens.
# Only the first subword of each word was labeled during training,
# so predictions on continuation subwords should be read with care.
for token, pred in zip(tokens, predictions):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token:20s} {id2label[pred]}")
```

---

## Training Details

The model is fine-tuned using `AutoModelForTokenClassification` with a class-weighted cross-entropy loss to address the dominant `O`-label imbalance (inverse-frequency weights upweight the `B` and `I` labels). Labels are aligned to subword tokens using the tokenizer's `word_ids()` helper: only the first subword of each word is labeled, and subsequent subwords are set to `-100` so that the loss ignores them.

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3×10⁻⁵ |
| Batch Size | 8 (train/eval) |
| Epochs | 20 |
| Weight Decay | 0.01 |
| Label Map | B:0, I:1, O:2 |
| Hardware | H100 GPU (94 GB) |

**Training data:** Official BHASHA/IndicWG Hindi dataset (550 train / 100 dev / 226 test sentences), supplemented with 5K augmented sentences generated by applying a rule-based LWG finder to IndicCorp.

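
The label alignment and class weighting described in this section can be sketched in plain Python. This is an illustration under stated assumptions rather than the training code: the helper names `align_labels` and `inverse_frequency_weights` are hypothetical, `word_ids` is assumed to be the list returned by a fast tokenizer's `word_ids()` (with `None` for special tokens), and the weight formula `total / (num_labels * count)` is one common inverse-frequency heuristic, not necessarily the one used for this model.

```python
from collections import Counter
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids: List[Optional[int]],
                 word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto subword positions.

    Only the first subword of each word keeps the word's label; special
    tokens (word id None) and continuation subwords get IGNORE_INDEX.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # [CLS], [SEP], padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous = word_id
    return aligned


def inverse_frequency_weights(label_ids: List[int],
                              num_labels: int = 3) -> List[float]:
    """Per-class weights proportional to 1 / frequency (assumed formula)."""
    counts = Counter(label for label in label_ids if label != IGNORE_INDEX)
    total = sum(counts.values())
    return [
        total / (num_labels * counts[i]) if counts[i] else 0.0
        for i in range(num_labels)
    ]


if __name__ == "__main__":
    # Three words split into 2, 1, and 3 subwords, wrapped in special tokens.
    word_ids = [None, 0, 0, 1, 2, 2, 2, None]
    word_labels = [0, 1, 2]  # B, I, O under the card's label map
    print(align_labels(word_ids, word_labels))
    print(inverse_frequency_weights([2, 2, 2, 2, 0, 1]))
```

The resulting weight list could be passed as the `weight` tensor of a cross-entropy loss, with `-100` as its ignore index, which matches the masking convention used for the aligned labels.
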
---

## Evaluation

| Model | Dev EM (%) | Test EM (%) |
|---|---|---|
| **MuRIL (this model)** | **46.58** | **58.18** |
| XLM-RoBERTa | 39.00 | 53.36 |
| IndicBERT v2 | 35.40 | 52.73 |

Evaluation metric: **Exact Match Accuracy**. A prediction is correct only if the entire reconstructed grouped sentence matches the gold output exactly.

---

## Limitations

- Trained and evaluated on Hindi only; generalization to other Indic languages is not guaranteed.
- Performance degrades on longer sentences (>40 words: ~20% EM).
- Sensitive to subword tokenization boundaries, which can cause off-by-one grouping errors.
- Gold annotations contain inconsistencies in multiword expressions and light-verb constructions, which caps achievable exact-match accuracy.

---

## Citation

```bibtex
@inproceedings{dhamecha2025horizonwg,
  title     = {Team Horizon at {BHASHA} Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping},
  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.bhasha-1.18/}
}
```