Commit a090a8c by manavdhamecha77 · verified · 1 Parent(s): ba29779

Upload README.md

Files changed (1)
  1. README.md +125 -0
README.md ADDED
@@ -0,0 +1,125 @@
1
+ ---
2
+ license: mit
3
+ language:
4
+ - hi
5
+ metrics:
6
+ - exact_match
7
+ base_model:
8
+ - ai4bharat/IndicBERTv2-MLM-only
9
+ pipeline_tag: token-classification
10
+ library_name: transformers
11
+ tags:
12
+ - word-grouping
13
+ - indic-nlp
14
+ - hindi
15
+ - token-classification
16
+ - local-word-groups
17
+ - bio-tagging
18
+ ---
19
+
20
+ # WG-IndicBERT
21
+
22
+ A token classification model fine-tuned from [IndicBERT v2](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-only) for **Indic Word Grouping** (Local Word Group identification). It was developed for the BHASHA 2025 Shared Task 2: IndicWG.
23
+
24
+ - **Developed by:** Manav Dhamecha, Gaurav Damor, Sunil Choudhary, Pruthwik Mishra
25
+ - **License:** MIT
26
+ - **Base model:** [ai4bharat/IndicBERTv2-MLM-only](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-only)
27
+ - **Paper:** [Team Horizon at BHASHA Task 2](https://aclanthology.org/2025.bhasha-1.18/)
28
+ - **Repository:** [manavdhamecha77/IndicGEC2025](https://github.com/manavdhamecha77/IndicGEC2025)
29
+
30
+ ---
31
+
32
+ ## What it does
33
+
34
+ Given an input sentence in Hindi, the model identifies **Local Word Groups (LWGs)** — semantically cohesive sequences of words that convey a single complete meaning (e.g., noun compounds, postpositional groups, verb groups with auxiliaries, light verb constructions).
35
+
36
+ The task is modeled as **BIO token classification** with three labels: `B` (beginning of a group), `I` (inside a group), `O` (outside / delimiter). The output is reconstructed into grouped sentences using `__` as the group boundary separator.
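The reconstruction step described above could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `group_words` is hypothetical, and it assumes that words inside one group are joined with `__`, that groups are separated by a single space, and that an `O` word stands on its own. The README does not spell out these details, so check them against the shared-task data format before relying on the sketch.

```python
def group_words(words, labels, joiner="__"):
    """Rebuild a grouped sentence from word-level B/I/O labels.

    Assumptions (not confirmed by the README): words in the same group are
    joined with `joiner`, groups are separated by one space, and a word
    labeled `O` forms a group of its own.
    """
    groups = []
    open_group = False  # True while the last group can still accept `I` words
    for word, label in zip(words, labels):
        if label == "I" and open_group:
            groups[-1].append(word)      # continue the current group
        else:
            groups.append([word])        # B, O, or a stray I starts a new group
            open_group = label != "O"    # an O word never absorbs a following I
    return " ".join(joiner.join(group) for group in groups)


# Hypothetical labels, for illustration only (not model output).
example_words = ["w1", "w2", "w3", "w4"]
example_labels = ["B", "I", "O", "B"]
grouped = group_words(example_words, example_labels)
```

A stray `I` that does not follow an open group is treated as the start of a new group here, which is one common way to repair invalid BIO sequences.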
37
+
38
+ **Exact Match Accuracy: 52.73% on the official test set**
39
+
40
+ ---
41
+
42
+ ## Quick Start
43
+
44
+ ```python
45
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
46
+ import torch
47
+
48
+ model_name = "manavdhamecha77/WG-IndicBERT"
49
+
50
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
51
+ model = AutoModelForTokenClassification.from_pretrained(model_name)
52
+
53
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
54
+ model.to(device)
55
+
56
+ sentence = "राम ने बाजार से सब्जियां खरीदीं।"
57
+
58
+ inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128).to(device)
59
+
60
+ with torch.no_grad():
61
+     outputs = model(**inputs)
62
+
63
+ predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
64
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
65
+
66
+ # Label map: {0: "B", 1: "I", 2: "O"}
67
+ id2label = {0: "B", 1: "I", 2: "O"}
68
+
69
+ for token, pred in zip(tokens, predictions):
70
+     if token not in ["[CLS]", "[SEP]", "[PAD]"]:
71
+         print(f"{token:20s} {id2label[pred]}")
72
+ ```
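The snippet above prints one label per subword token. Because only the first subword of each word receives a label during training (see Training Details), it may be more useful to keep just the first-subword prediction for each word. The helper below is a sketch under that assumption. `word_level_labels` is a hypothetical name, and `word_ids` is expected to be the list returned by `inputs.word_ids()`, which is available when the tokenizer is a fast tokenizer.

```python
def word_level_labels(word_ids, predictions, id2label):
    """Keep the prediction of the first subword of every word.

    `word_ids` has one entry per token: None for special tokens, otherwise
    the index of the word that the token belongs to.
    """
    labels = []
    previous = None
    for word_id, pred in zip(word_ids, predictions):
        if word_id is not None and word_id != previous:
            labels.append(id2label[pred])  # first subword of a new word
        previous = word_id
    return labels


# Hand-written toy input: [CLS], a two-subword word, a one-subword word, [SEP].
toy_word_ids = [None, 0, 0, 1, None]
toy_predictions = [2, 0, 1, 1, 2]
toy_labels = word_level_labels(toy_word_ids, toy_predictions, {0: "B", 1: "I", 2: "O"})
```

With the Quick Start variables, the call would look like `word_level_labels(inputs.word_ids(), predictions, id2label)`.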
73
+
74
+ ---
75
+
76
+ ## Training Details
77
+
78
+ The model is fine-tuned using `AutoModelForTokenClassification` with a class-weighted cross-entropy loss to address the dominant `O`-label imbalance. Labels are aligned to subword tokens using the tokenizer's `word_ids()` helper; only the first subword of each word is labeled, with subsequent subwords set to `-100`.
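The label alignment and class weighting described above can be illustrated with a small, dependency-free sketch. The function names are hypothetical, and the inverse-frequency weighting scheme is an assumption: the README states that the loss is class-weighted but not how the weights were computed.

```python
from collections import Counter

IGNORE_INDEX = -100  # positions with this label are skipped by the loss
LABEL2ID = {"B": 0, "I": 1, "O": 2}  # label map from the README


def align_labels(word_ids, word_labels, label2id=LABEL2ID):
    """Label only the first subword of each word; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)                    # special token
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first subword
        else:
            aligned.append(IGNORE_INDEX)                    # later subword
        previous = word_id
    return aligned


def inverse_frequency_weights(label_ids, num_labels=3):
    """Assumed weighting scheme: total / (num_labels * count) per class."""
    counts = Counter(i for i in label_ids if i != IGNORE_INDEX)
    total = sum(counts.values())
    return [
        total / (num_labels * counts[i]) if counts[i] else 0.0
        for i in range(num_labels)
    ]


# Toy input shaped like the output of the tokenizer's word_ids() helper.
toy_aligned = align_labels([None, 0, 0, 1, 2, 2, None], ["B", "I", "O"])
toy_weights = inverse_frequency_weights([0, 1, 1, 2, 2, 2, 2, IGNORE_INDEX])
```

In a PyTorch training loop, weights like these would typically be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights), ignore_index=-100)`, so that masked subwords do not contribute to the loss.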
79
+
80
+ | Parameter | Value |
81
+ |---|---|
82
+ | Optimizer | AdamW |
83
+ | Learning Rate | 3×10⁻⁵ |
84
+ | Batch Size | 8 (train/eval) |
85
+ | Epochs | 20 |
86
+ | Weight Decay | 0.01 |
87
+ | Label Map | B:0, I:1, O:2 |
88
+ | Hardware | H100 GPU (94GB) |
89
+
90
+ **Training data:** Official BHASHA/IndicWG Hindi dataset (550 train / 100 dev / 226 test sentences), supplemented with 5K augmented sentences from a rule-based LWG finder applied to IndicCorp.
91
+
92
+ ---
93
+
94
+ ## Evaluation
95
+
96
+ | Model | Dev EM (%) | Test EM (%) |
97
+ |---|---|---|
98
+ | MuRIL | 46.58 | 58.18 |
99
+ | XLM-Roberta | 39.00 | 53.36 |
100
+ | **IndicBERT v2 (this model)** | **35.40** | **52.73** |
101
+
102
+ Evaluation metric: **Exact Match Accuracy** — a prediction is correct only if the entire reconstructed grouped sentence matches the gold output exactly.
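As a rough illustration of the metric, the sketch below counts a sentence as correct only when the predicted grouped string equals the gold string. The function name is hypothetical, and stripping leading and trailing whitespace before comparison is an assumption; the official scorer may normalize differently.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of sentences whose grouped output equals the gold output."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        pred.strip() == gold.strip()  # assumed normalization: outer whitespace only
        for pred, gold in zip(predictions, references)
    )
    return 100.0 * matches / len(references)


# Toy strings, for illustration only: the second prediction misses one group.
toy_score = exact_match_accuracy(["a__b c", "d e"], ["a__b c", "d__e"])
```

Because a single wrong boundary makes the whole sentence count as incorrect, this metric is much stricter than token-level accuracy.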
103
+
104
+ ---
105
+
106
+ ## Limitations
107
+
108
+ - Trained and evaluated on Hindi only; generalization to other Indic languages is not guaranteed.
109
+ - Performance degrades on longer sentences (>40 words).
110
+ - The base checkpoint (IndicBERTv2-MLM-only) was pretrained with masked language modeling only, which may partly explain why it trails MuRIL on this task (52.73% vs. 58.18% test EM).
111
+ - Gold annotations contain inconsistencies in multiword expressions and light-verb constructions, which caps achievable exact-match accuracy.
112
+
113
+ ---
114
+
115
+ ## Citation
116
+
117
+ ```bibtex
118
+ @inproceedings{dhamecha2025horizonwg,
119
+ title = {Team Horizon at {BHASHA} Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping},
120
+ author = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
121
+ booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
122
+ year = {2025},
123
+ url = {https://aclanthology.org/2025.bhasha-1.18/}
124
+ }
125
+ ```