---
language:
- my
license: apache-2.0
tags:
- myanmar
- burmese
- word-segmentation
- pos-tagging
- xlm-roberta
- bilstm
- crf
- token-classification
- joint-learning
base_model: xlm-roberta-base
pipeline_tag: token-classification
datasets:
- final_analyzed_pos_xpos.conllu
metrics:
- f1
---

# XLM-RoBERTa-BiLSTM-CRF-Joint: Myanmar Word Segmentation + POS Tagging

A joint model for **Myanmar (Burmese) Word Segmentation** and **Part-of-Speech Tagging** using a custom architecture built on top of `xlm-roberta-base` with an Asymmetric BiLSTM and Dual CRF heads.

## Model Architecture

```
XLM-RoBERTa (768) + Position Embedding (64)
        ↓
Asymmetric BiLSTM
  Forward  LSTM : hidden=256
  Backward LSTM : hidden=512
  Concatenated  : 768
        ├── Head 1 – Word Segmentation CRF  (4 BIES labels: B, I, E, S)
        └── Head 2 – POS Tagging CRF        (68 labels: BIES × 17 UPOS tags)
```

**Key design choices:**

- **Position embedding** encodes distance-from-end-of-sentence signals (dim=64)
- **Asymmetric BiLSTM**: forward hidden=256, backward hidden=512 to capture stronger right-context for Myanmar syllable boundary detection
- **Dual CRF heads**: separate CRF decoders for WS and POS to enforce label sequence constraints
- **Joint loss**: `loss = CRF_WS + CRF_POS + 0.3 * (CE_WS + CE_POS)`
- **WeightedRandomSampler** with 4x boost for rare POS tags (AUX, INTJ, SYM, PROPN, SCONJ, DET, X)
- Trained with **AMP (FP16 Mixed Precision)** and **DataParallel (2x Tesla T4)**

## Dataset

| Split | Sentences | Syllables |
|-------|-----------|-----------|
| Train | 83,066 | ~2,498,960 |
| Val | 10,383 | ~312,040 |
| Test | 10,384 | ~313,305 |
| **Total** | **103,833** | **3,124,305** |

- **Split**: 80/10/10 (random_state=42)
- **Input granularity**: Syllable-level (word-level CoNLL-U → syllable BIES conversion)
- **POS tagset**: Universal Dependencies UPOS (17 tags)

## Labels

### Word Segmentation (WS): 4 labels

| Label | Meaning |
|-------|---------|
| B | Beginning syllable of a word |
| I | Inside syllable of a word |
| E | Ending syllable of a word |
| S | Single-syllable word |

### POS Tags: 17 UPOS tags (each with B/I/E/S prefix = 68 total labels)

`ADJ`, `ADP`, `ADV`, `AUX`, `CCONJ`, `DET`, `INTJ`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SCONJ`, `SYM`, `VERB`, `X`

## Training Details

| Hyperparameter | Value |
|----------------|-------|
| Base model | `xlm-roberta-base` |
| Max sequence length | 128 tokens |
| Batch size | 48 (per GPU) |
| Epochs | 15 (early stopping patience=3) |
| Optimizer | AdamW |
| Gradient clipping | max_norm=1.0 |
| Precision | FP16 (AMP) |
| Hardware | 2x Tesla T4 (DataParallel) |
| Training time | ~5h 54m |

## Evaluation Results

### Test Set Performance

| Task | Precision | Recall | F1 |
|------|-----------|--------|----|
| **Word Segmentation** | 0.9381 | 0.9451 | **0.9416** |
| **POS Tagging** | 0.8947 | 0.9097 | **0.9021** |
| **Combined (0.4*WS + 0.6*POS)** | - | - | **0.9179** |

### Per-class POS F1 (Test Set)

| POS Tag | Precision | Recall | F1 | Support |
|---------|-----------|--------|----|---------|
| ADJ | 0.7993 | 0.8435 | 0.8208 | 4,576 |
| ADP | 0.9474 | 0.9702 | 0.9586 | 24,631 |
| ADV | 0.7989 | 0.8243 | 0.8114 | 3,114 |
| AUX | 0.5986 | 0.6687 | 0.6317 | 1,775 |
| CCONJ | 0.8528 | 0.8353 | 0.8440 | 3,600 |
| DET | 0.9112 | 0.9067 | 0.9089 | 600 |
| INTJ | 0.5915 | 0.8400 | 0.6942 | 50 |
| NOUN | 0.8727 | 0.8849 | 0.8788 | 46,862 |
| NUM | 0.9543 | 0.9685 | 0.9614 | 4,726 |
| PART | 0.9281 | 0.9327 | 0.9304 | 42,569 |
| PRON | 0.9444 | 0.9411 | 0.9428 | 4,227 |
| PROPN | 0.7295 | 0.8153 | 0.7700 | 1,819 |
| PUNCT | 0.9826 | 0.9893 | 0.9859 | 16,573 |
| SCONJ | 0.7332 | 0.8365 | 0.7815 | 1,633 |
| SYM | 0.8605 | 0.7551 | 0.8043 | 49 |
| VERB | 0.8481 | 0.8649 | 0.8564 | 28,542 |
| X | 0.5599 | 0.5470 | 0.5534 | 521 |
| **micro avg** | **0.8947** | **0.9097** | **0.9021** | 185,867 |
| **macro avg** | **0.8184** | **0.8485** | **0.8314** | 185,867 |

## Repository Files

| File | Description |
|------|-------------|
| `best_model.pt` | Trained model weights (PyTorch state_dict) |
| `config.json` | Model configuration (label maps, num_labels, task) |
| `ws_label2id.json` | Word segmentation label → id mapping |
| `ws_id2label.json` | Word segmentation id → label mapping |
| `pos_label2id.json` | POS label → id mapping (68 labels) |
| `pos_id2label.json` | POS id → label mapping (68 labels) |
| `model_metadata.json` | Training metadata and final metrics |

## Usage

```python
import json

import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Load label mappings (JSON keys are strings, so cast them back to int ids)
with open("ws_id2label.json") as f:
    ws_id2label = {int(k): v for k, v in json.load(f).items()}
with open("pos_id2label.json") as f:
    pos_id2label = {int(k): v for k, v in json.load(f).items()}

# Rebuild model (same architecture as training)
# ... (instantiate JointSegPosModel with same params)

# Load weights
model.load_state_dict(torch.load("best_model.pt", map_location="cpu"))
model.eval()

# Inference: input is a list of Myanmar syllables
syllables = ["သ", "ာ", "း", "အေ", "လ", "ော", "ကျ", "န်း"]
encoding = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=128,
    padding="max_length",
)

with torch.no_grad():
    ws_preds, pos_preds = model(encoding["input_ids"], encoding["attention_mask"])
```

## Citation

If you use this model, please cite:

```
@misc{sithu015_xlm_roberta_bilstm_crf_joint_2026,
  title  = {XLM-RoBERTa-BiLSTM-CRF-Joint: Myanmar Word Segmentation and POS Tagging},
  author = {sithu015},
  year   = {2026},
  url    = {https://huggingface.co/sithu015/XLM-RoBERTa-BiLSTM-CRF-Joint}
}
```

## License

Apache 2.0

## Acknowledgements

- Base model: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) by Facebook AI
- Dataset: Myanmar CoNLL-U corpus (103,833 sentences)
- Training platform: Kaggle (2x Tesla T4 GPU)
- Libraries: `transformers`, `pytorch-crf`, `seqeval`, `torch`
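
## Example: Word-level to Syllable BIES Conversion

The Dataset section mentions a word-level CoNLL-U to syllable-level BIES conversion. The sketch below illustrates one way that step might look. It assumes each word has already been split into syllables (the syllable splitter is not shown), and it assumes joint POS labels are written as `PREFIX-UPOS` (for example `B-NOUN`); the actual label strings are defined in `pos_label2id.json` and may differ. The function names and the romanized placeholder syllables are hypothetical.

```python
from typing import List, Tuple

Row = Tuple[str, str, str]  # (syllable, ws_label, pos_label)


def word_to_bies(syllables: List[str], upos: str) -> List[Row]:
    """Label the syllables of one word with B/I/E/S and a joint POS label."""
    n = len(syllables)
    if n == 0:
        return []
    if n == 1:
        prefixes = ["S"]  # single-syllable word
    else:
        prefixes = ["B"] + ["I"] * (n - 2) + ["E"]
    # Assumption: joint labels use the form "<prefix>-<UPOS>".
    return [(syl, p, f"{p}-{upos}") for syl, p in zip(syllables, prefixes)]


def sentence_to_bies(words: List[Tuple[List[str], str]]) -> List[Row]:
    """Flatten a sentence of (syllables, UPOS) words into syllable-level rows."""
    rows: List[Row] = []
    for syllables, upos in words:
        rows.extend(word_to_bies(syllables, upos))
    return rows


# Hypothetical sentence with romanized placeholder syllables.
example = [
    (["kyaung", "tha", "ar"], "NOUN"),
    (["ka"], "PART"),
    (["thwar", "de"], "VERB"),
]
rows = sentence_to_bies(example)
for row in rows:
    print(row)
```

Each word contributes exactly one `B`...`E` span or a single `S`, so word boundaries can be recovered from the labels alone.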
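
## Example: Decoding BIES Predictions into Words

The Usage snippet stops at raw predictions. As a rough illustration of the remaining step, the sketch below merges syllables into tagged words from a sequence of joint POS labels. It assumes the labels have already been mapped from ids to strings and aligned one-to-one with the input syllables, and that they follow a `PREFIX-UPOS` form such as `B-NOUN` (an assumption; check `pos_id2label.json`). The helper name `decode_words` is hypothetical, and taking the tag of the first syllable for the whole word is a simplification.

```python
from typing import List, Optional, Tuple


def decode_words(syllables: List[str], pos_labels: List[str],
                 sep: str = "-") -> List[Tuple[str, Optional[str]]]:
    """Merge syllables into (word, tag) pairs using B/I/E/S prefixes."""
    words: List[Tuple[str, Optional[str]]] = []
    current: List[str] = []
    current_tag: Optional[str] = None

    for syl, label in zip(syllables, pos_labels):
        prefix, _, tag = label.partition(sep)
        # A new B or S while a word is still open means the previous word
        # never received its E; close it so no syllable is lost.
        if prefix in ("B", "S") and current:
            words.append(("".join(current), current_tag))
            current = []
        if not current:
            current_tag = tag  # the word takes the tag of its first syllable
        current.append(syl)
        if prefix in ("E", "S"):
            words.append(("".join(current), current_tag))
            current = []

    if current:  # flush a trailing unfinished word
        words.append(("".join(current), current_tag))
    return words


# Placeholder syllables and labels for illustration only.
demo = decode_words(
    ["a", "b", "c", "d", "e", "f"],
    ["B-NOUN", "I-NOUN", "E-NOUN", "S-PART", "B-VERB", "E-VERB"],
)
print(demo)
```

The decoder tolerates malformed sequences (for example an `I` with no preceding `B`) instead of raising, which may be convenient when predictions are truncated at the maximum sequence length.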
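
## Example: Reproducing the Summary Scores

The headline numbers in the Test Set Performance table can be checked by hand. The short sketch below recomputes F1 as the harmonic mean of the reported precision and recall, then applies the `0.4*WS + 0.6*POS` weighting from the table. The function names are illustrative, and because the inputs are already rounded to four decimals, agreement is only expected to that precision.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def combined_score(ws_f1: float, pos_f1: float,
                   ws_weight: float = 0.4, pos_weight: float = 0.6) -> float:
    """Weighted combination used in the results table."""
    return ws_weight * ws_f1 + pos_weight * pos_f1


ws_f1 = f1_score(0.9381, 0.9451)
pos_f1 = f1_score(0.8947, 0.9097)
combined = combined_score(0.9416, 0.9021)

print(round(ws_f1, 4), round(pos_f1, 4), round(combined, 4))
# prints: 0.9416 0.9021 0.9179
```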
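
## Example: Sampling Weights for Rare POS Tags

The design notes mention a `WeightedRandomSampler` with a 4x boost for rare POS tags. One plausible reading, sketched below, is that a training sentence receives weight 4.0 if it contains any of the listed rare tags and 1.0 otherwise. This per-sentence rule is an assumption; the original training script may compute weights differently. The helper name `sentence_weight` is hypothetical.

```python
from typing import Iterable, List

# Rare tags listed in the model card.
RARE_TAGS = {"AUX", "INTJ", "SYM", "PROPN", "SCONJ", "DET", "X"}


def sentence_weight(upos_tags: Iterable[str], boost: float = 4.0) -> float:
    """Return the sampling weight for one sentence (assumed rule)."""
    return boost if RARE_TAGS.intersection(upos_tags) else 1.0


# Toy sentences given as lists of word-level UPOS tags.
sentences: List[List[str]] = [
    ["NOUN", "PART", "VERB"],
    ["PROPN", "ADP", "VERB"],
    ["INTJ", "PUNCT"],
]
weights = [sentence_weight(tags) for tags in sentences]
print(weights)  # [1.0, 4.0, 4.0]

# With PyTorch installed, the weights could then be passed to the sampler:
#   from torch.utils.data import WeightedRandomSampler
#   sampler = WeightedRandomSampler(weights, num_samples=len(weights),
#                                   replacement=True)
```

Sampling with replacement means boosted sentences can appear several times per epoch, which raises the effective frequency of the rare tags without changing the dataset itself.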