---
language:
- my
license: apache-2.0
tags:
- myanmar
- burmese
- word-segmentation
- pos-tagging
- xlm-roberta
- bilstm
- crf
- token-classification
- joint-learning
base_model: xlm-roberta-base
pipeline_tag: token-classification
datasets:
- final_analyzed_pos_xpos.conllu
metrics:
- f1
---

# XLM-RoBERTa-BiLSTM-CRF-Joint: Myanmar Word Segmentation + POS Tagging

A joint model for **Myanmar (Burmese) Word Segmentation** and **Part-of-Speech Tagging** using a custom architecture built on top of `xlm-roberta-base` with an Asymmetric BiLSTM and Dual CRF heads.

## Model Architecture

```
XLM-RoBERTa (768) + Position Embedding (64)
    ↓
Asymmetric BiLSTM
  Forward LSTM  : hidden=256
  Backward LSTM : hidden=512
  Concatenated  : 768
    ├── Head 1 – Word Segmentation CRF  (4 BIES labels: B, I, E, S)
    └── Head 2 – POS Tagging CRF        (68 labels: BIES × 17 UPOS tags)
```

**Key design choices:**
- **Position embedding** (dim=64) encodes each token's distance from the end of the sentence
- **Asymmetric BiLSTM**: forward hidden=256, backward hidden=512, giving the backward direction more capacity so the model captures stronger right-context when detecting word boundaries between Myanmar syllables
- **Dual CRF heads**: separate CRF decoders for WS and POS to enforce label sequence constraints
- **Joint loss**: the two CRF losses plus auxiliary cross-entropy (CE) terms weighted by 0.3: `loss = CRF_WS + CRF_POS + 0.3 * (CE_WS + CE_POS)`
- **WeightedRandomSampler** with 4x boost for rare POS tags (AUX, INTJ, SYM, PROPN, SCONJ, DET, X)
- Trained with **AMP (FP16 Mixed Precision)** and **DataParallel (2x Tesla T4)**
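
Two of the choices above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the training code: `distance_from_end` and `sentence_sample_weights` are hypothetical helper names, the clipping value of 127 is assumed from the 128-token maximum length, and the 4x boost is assumed to apply to any sentence containing at least one rare tag.

```python
# Rare tags listed in the design choices above.
RARE_TAGS = {"AUX", "INTJ", "SYM", "PROPN", "SCONJ", "DET", "X"}


def distance_from_end(num_tokens, max_distance=127):
    """Index each position by its distance from the last token.

    Assumption: distances are clipped to `max_distance` so they can be
    looked up in a fixed-size embedding table.
    """
    return [min(num_tokens - 1 - i, max_distance) for i in range(num_tokens)]


def sentence_sample_weights(sentences_tags, boost=4.0):
    """Return one sampling weight per sentence.

    Assumption: a sentence gets `boost` if it contains any rare tag,
    otherwise a weight of 1.0.
    """
    return [
        boost if RARE_TAGS.intersection(tags) else 1.0
        for tags in sentences_tags
    ]


positions = distance_from_end(5)
weights = sentence_sample_weights([["NOUN", "VERB"], ["PROPN", "PART"], ["INTJ"]])
print(positions)  # [4, 3, 2, 1, 0]
print(weights)    # [1.0, 4.0, 4.0]
```

Weights of this shape could then be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)`, so that sentences with rare tags are drawn more often during training.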

## Dataset

| Split | Sentences | Syllables |
|-------|-----------|----------|
| Train | 83,066    | ~2,498,960 |
| Val   | 10,383    | ~312,040   |
| Test  | 10,384    | ~313,305   |
| **Total** | **103,833** | **3,124,305** |

- **Split**: 80/10/10 (random_state=42)
- **Input granularity**: Syllable-level (word-level CoNLL-U → syllable BIES conversion)
- **POS tagset**: Universal Dependencies UPOS (17 tags)
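
The word-level to syllable-level conversion can be illustrated as follows. This is a sketch, not the preprocessing script used for the corpus: `words_to_syllable_labels` is a hypothetical helper, the `PREFIX-UPOS` label format (for example `B-NOUN`) is an assumption, and the syllables are romanized placeholders rather than real corpus data.

```python
def words_to_syllable_labels(words):
    """Convert word-level annotations to syllable-level BIES labels.

    `words` is a list of (syllables, upos) pairs, where `syllables` is the
    list of syllables in one word. Returns three parallel lists:
    syllables, WS labels, and POS labels.
    """
    syllables, ws_labels, pos_labels = [], [], []
    for word_syllables, upos in words:
        n = len(word_syllables)
        if n == 1:
            prefixes = ["S"]                             # single-syllable word
        else:
            prefixes = ["B"] + ["I"] * (n - 2) + ["E"]   # begin, inside, end
        for syllable, prefix in zip(word_syllables, prefixes):
            syllables.append(syllable)
            ws_labels.append(prefix)
            pos_labels.append(f"{prefix}-{upos}")        # assumed label format
    return syllables, ws_labels, pos_labels


# Romanized placeholder syllables, for illustration only.
example = [(["kyaung", "tha"], "NOUN"), (["thwa"], "VERB"), (["thi"], "PART")]
syls, ws, pos = words_to_syllable_labels(example)
print(syls)  # ['kyaung', 'tha', 'thwa', 'thi']
print(ws)    # ['B', 'E', 'S', 'S']
print(pos)   # ['B-NOUN', 'E-NOUN', 'S-VERB', 'S-PART']
```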

## Labels

### Word Segmentation (WS) — 4 labels
| Label | Meaning |
|-------|---------|
| B | Beginning syllable of a word |
| I | Inside syllable of a word |
| E | Ending syllable of a word |
| S | Single-syllable word |

### POS Tags — 17 UPOS tags (each with B/I/E/S prefix = 68 total labels)
`ADJ`, `ADP`, `ADV`, `AUX`, `CCONJ`, `DET`, `INTJ`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SCONJ`, `SYM`, `VERB`, `X`
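
The count of 68 follows from crossing the 4 BIES prefixes with the 17 UPOS tags. The sketch below enumerates that label space. The `PREFIX-UPOS` string format and the id ordering are assumptions made for illustration; the authoritative mapping is the one shipped in `pos_label2id.json`.

```python
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]
BIES = ["B", "I", "E", "S"]

# Cross every UPOS tag with every BIES prefix (assumed "B-NOUN" style).
pos_labels = [f"{prefix}-{tag}" for tag in UPOS_TAGS for prefix in BIES]

# Illustrative id assignment; the real ids come from pos_label2id.json.
pos_label2id = {label: i for i, label in enumerate(pos_labels)}
pos_id2label = {i: label for label, i in pos_label2id.items()}

print(len(UPOS_TAGS), len(pos_labels))  # 17 68
```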

## Training Details

| Hyperparameter | Value |
|----------------|-------|
| Base model | `xlm-roberta-base` |
| Max sequence length | 128 tokens |
| Batch size | 48 (per GPU) |
| Epochs | 15 (early stopping patience=3) |
| Optimizer | AdamW |
| Gradient clipping | max_norm=1.0 |
| Precision | FP16 (AMP) |
| Hardware | 2x Tesla T4 (DataParallel) |
| Training time | ~5h 54m |

## Evaluation Results

### Test Set Performance

| Task | Precision | Recall | F1 |
|------|-----------|--------|----|
| **Word Segmentation** | 0.9381 | 0.9451 | **0.9416** |
| **POS Tagging** | 0.8947 | 0.9097 | **0.9021** |
| **Combined (0.4*WS + 0.6*POS)** | — | — | **0.9179** |
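
The F1 and combined values in the table can be checked with a few lines of arithmetic. The helper names `f1_score` and `combined_score` are introduced here for illustration; the inputs are the rounded precision and recall values reported above, so agreement is only expected to four decimal places.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def combined_score(ws_f1, pos_f1, ws_weight=0.4, pos_weight=0.6):
    """Weighted combination of the two task scores: 0.4*WS + 0.6*POS."""
    return ws_weight * ws_f1 + pos_weight * pos_f1


ws_f1 = f1_score(0.9381, 0.9451)
pos_f1 = f1_score(0.8947, 0.9097)
combined = combined_score(0.9416, 0.9021)

print(round(ws_f1, 4))     # 0.9416
print(round(pos_f1, 4))    # 0.9021
print(round(combined, 4))  # 0.9179
```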

### Per-class POS F1 (Test Set)

| POS Tag | Precision | Recall | F1 | Support |
|---------|-----------|--------|----|---------|
| ADJ | 0.7993 | 0.8435 | 0.8208 | 4,576 |
| ADP | 0.9474 | 0.9702 | 0.9586 | 24,631 |
| ADV | 0.7989 | 0.8243 | 0.8114 | 3,114 |
| AUX | 0.5986 | 0.6687 | 0.6317 | 1,775 |
| CCONJ | 0.8528 | 0.8353 | 0.8440 | 3,600 |
| DET | 0.9112 | 0.9067 | 0.9089 | 600 |
| INTJ | 0.5915 | 0.8400 | 0.6942 | 50 |
| NOUN | 0.8727 | 0.8849 | 0.8788 | 46,862 |
| NUM | 0.9543 | 0.9685 | 0.9614 | 4,726 |
| PART | 0.9281 | 0.9327 | 0.9304 | 42,569 |
| PRON | 0.9444 | 0.9411 | 0.9428 | 4,227 |
| PROPN | 0.7295 | 0.8153 | 0.7700 | 1,819 |
| PUNCT | 0.9826 | 0.9893 | 0.9859 | 16,573 |
| SCONJ | 0.7332 | 0.8365 | 0.7815 | 1,633 |
| SYM | 0.8605 | 0.7551 | 0.8043 | 49 |
| VERB | 0.8481 | 0.8649 | 0.8564 | 28,542 |
| X | 0.5599 | 0.5470 | 0.5534 | 521 |
| **micro avg** | **0.8947** | **0.9097** | **0.9021** | 185,867 |
| **macro avg** | **0.8184** | **0.8485** | **0.8314** | 185,867 |
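
The macro-average F1 is the unweighted mean of the 17 per-class F1 scores, which is why low-support tags such as `X`, `AUX`, and `INTJ` pull it well below the micro average. A short check using the values from the table (the dictionary name `per_class_f1` is introduced here for illustration):

```python
per_class_f1 = {
    "ADJ": 0.8208, "ADP": 0.9586, "ADV": 0.8114, "AUX": 0.6317,
    "CCONJ": 0.8440, "DET": 0.9089, "INTJ": 0.6942, "NOUN": 0.8788,
    "NUM": 0.9614, "PART": 0.9304, "PRON": 0.9428, "PROPN": 0.7700,
    "PUNCT": 0.9859, "SCONJ": 0.7815, "SYM": 0.8043, "VERB": 0.8564,
    "X": 0.5534,
}

# Macro average: every tag counts equally, regardless of support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 4))  # 0.8314

# The three weakest classes, which weigh on the macro average.
weakest = sorted(per_class_f1, key=per_class_f1.get)[:3]
print(weakest)  # ['X', 'AUX', 'INTJ']
```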

## Repository Files

| File | Description |
|------|-------------|
| `best_model.pt` | Trained model weights (PyTorch state_dict) |
| `config.json` | Model configuration (label maps, num_labels, task) |
| `ws_label2id.json` | Word segmentation label → id mapping |
| `ws_id2label.json` | Word segmentation id → label mapping |
| `pos_label2id.json` | POS label → id mapping (68 labels) |
| `pos_id2label.json` | POS id → label mapping (68 labels) |
| `model_metadata.json` | Training metadata and final metrics |

## Usage

```python
import torch
import json
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Load label mappings
with open("ws_id2label.json") as f:
    ws_id2label = {int(k): v for k, v in json.load(f).items()}
with open("pos_id2label.json") as f:
    pos_id2label = {int(k): v for k, v in json.load(f).items()}

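# Rebuild model (same architecture as training)
# NOTE: `model` must be defined before the weights are loaded below;
# the class definition is not included in this snippet.
# ... (instantiate JointSegPosModel with same params)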

# Load weights
model.load_state_dict(torch.load("best_model.pt", map_location="cpu"))
model.eval()

# Inference: input is a list of Myanmar syllables
# (combining marks stay attached to their base consonant)
syllables = ["သား", "အေ", "လော", "ကျန်း"]
encoding = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=128,
    padding="max_length"
)

with torch.no_grad():
    ws_preds, pos_preds = model(encoding["input_ids"], encoding["attention_mask"])
```
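
Once predictions have been mapped to label strings and aligned to the input syllables, they can be grouped back into words. The sketch below is one way to do that, not the author's post-processing: `decode_words` is a hypothetical helper, it assumes `PREFIX-UPOS` POS labels, it takes each word's tag from the word's first syllable, and it uses romanized placeholder syllables.

```python
def decode_words(syllables, ws_labels, pos_labels):
    """Group syllables into (word, upos) pairs using BIES labels.

    A word closes on E or S. If a B or S arrives while a word is still
    open (a malformed sequence), the open word is flushed first.
    """
    words, current, current_tag = [], [], None
    for syllable, ws, pos in zip(syllables, ws_labels, pos_labels):
        if ws in ("B", "S") and current:
            words.append(("".join(current), current_tag))
            current = []
        if not current:
            # Assumption: the word's tag comes from its first syllable.
            current_tag = pos.split("-", 1)[-1]
        current.append(syllable)
        if ws in ("E", "S"):
            words.append(("".join(current), current_tag))
            current = []
    if current:  # flush a trailing unfinished word
        words.append(("".join(current), current_tag))
    return words


decoded = decode_words(
    ["kyaung", "tha", "thwa", "thi"],
    ["B", "E", "S", "S"],
    ["B-NOUN", "E-NOUN", "S-VERB", "S-PART"],
)
print(decoded)  # [('kyaungtha', 'NOUN'), ('thwa', 'VERB'), ('thi', 'PART')]
```

Because the tokenizer may split one syllable into several subword tokens, predictions generally need to be aligned back to syllables (for example via the `word_ids()` method of the fast tokenizer's encoding) before a step like this.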

## Citation

If you use this model, please cite:

```bibtex
@misc{sithu015_xlm_roberta_bilstm_crf_joint_2026,
  title  = {XLM-RoBERTa-BiLSTM-CRF-Joint: Myanmar Word Segmentation and POS Tagging},
  author = {sithu015},
  year   = {2026},
  url    = {https://huggingface.co/sithu015/XLM-RoBERTa-BiLSTM-CRF-Joint}
}
```

## License

Apache 2.0

## Acknowledgements

- Base model: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) by Facebook AI
- Dataset: Myanmar CoNLL-U corpus (103,833 sentences)
- Training platform: Kaggle (2x Tesla T4 GPU)
- Libraries: `transformers`, `pytorch-crf`, `seqeval`, `torch`