---
library_name: transformers
license: apache-2.0
language:
- km
base_model:
- albert/albert-base-v2
pipeline_tag: fill-mask
---

# Model Card: Khmer-ALBERT-Small

This model is a lightweight **ALBERT (A Lite BERT)** model pre-trained from scratch on a corpus of 13 million Khmer sentences. It targets natural language understanding for Khmer while keeping a small memory footprint (about 9.42 million parameters).

## Model Description

* **Model Type:** ALBERT (A Lite BERT)
* **Language:** Khmer (km)
* **Parameters:** 9.42 Million
* **Training Data:** 13 Million Khmer sentences
* **Base Architecture:** ALBERT v2 (cross-layer parameter sharing and factorized embedding parameterization)
* **License:** Apache 2.0

## Intended Uses & Limitations

This model is a masked language model (MLM). Its small size makes it a good fit for edge devices and latency-sensitive applications. It can be used directly for mask filling and feature extraction, or fine-tuned for downstream tasks. It is suitable for:

* **Token Classification:** Named Entity Recognition (NER), Part-of-Speech (POS) tagging.
* **Text Classification:** Sentiment analysis, intent detection, topic categorization.
* **Feature Extraction:** Generating Khmer-specific word and sentence embeddings.
* **Language Modeling:** Filling masks and understanding Khmer syntax.
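
For the feature-extraction use case, one common way to turn per-token vectors into a single sentence embedding is masked mean pooling. The sketch below is only an illustration of that idea, not a method documented for this model: `mean_pool` is a hypothetical helper, and the random array stands in for the encoder's hidden states (shape `batch x seq_len x 768`), which in practice would come from the model's output.

```python
import numpy as np


def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions.

    hidden_states:  (batch, seq_len, hidden_size) float array
    attention_mask: (batch, seq_len) array with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (batch, hidden_size)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                 # avoid division by zero
    return summed / counts


# Stand-in data (assumption): two "sentences" of up to 4 tokens, hidden size 768.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 768)).astype(np.float32)
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])  # second sentence has two padding positions

sentence_embeddings = mean_pool(hidden, mask)
print(sentence_embeddings.shape)
```

Padding positions are excluded from the average, so sentences of different lengths in the same batch stay comparable.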

### How to Use

```python
import torch
from transformers import AlbertForMaskedLM, AlbertTokenizer
import sentencepiece as spm

# Load model and tokenizer
model = AlbertForMaskedLM.from_pretrained("seanghay/albert-khmer-small")
tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")

# Encode with the raw SentencePiece model that ships with the tokenizer
sp = spm.SentencePieceProcessor()
sp.load(tokenizer.vocab_file)

text = "ភ្នំពេញគឺជា[MASK]នៃប្រទេសកម្ពុជា។"
ids = sp.encode_as_ids(text)
input_ids = torch.LongTensor([2] + ids + [3]).unsqueeze(0)  # [CLS] + ids + [SEP]
attention_mask = torch.ones_like(input_ids)  # no padding, so attend to every token

# Perform inference
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits

# Locate the [MASK] token and extract predictions
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
mask_token_logits = logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

print(f"Original text: {text}")
print(f"Decoded input ids (should match original text): {sp.decode_ids(input_ids.squeeze().tolist())}")
for i, token_id in enumerate(top_5_tokens):
    predicted_token = tokenizer.decode([token_id])
    print(f"{i + 1}. {text.replace('[MASK]', predicted_token)}")
```

## Training Data

The model was trained on a curated dataset of **13 Million Khmer sentences** sourced from various domains, including news, social media, and web crawls, so that it covers both formal and colloquial Khmer.

## Technical Specifications

| Parameter | Value |
| --- | --- |
| `hidden_size` | 768 |
| `embedding_size` | 128 |
| `num_hidden_layers` | 12 |
| `num_attention_heads` | 12 |
| `intermediate_size` | 3072 |
| `max_position_embeddings` | 512 |
| `vocab_size` | 16,000 |

### Why ALBERT for Khmer?

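ALBERT combines **cross-layer parameter sharing** (all 12 layers reuse one set of weights) with a factorized embedding (128-dimensional token embeddings projected up to the 768-dimensional hidden size). This lets the model keep a hidden size of 768, the same as BERT-base, with only **~9.42M parameters**, compared with roughly 110M for BERT-base. The model is therefore much smaller on disk and faster to load than standard BERT models while keeping the representational width of a base-size model.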
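
The parameter count can be reproduced from the table above with some arithmetic. This is a rough sketch that assumes the usual Hugging Face `AlbertForMaskedLM` layout: no pooler, a masked-LM decoder whose weight matrix is tied to the word embeddings, and a token-type vocabulary of 2. None of these three details is stated in this card, so treat them as assumptions.

```python
# Hyperparameters from the Technical Specifications table
vocab_size = 16_000
embedding_size = 128
hidden_size = 768
intermediate_size = 3072
max_position_embeddings = 512
num_hidden_layers = 12
type_vocab_size = 2  # assumption: ALBERT default, not listed in the table


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale plus shift of a LayerNorm."""
    return 2 * n


# Word, position, and token-type embeddings live in the small 128-dim space.
embeddings = (
    (vocab_size + max_position_embeddings + type_vocab_size) * embedding_size
    + layer_norm(embedding_size)
)
# Factorized embedding: project 128 -> 768 once before the encoder.
projection = linear(embedding_size, hidden_size)
# One transformer layer, reused for all 12 passes.
shared_layer = (
    4 * linear(hidden_size, hidden_size)        # query, key, value, attention output
    + layer_norm(hidden_size)
    + linear(hidden_size, intermediate_size)    # feed-forward up-projection
    + linear(intermediate_size, hidden_size)    # feed-forward down-projection
    + layer_norm(hidden_size)
)
# MLM head: 768 -> 128 dense, LayerNorm, and an output bias.
# The decoder weight is assumed tied to the word embeddings, so it is not counted again.
mlm_head = linear(hidden_size, embedding_size) + layer_norm(embedding_size) + vocab_size

total = embeddings + projection + shared_layer + mlm_head
unshared_total = total + (num_hidden_layers - 1) * shared_layer

print(f"with sharing:    {total:,}")           # 9,415,680
print(f"without sharing: {unshared_total:,}")  # 87,382,272
print(f"fp32 weights:    {total * 4 / 1e6:.1f} MB")
```

Under these assumptions the total comes to 9,415,680, which matches the ~9.42M figure quoted above, and the single shared layer accounts for about 7.1M of it.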

## Evaluation Results

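*As a qualitative check, the table shows the model's top three predictions for the masked sentence from the inference example ("Phnom Penh is the [MASK] of Cambodia."). The predicted words translate roughly as "heart", "territory", and "capital".*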

| Token Rank | Predicted Word | Full Sentence |
| --- | --- | --- |
| 1 | បេះដូង | ភ្នំពេញគឺជាបេះដូងនៃប្រទេសកម្ពុជា។ |
| 2 | ទឹកដី | ភ្នំពេញគឺជាទឹកដីនៃប្រទេសកម្ពុជា។ |
| 3 | រាជធានី | ភ្នំពេញគឺជារាជធានីនៃប្រទេសកម្ពុជា។ |

## Citation

```bibtex
@misc{seanghay2024albertkhmersmall,
  author = {Seanghay Yath},
  title = {ALBERT Khmer Small: An efficient ALBERT model for the Khmer language},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/seanghay/albert-khmer-small}},
  note = {9.42M parameters, trained on 13M Khmer sentences}
}
```