---
library_name: transformers
license: apache-2.0
language:
- km
base_model:
- albert/albert-base-v2
pipeline_tag: fill-mask
---

# Model Card: Khmer-ALBERT-Small

This model is a lightweight, efficient **ALBERT (A Lite BERT)** model pre-trained from scratch on a large-scale Khmer corpus. It is designed to provide high-performance natural language understanding for the Khmer language while maintaining a tiny memory footprint.

## Model Description

* **Model Type:** ALBERT (A Lite BERT)
* **Language:** Khmer (km)
* **Parameters:** 9.42 Million
* **Training Data:** 13 Million Khmer sentences
* **Base Architecture:** ALBERT v2 (Parameter sharing and factorized embedding transformations)
* **License:** Apache 2.0

## Intended Uses & Limitations

This model is a masked language model (MLM). It is highly efficient for deployment on edge devices or applications where latency is critical. It is suitable for:

* **Token Classification:** Named Entity Recognition (NER), Part-of-Speech (POS) tagging.
* **Text Classification:** Sentiment analysis, intent detection, topic categorization.
* **Feature Extraction:** Generating Khmer-specific word and sentence embeddings.
* **Language Modeling:** Filling masks and understanding Khmer syntax.
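
The 9.42M parameter figure in the model description can be sanity-checked against the configuration values listed under Technical Specifications. The sketch below is a rough count, not the author's method: the helper name `albert_mlm_param_count` is hypothetical, and it assumes `type_vocab_size=2`, a single shared layer group, no pooler, and an MLM decoder whose weight is tied to the word embeddings (none of which are stated in this card).

```python
# Back-of-the-envelope parameter count for an ALBERT masked-LM.
# Assumptions (not stated in this card): type_vocab_size=2, one shared
# layer group, no pooler, and an MLM decoder tied to the word embeddings.

def albert_mlm_param_count(
    vocab_size=16_000,
    embedding_size=128,
    hidden_size=768,
    intermediate_size=3072,
    max_position_embeddings=512,
    num_hidden_layers=12,
    type_vocab_size=2,      # assumed ALBERT default
    share_layers=True,      # cross-layer parameter sharing
):
    V, E, H, I = vocab_size, embedding_size, hidden_size, intermediate_size

    # Factorized embeddings: lookups live in the small E-dimensional space.
    embeddings = V * E + max_position_embeddings * E + type_vocab_size * E
    embeddings += 2 * E                      # LayerNorm weight + bias
    projection = E * H + H                   # linear map from E to H

    # One transformer layer: Q, K, V, output projections, FFN, two LayerNorms.
    attention = 4 * (H * H + H)
    feed_forward = (H * I + I) + (I * H + H)
    layer = attention + feed_forward + 2 * (2 * H)

    # With sharing, the same layer weights are reused at every depth.
    encoder = layer if share_layers else layer * num_hidden_layers

    # MLM head: H -> E dense, LayerNorm, output bias (decoder weight is tied).
    mlm_head = (H * E + E) + 2 * E + V

    return embeddings + projection + encoder + mlm_head


shared = albert_mlm_param_count()
unshared = albert_mlm_param_count(share_layers=False)
print(f"shared layers:   {shared:,} (~{shared / 1e6:.2f}M)")
print(f"unshared layers: {unshared:,} (~{unshared / 1e6:.2f}M)")
```

Under these assumptions the shared count comes to 9,415,680, which agrees with the ~9.42M figure quoted in this card. The same configuration without cross-layer sharing would need about 87.4M parameters, and with sharing the count does not depend on `num_hidden_layers`.
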
### How to Use

```python
import torch
import sentencepiece as spm
from transformers import AlbertForMaskedLM, AlbertTokenizer

# Load model and tokenizer
model = AlbertForMaskedLM.from_pretrained("seanghay/albert-khmer-small")
tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")

# Tokenize with the raw SentencePiece model shipped with the tokenizer
sp = spm.SentencePieceProcessor()
sp.load(tokenizer.vocab_file)

text = "ភ្នំពេញគឺជា[MASK]នៃប្រទេសកម្ពុជា។"
ids = sp.encode_as_ids(text)
input_ids = torch.LongTensor([2] + ids + [3]).unsqueeze(0)  # [CLS] + ids + [SEP]
attention_mask = torch.ones_like(input_ids)  # attend to every token (no padding)

# Perform inference
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits

# Locate the [MASK] token and extract predictions
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
mask_token_logits = logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

print(f"Original text: {text}")
print(f"Decoded input IDs (should match original text): {sp.decode_ids(input_ids.squeeze().tolist())}")
for i, token_id in enumerate(top_5_tokens):
    predicted_token = tokenizer.decode([token_id])
    print(f"{i + 1}. {text.replace('[MASK]', predicted_token)}")
```

## Training Data

The model was trained on a curated dataset of **13 Million Khmer sentences** drawn from news, social media, and web crawls, so that it covers both formal and colloquial Khmer.

## Technical Specifications

| Parameter | Value |
| --- | --- |
| `hidden_size` | 768 |
| `embedding_size` | 128 |
| `num_hidden_layers` | 12 |
| `num_attention_heads` | 12 |
| `intermediate_size` | 3072 |
| `max_position_embeddings` | 512 |
| `vocab_size` | 16,000 |

### Why ALBERT for Khmer?

By using **cross-layer parameter sharing**, this model keeps a hidden size of 768 (the same as BERT-base) with only **~9.42M parameters**.
This makes it far smaller and faster to load than a standard BERT-base model (roughly 110M parameters) while retaining BERT-base-sized hidden representations.

## Evaluation Results

*Qualitative example: the model's top-ranked predictions for the masked sentence used in the inference example.*

| Token Rank | Predicted Word | Full Sentence |
| --- | --- | --- |
| 1 | បេះដូង | ភ្នំពេញគឺជាបេះដូងនៃប្រទេសកម្ពុជា។ |
| 2 | ទឹកដី | ភ្នំពេញគឺជាទឹកដីនៃប្រទេសកម្ពុជា។ |
| 3 | រាជធានី | ភ្នំពេញគឺជារាជធានីនៃប្រទេសកម្ពុជា។ |

---

```
@misc{seanghay2024albertkhmersmall,
  author = {Seanghay Yath},
  title = {ALBERT Khmer Small: An efficient ALBERT model for the Khmer language},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/seanghay/albert-khmer-small}},
  note = {9.42M parameters, trained on 13M Khmer sentences}
}
```