metadata
language: id
tags:
- text-generation
- transformer
- bahasa-indonesia
- indonesian
- slm
- from-scratch
- kbbi
license: mit
pipeline_tag: text-generation
SLM Bahasa Indonesia 🇮🇩
A Small Language Model built entirely from scratch using PyTorch and trained on KBBI (Kamus Besar Bahasa Indonesia, the standard comprehensive dictionary of the Indonesian language).
Overview
This is a decoder-only Transformer (GPT-style) built from the ground up, demonstrating the full pipeline: custom tokenizer → model architecture → training → inference.
Architecture
| Component | Detail |
|---|---|
| Type | Decoder-only Transformer |
| Parameters | 840K (~3.5 MB) |
| Embedding dim | 128 |
| Layers | 2 |
| Attention heads | 4 |
| FFN dim | 256 |
| Context length | 64 tokens |
| Vocab size | 4,000 (BPE, KBBI-trained) |
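The 840K figure can be roughly reproduced from the other rows of the table. The sketch below is a back-of-the-envelope count, not a readout of the actual checkpoint: it assumes no bias terms, two RMSNorm layers per block plus a final one, and a three-matrix SwiGLU feed-forward layer. None of these details are confirmed by the source.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumptions (not confirmed by the source): no bias terms, one RMSNorm
# before attention and one before the FFN, plus a final RMSNorm.
vocab_size, d_model, n_layers, d_ffn = 4_000, 128, 2, 256

embedding = vocab_size * d_model      # shared with the output head (weight tying)
attention = 4 * d_model * d_model     # Q, K, V and output projections
swiglu_ffn = 3 * d_model * d_ffn      # gate, up and down projections
norms_per_layer = 2 * d_model         # RMSNorm carries a gain vector only

per_layer = attention + swiglu_ffn + norms_per_layer
total = embedding + n_layers * per_layer + d_model  # + final RMSNorm

print(f"{total:,} parameters")
print(f"{total * 4 / 1e6:.2f} MB in float32")
```

Under these assumptions the count lands at roughly 840K, and about 61% of it is the embedding table. RoPE adds no learned parameters, and the number of attention heads does not change the total.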
Modern Techniques Used
- RoPE (Rotary Position Embedding): same as LLaMA/Qwen
- RMSNorm: cheaper than LayerNorm because it skips mean-centering
- SwiGLU activation: same as LLaMA/Mistral
- Weight tying: embedding weights shared with the output head
- Cosine LR schedule with warmup
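To make the RoPE item concrete, here is a minimal sketch in plain Python (the model itself uses PyTorch). It rotates consecutive pairs of vector components by a position-dependent angle and checks the property that makes RoPE useful: the query-key dot product depends only on the distance between the two positions. The function names, the interleaved pairing, and the base of 10000 are assumptions for illustration, not taken from `model.py`.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by angles that grow with position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]  # 2D rotation of the pair
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors (head dimension 4 for readability)
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3 positions apart) at two different absolute positions
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 25), rope(k, 22))
print(abs(score_near - score_far) < 1e-9)
```

Because rotations preserve vector length, RoPE changes only the angle between queries and keys, and it needs no learned position table.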
Quick Start
```python
import torch
from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Load the weights and tokenizer files from the current directory
model = SmallLM.from_pretrained("./")
tokenizer = BPETokenizer.from_pretrained("./")

# Generate text (the prompt means "Indonesia is")
ids = tokenizer.encode("indonesia adalah")
input_ids = torch.tensor([ids])  # add a batch dimension
output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))
```
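The `temperature=0.8` argument controls how sharply sampling favors the most likely token. The source does not show how `generate` implements this, so the following is only a sketch of the usual approach (divide the logits by the temperature, apply softmax, then sample), written in plain Python with hypothetical function names and made-up logits.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=0.8, rng=None):
    """Draw one token index from the temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                         # made-up scores for three tokens
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.8))
print(sample_next_token(logits, 0.8, random.Random(0)))
```

Lower temperatures move probability mass toward the top-scoring token, which tends to make output more repetitive but less erratic.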
Training Details
- Data: KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian text corpus
- Tokenizer: Custom BPE trained on KBBI (4,000 vocab)
- Optimizer: AdamW (lr=1e-3, weight_decay=0.1)
- Training: Next-token prediction (causal language modeling)
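The learning rate of 1e-3 is presumably the peak of the cosine schedule with warmup listed among the techniques. The sketch below shows one common form of that schedule. The warmup length, total step count, and floor learning rate are hypothetical, since the source does not state them, and `train.py` may compute the schedule differently.

```python
import math

def lr_at(step, peak_lr=1e-3, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step):.6f}")
```

Warmup keeps early updates small while AdamW's moment estimates are still noisy, and the cosine decay lowers the step size gradually toward the end of training.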
Limitations
This is a proof-of-concept / educational model:
- 840K params: can continue sentences but doesn't "understand"
- Trained on limited data: outputs may be incoherent
- Not suitable for production use
- Value is in the architecture and pipeline, not output quality
Files
| File | Description |
|---|---|
| `model.py` | Transformer architecture (from scratch) |
| `model.safetensors` | Trained weights |
| `config.json` | Model configuration |
| `bpe_tokenizer.py` | Custom BPE tokenizer code |
| `vocab.json` | Tokenizer vocabulary |
| `merges.txt` | BPE merge rules |
| `generate.py` | Text generation script |
| `train.py` | Training script |
What This Demonstrates
Building this project from scratch shows understanding of:
- Tokenization: BPE algorithm, subword encoding
- Transformer architecture: attention, FFN, normalization
- Modern techniques: RoPE, RMSNorm, SwiGLU
- Training pipeline: data loading, loss computation, optimization
- Text generation: autoregressive decoding, sampling strategies
- Model deployment: saving, loading, Hugging Face compatibility
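As an illustration of the tokenization item, the core of BPE training is a loop that counts adjacent symbol pairs and merges the most frequent one. The sketch below runs three merges on a tiny made-up word list. It is not the code in `bpe_tokenizer.py`, and the function names and toy frequencies are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to that word's frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: "makan" (eat), "makanan" (food), "minum" (drink)
words = {tuple("makan"): 5, tuple("makanan"): 3, tuple("minum"): 2}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
    merges.append(pair)

print(merges)
print(words)
```

A real run repeats this loop until the vocabulary reaches its target size (4,000 here) and writes the ordered merge list to a file such as `merges.txt`, which the encoder then replays on new text.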
Author
Built by Jekardah AI Lab 🇮🇩
License
MIT License