---
language: id
tags:
- text-generation
- transformer
- bahasa-indonesia
- indonesian
- slm
- from-scratch
- kbbi
license: mit
pipeline_tag: text-generation
---

# SLM Bahasa Indonesia 🇮🇩

A **Small Language Model** built entirely from scratch in PyTorch and trained on KBBI (Kamus Besar Bahasa Indonesia).

## Overview

This is a decoder-only Transformer (GPT-style) built from the ground up. It demonstrates the full pipeline: custom tokenizer → model architecture → training → inference.

### Architecture

| Component | Detail |
|---|---|
| Type | Decoder-only Transformer |
| Parameters | **840K** (~3.5 MB) |
| Embedding dim | 128 |
| Layers | 2 |
| Attention heads | 4 |
| FFN dim | 256 |
| Context length | 64 tokens |
| Vocab size | 4,000 (BPE, KBBI-trained) |

### Modern Techniques Used

- **RoPE** (Rotary Position Embedding): same as LLaMA/Qwen
- **RMSNorm**: cheaper to compute than LayerNorm, since it skips mean subtraction
- **SwiGLU** activation: same as LLaMA/Mistral
- **Weight tying**: embedding weights shared with the output head
- **Cosine LR schedule** with warmup

## Quick Start

```python
import torch

from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Load the weights and tokenizer files from the current directory
model = SmallLM.from_pretrained("./")
tokenizer = BPETokenizer.from_pretrained("./")

# Encode the prompt and add a batch dimension: shape (1, seq_len)
ids = tokenizer.encode("indonesia adalah")
input_ids = torch.tensor([ids])

# Generate text; gradients are not needed for inference
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))
```

## Training Details

- **Data**: KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian text corpus
- **Tokenizer**: Custom BPE trained on KBBI (4,000 vocab)
- **Optimizer**: AdamW (lr=1e-3, weight_decay=0.1)
- **Training**: Next-token prediction (causal language modeling)

## Limitations

This is a **proof-of-concept / educational model**:

- 840K params: it can continue sentences but has no real understanding of them
- Trained on limited data, so outputs may be incoherent
- Not suitable for production use
- Value is in the **architecture and pipeline**, not output quality

## Files

| File | Description |
|---|---|
| `model.py` | Transformer architecture (from scratch) |
| `model.safetensors` | Trained weights |
| `config.json` | Model configuration |
| `bpe_tokenizer.py` | Custom BPE tokenizer code |
| `vocab.json` | Tokenizer vocabulary |
| `merges.txt` | BPE merge rules |
| `generate.py` | Text generation script |
| `train.py` | Training script |

## What This Demonstrates

Building this project from scratch shows understanding of:

1. **Tokenization**: BPE algorithm, subword encoding
2. **Transformer architecture**: attention, FFN, normalization
3. **Modern techniques**: RoPE, RMSNorm, SwiGLU
4. **Training pipeline**: data loading, loss computation, optimization
5. **Text generation**: autoregressive decoding, sampling strategies
6. **Model deployment**: saving, loading, Hugging Face compatibility

## Author

Built by **Jekardah AI Lab** 🇮🇩

## License

MIT License
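
## Appendix: Parameter Count Sketch

The 840K figure in the Architecture table can be sanity-checked from the listed dimensions. The sketch below is a back-of-the-envelope estimate, not code from `model.py`. It assumes linear layers without biases, a SwiGLU block with three projection matrices (gate, up, down), two RMSNorm gain vectors per layer plus one final norm, and no learned parameters for RoPE. All variable names are illustrative.

```python
# Values taken from the Architecture table.
vocab_size = 4_000
d_model = 128
n_layers = 2
ffn_dim = 256

# Assumption: no bias terms anywhere; RoPE adds no learned parameters.
embedding = vocab_size * d_model      # counted once thanks to weight tying
attention = 4 * d_model * d_model     # Q, K, V and output projections
swiglu_ffn = 3 * d_model * ffn_dim    # gate, up and down projections
norms = 2 * d_model                   # two RMSNorm gain vectors per layer

per_layer = attention + swiglu_ffn + norms
final_norm = d_model                  # assumed RMSNorm before the output head
total = embedding + n_layers * per_layer + final_norm

# Size of the weights if every parameter is stored as 4-byte float32.
size_mb = total * 4 / 1e6

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
print(f"float32:   {size_mb:.2f} MB")
```

Under these assumptions the total comes to 840,320 parameters, which matches the 840K in the table, and about 3.4 MB of float32 weights, close to the listed file size. The tied embedding alone accounts for 512,000 of them, so the vocabulary size dominates a model this small.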