---
language: id
tags:
  - text-generation
  - transformer
  - bahasa-indonesia
  - indonesian
  - slm
  - from-scratch
  - kbbi
license: mit
pipeline_tag: text-generation
---

# SLM Bahasa Indonesia 🇮🇩

A **Small Language Model** built entirely from scratch in PyTorch and trained on KBBI (Kamus Besar Bahasa Indonesia, the official dictionary of the Indonesian language).

## Overview

This is a decoder-only Transformer (GPT-style). The repository covers the full pipeline: custom tokenizer → model architecture → training → inference.

### Architecture

| Component | Detail |
|---|---|
| Type | Decoder-only Transformer |
| Parameters | **840K** (~3.5 MB) |
| Embedding dim | 128 |
| Layers | 2 |
| Attention heads | 4 |
| FFN dim | 256 |
| Context length | 64 tokens |
| Vocab size | 4,000 (BPE, KBBI-trained) |
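
The 840K figure can be reproduced from the table with a little arithmetic. The sketch below is only a sanity check under assumptions this card does not state: no bias terms, a SwiGLU FFN with three weight matrices, two RMSNorm layers per block plus a final norm, and an output head that reuses the embedding matrix. The actual layout in `model.py` may differ.

```python
# Hyperparameters taken from the Architecture table.
vocab_size, d_model, n_layers, d_ffn = 4_000, 128, 2, 256

# Assumptions (not stated in this card): no bias terms, a SwiGLU FFN with
# three weight matrices, two RMSNorm layers per block plus one final norm,
# and an output head that reuses the embedding matrix (weight tying).
embedding = vocab_size * d_model     # token embeddings, shared with the head
attention = 4 * d_model * d_model    # Q, K, V and output projections
ffn = 3 * d_model * d_ffn            # gate, up and down projections
norms = 2 * d_model                  # one RMSNorm gain vector per sublayer
per_layer = attention + ffn + norms

total = embedding + n_layers * per_layer + d_model  # + final RMSNorm gain
size_mb = total * 4 / 1e6            # float32 uses 4 bytes per parameter

print(f"{total:,} parameters, about {size_mb:.2f} MB in float32")
```

Under these assumptions the token embedding alone accounts for 512,000 of the parameters, which is why weight tying matters at this scale: an untied output head would add another 512,000.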

### Modern Techniques Used
- **RoPE** (Rotary Position Embedding), as used in LLaMA and Qwen
- **RMSNorm**, which is cheaper than LayerNorm because it skips mean subtraction and the bias term
- **SwiGLU** activation, as used in LLaMA and Mistral
- **Weight tying**: the embedding weights are shared with the output head
- **Cosine LR schedule** with warmup
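
To make two of these techniques concrete, here is a minimal pure-Python sketch of RMSNorm and RoPE on plain lists. It is not the code in `model.py`: the function names, the `eps` and `base` defaults, and the interleaved pairing of dimensions are assumptions chosen for illustration (some implementations pair the first and second halves of the vector instead).

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x by its root-mean-square; unlike LayerNorm, no mean is subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def rope(x, position, base=10000.0):
    """Rotate each pair (x[2k], x[2k+1]) by the angle position * base**(-2k/dim)."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        cos, sin = math.cos(angle), math.sin(angle)
        out += [x[i] * cos - x[i + 1] * sin, x[i] * sin + x[i + 1] * cos]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Made-up query and key vectors for a single 4-dimensional head.
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# The attention score depends only on the distance between positions (2 here),
# not on the absolute positions themselves.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 42), rope(k, 40))
print(abs(score_near - score_far) < 1e-9)

# After RMSNorm with unit gain, the mean of the squared entries is close to 1.
normed = rms_norm(q, gain=[1.0] * len(q))
print(sum(v * v for v in normed) / len(normed))
```

Because RoPE is a pure rotation, it adds no learned parameters and leaves vector norms unchanged; position information enters only through the relative angle between query and key.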

## Quick Start

```python
import torch
from model import SmallLM
from bpe_tokenizer import BPETokenizer

model = SmallLM.from_pretrained("./")
tokenizer = BPETokenizer.from_pretrained("./")

# Generate text
ids = tokenizer.encode("indonesia adalah")
input_ids = torch.tensor([ids])
output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))
```
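
How `model.generate` applies `temperature` is not shown in this card. A common approach, sketched here with the standard library only, is to divide the logits by the temperature before the softmax and then sample from the result. The helper names and the example logits are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token id from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                    # made-up scores for a 3-token vocabulary
for t in (0.5, 0.8, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

# A seeded generator makes the draw reproducible.
print(sample_token(logits, temperature=0.8, rng=random.Random(0)))
```

Values below 1.0, such as the 0.8 used above, concentrate probability on the highest-scoring tokens; values above 1.0 flatten the distribution and make output more varied.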

## Training Details

- **Data**: KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian text corpus
- **Tokenizer**: Custom BPE trained on KBBI (4,000 vocab)
- **Optimizer**: AdamW (lr=1e-3, weight_decay=0.1)
- **Training**: Next-token prediction (causal language modeling)
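
The learning-rate schedule that goes with this optimizer (cosine decay with warmup) can be sketched as a plain function of the step count. The peak value of 1e-3 comes from the list above; `warmup_steps`, `total_steps`, and `min_lr` are placeholder assumptions, since the card does not give them, and `train.py` may implement the schedule differently.

```python
import math

def lr_at(step, peak_lr=1e-3, warmup_steps=100, total_steps=1_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 550, 1_000):
    print(step, f"{lr_at(step):.6f}")
```

In a PyTorch training loop, a function like this would typically be called once per step to set the `lr` of each optimizer parameter group.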

## Limitations

This is a **proof-of-concept / educational model**:
- 840K params: it can continue sentences but does not "understand" them
- Trained on limited data, so outputs may be incoherent
- Not suitable for production use
- Value is in the **architecture and pipeline**, not output quality

## Files

| File | Description |
|---|---|
| `model.py` | Transformer architecture (from scratch) |
| `model.safetensors` | Trained weights |
| `config.json` | Model configuration |
| `bpe_tokenizer.py` | Custom BPE tokenizer code |
| `vocab.json` | Tokenizer vocabulary |
| `merges.txt` | BPE merge rules |
| `generate.py` | Text generation script |
| `train.py` | Training script |

## What This Demonstrates

Building this project from scratch shows understanding of:
1. **Tokenization**: BPE algorithm, subword encoding
2. **Transformer architecture**: attention, FFN, normalization
3. **Modern techniques**: RoPE, RMSNorm, SwiGLU
4. **Training pipeline**: data loading, loss computation, optimization
5. **Text generation**: autoregressive decoding, sampling strategies
6. **Model deployment**: saving, loading, Hugging Face compatibility
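
As an illustration of the first item, the core of BPE training fits in a few lines: count adjacent symbol pairs, merge the most frequent pair, and repeat. This is a textbook sketch and not the code in `bpe_tokenizer.py`; the function name, the tie-breaking rule, and the toy word list are assumptions.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often its word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges, corpus

words = ["makan", "makanan", "minum", "minuman", "makan"]
merges, corpus = learn_bpe_merges(words, num_merges=4)
print(merges)
print(list(corpus))
```

On this toy list the first merge is `('a', 'n')`, the most frequent pair, which mirrors how frequent Indonesian fragments such as the suffix `-an` can end up as single tokens in a vocabulary learned this way.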

## Author

Built by **Jekardah AI Lab** 🇮🇩

## License

MIT License