---
language: id
tags:
- text-generation
- transformer
- bahasa-indonesia
- indonesian
- slm
- from-scratch
- kbbi
- pytorch
- educational
license: mit
pipeline_tag: text-generation
---

# SLM Bahasa Indonesia

**Small Language Model | Built from Scratch | Powered by KBBI**

[![Python](https://img.shields.io/badge/Python-3.8+-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)](https://pytorch.org)
[![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge&logo=opensourceinitiative&logoColor=white)](LICENSE)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/romizone/slm-bahasa-id)

---

*A decoder-only Transformer (GPT-style) built entirely from the ground up using PyTorch, trained on Kamus Besar Bahasa Indonesia (KBBI).*


---

## Overview

This project demonstrates the **complete pipeline** of building a language model from scratch:

```
Custom BPE Tokenizer --> Transformer Architecture --> Training --> Inference --> Deployment
```

> **Note:** This is an educational/proof-of-concept model. The value is in the **architecture and pipeline**, not output quality.

---

## Architecture
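
| Component | Detail |
|---|---|
| **Type** | Decoder-only Transformer (GPT-style) |
| **Parameters** | 840K (~3.5 MB) |
| **Embedding dim** | 128 |
| **Layers** | 2 |
| **Attention heads** | 4 |
| **FFN dim** | 256 |
| **Context length** | 64 tokens |
| **Vocab size** | 4,000 (BPE, KBBI-trained) |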
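
The parameter count can be cross-checked from the other figures with a little arithmetic. The sketch below is a rough estimate, not the code in `model.py`: it assumes bias-free linear layers, a three-matrix SwiGLU feed-forward block, two RMSNorm gain vectors per layer plus a final one, tied input/output embeddings, and no learned positional parameters (RoPE has none). `SLMConfig` and `estimate_parameters` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SLMConfig:
    """Hypothetical config mirroring the architecture figures in this README."""
    vocab_size: int = 4000
    d_model: int = 128
    n_layers: int = 2
    n_heads: int = 4
    d_ff: int = 256
    context_len: int = 64


def estimate_parameters(cfg: SLMConfig) -> int:
    """Estimate the parameter count under the assumptions stated above."""
    # Token embedding; with weight tying the output head reuses this matrix.
    embedding = cfg.vocab_size * cfg.d_model
    # Q, K, V and output projections, each d_model x d_model, no bias (assumed).
    attention = 4 * cfg.d_model * cfg.d_model
    # SwiGLU uses three matrices: gate, up and down projections (assumed).
    ffn = 3 * cfg.d_model * cfg.d_ff
    # One RMSNorm gain vector before attention and one before the FFN (assumed).
    norms = 2 * cfg.d_model
    per_layer = attention + ffn + norms
    # Final RMSNorm before the output head (assumed).
    final_norm = cfg.d_model
    return embedding + cfg.n_layers * per_layer + final_norm


cfg = SLMConfig()
total = estimate_parameters(cfg)
print(f"Head dim: {cfg.d_model // cfg.n_heads}")
print(f"Estimated parameters: {total:,}")
print(f"Raw float32 weights: {total * 4 / 1e6:.2f} MB")
```

Under these assumptions the estimate comes to 840,320 parameters, which is consistent with the 840K figure; the actual layer layout in `model.py` may differ.
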

### Modern Techniques

| Technique | Description | Used By |
|---|---|---|
| **RoPE** | Rotary Position Embedding | LLaMA, Qwen |
| **RMSNorm** | Root Mean Square Normalization | LLaMA, Gemma |
| **SwiGLU** | Gated Linear Unit with Swish | LLaMA, Mistral |
| **Weight Tying** | Shared embedding & output weights | GPT-2, LLaMA |
| **Cosine LR** | Cosine schedule with warmup | Standard practice |

---

## Quick Start (Local)

```bash
# Clone the repository
git clone https://huggingface.co/romizone/slm-bahasa-id
cd slm-bahasa-id

# Install dependencies
pip install torch safetensors
```

```python
import torch
from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Load model & tokenizer
model = SmallLM.from_pretrained("./")
tokenizer = BPETokenizer.from_pretrained("./")

# Generate text
ids = tokenizer.encode("indonesia adalah")
input_ids = torch.tensor([ids])
output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))
```

---

## Run on Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)

Create a new notebook in Google Colab, then run the following cells:

### Cell 1 - Setup & Download Model

```python
# Install dependencies
!pip install torch safetensors huggingface_hub -q

# Download the model from HuggingFace
from huggingface_hub import snapshot_download
model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
print(f"Model downloaded to: {model_dir}")
```

### Cell 2 - Load Model

```python
import sys, torch

# Make model.py and bpe_tokenizer.py from the download importable
sys.path.insert(0, model_dir)

from model import SmallLM
from bpe_tokenizer import BPETokenizer

model = SmallLM.from_pretrained(model_dir)
tokenizer = BPETokenizer.from_pretrained(model_dir)
print(f"Model loaded! Parameters: {model.count_parameters():,}")
```

### Cell 3 - Generate Text

```python
def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
    # Prompts are lowercased before encoding
    ids = tokenizer.encode(prompt.lower())
    input_ids = torch.tensor([ids])
    output = model.generate(input_ids, max_new_tokens=max_tokens,
                            temperature=temperature, top_k=top_k)
    return tokenizer.decode(output[0].tolist())

# Try a range of prompts
prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
           "ekonomi", "kebudayaan", "demokrasi", "hutan"]

for p in prompts:
    result = generate_text(p)
    print(f"Prompt: \"{p}\"")
    print(f"Output: {result[:100]}")
    print("-" * 60)
```

### Cell 4 - Interactive Mode (Optional)

```python
# Interactive: type your own prompt
while True:
    prompt = input("\nEnter a prompt (type 'quit' to exit): ")
    if prompt.lower() in ['quit', 'exit', 'q']:
        break
    result = generate_text(prompt, max_tokens=50)
    print(f"Output: {result}")
```

---

## Run on Kaggle

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/)

Create a new notebook on Kaggle, then run the following cells:

### Cell 1 - Setup & Download Model

```python
# Install huggingface_hub (torch & safetensors are already pre-installed on Kaggle)
!pip install huggingface_hub -q

# Download the model
from huggingface_hub import snapshot_download
model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
print(f"Model downloaded to: {model_dir}")
```

### Cell 2 - Load Model

```python
import sys, torch

# Make model.py and bpe_tokenizer.py from the download importable
sys.path.insert(0, model_dir)

from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Use the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model = SmallLM.from_pretrained(model_dir, device=device)
tokenizer = BPETokenizer.from_pretrained(model_dir)
print(f"Model loaded! Parameters: {model.count_parameters():,}")
```

### Cell 3 - Generate Text

```python
def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
    # Prompts are lowercased before encoding
    ids = tokenizer.encode(prompt.lower())
    # Input tensor must live on the same device as the model
    input_ids = torch.tensor([ids]).to(device)
    output = model.generate(input_ids, max_new_tokens=max_tokens,
                            temperature=temperature, top_k=top_k)
    return tokenizer.decode(output[0].tolist())

# Try a range of prompts
prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
           "ekonomi", "kebudayaan", "demokrasi", "hutan"]

for p in prompts:
    result = generate_text(p)
    print(f"Prompt: \"{p}\"")
    print(f"Output: {result[:100]}")
    print("-" * 60)
```

### Cell 4 - Retrain the Model on Kaggle (Optional)

```python
# If you want to retrain with your own data:
import shutil, os

# Copy the files to the working directory
work_dir = "/kaggle/working/slm"
os.makedirs(work_dir, exist_ok=True)
for name in os.listdir(model_dir):
    src = os.path.join(model_dir, name)
    # copy2 only handles files, so skip any subdirectories
    if os.path.isfile(src):
        shutil.copy2(src, os.path.join(work_dir, name))

os.chdir(work_dir)

# Edit train.py as needed, then:
# !python train.py
```

> **Kaggle tips:**
> - Use the **GPU P100** (free) for faster training
> - Enable the GPU: *Settings > Accelerator > GPU*
> - Kaggle comes with PyTorch pre-installed, so there is no need to reinstall it

---

## Training Details

| | Detail |
|---|---|
| **Data** | KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian corpus |
| **Tokenizer** | Custom BPE trained on KBBI (4,000 vocab) |
| **Optimizer** | AdamW (lr=1e-3, weight_decay=0.1) |
| **Objective** | Next-token prediction (causal language modeling) |
| **Gradient** | Clipping at norm 1.0 |
| **Schedule** | Cosine decay with 30-step warmup |

---

## Project Structure

```
slm-bahasa-id/
  model.py            # Transformer architecture (from scratch)
  model.safetensors   # Trained weights (~3.5 MB)
  config.json         # Model configuration
  bpe_tokenizer.py    # Custom BPE tokenizer implementation
  vocab.json          # Tokenizer vocabulary (4,000 tokens)
  merges.txt          # BPE merge rules
  tokenizer.json      # HF-compatible tokenizer config
  generate.py         # Text generation & demo script
  train.py            # Full training pipeline
  README.md           # This file
```

---

## Limitations

> This is a **proof-of-concept / educational model**:

- **840K params**: can continue sentences but doesn't "understand"
- **Limited data**: trained on KBBI definitions, so outputs may be incoherent
- **Not for production**: educational purpose only
- **Short context**: 64-token context window

---

## What This Demonstrates

Building this project from scratch demonstrates understanding of:

| # | Topic | Details |
|---|---|---|
| 1 | **Tokenization** | BPE algorithm, subword encoding, vocabulary construction |
| 2 | **Transformer** | Multi-head attention, FFN, normalization, residual connections |
| 3 | **Modern Techniques** | RoPE, RMSNorm, SwiGLU (the same techniques used in production LLMs) |
| 4 | **Training Pipeline** | Data loading, loss computation, gradient clipping, LR scheduling |
| 5 | **Text Generation** | Autoregressive decoding, top-k, top-p, temperature sampling |
| 6 | **Deployment** | Model serialization, HuggingFace Hub integration |

---

## Contributing

Contributions are welcome! Feel free to:

- Open issues for bugs or feature requests
- Submit pull requests with improvements
- Share your experiments and results

---

## Author

Built by **Jekardah AI Lab**

---

## License

This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.