---
language: id
tags:
- text-generation
- transformer
- bahasa-indonesia
- indonesian
- slm
- from-scratch
- kbbi
- pytorch
- educational
license: mit
pipeline_tag: text-generation
---

# SLM Bahasa Indonesia

**Small Language Model | Built from Scratch | Powered by KBBI**

[![Python](https://img.shields.io/badge/Python-3.8+-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)](https://pytorch.org)
[![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge&logo=opensourceinitiative&logoColor=white)](LICENSE)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/romizone/slm-bahasa-id)

---

*A decoder-only Transformer (GPT-style) built from the ground up in PyTorch and trained on the Kamus Besar Bahasa Indonesia (KBBI), the official dictionary of the Indonesian language.*

---

## Overview

This project demonstrates the **complete pipeline** of building a language model from scratch:

```
Custom BPE Tokenizer --> Transformer Architecture --> Training --> Inference --> Deployment
```

> **Note:** This is an educational/proof-of-concept model. The value is in the **architecture and pipeline**, not output quality.

---

## Architecture

| Component | Detail |
|---|---|
| Type | Decoder-only Transformer (GPT-style) |
| Parameters | 840K (~3.5 MB) |
| Embedding dim | 128 |
| Layers | 2 |
| Attention heads | 4 |
| FFN dim | 256 |
| Context length | 64 tokens |
| Vocab size | 4,000 (BPE, KBBI-trained) |
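The 840K figure can be reproduced from the table with a short back-of-the-envelope count. The sketch below is an estimate under stated assumptions, not a readout of `model.py`: it assumes bias-free linear layers, a SwiGLU feed-forward block with three projection matrices, one weight vector per RMSNorm (two per layer plus a final one), and an output head that shares the embedding matrix.

```python
# Rough parameter count from the architecture table.
# Assumptions (not confirmed by model.py): no bias terms, SwiGLU uses three
# projection matrices, each RMSNorm has one weight vector, and the output
# head reuses the embedding matrix (weight tying), so it adds nothing.
vocab_size, d_model, n_layers, d_ffn = 4000, 128, 2, 256

embedding = vocab_size * d_model      # token embeddings, reused as output head
attention = 4 * d_model * d_model     # Q, K, V and output projections
ffn = 3 * d_model * d_ffn             # gate, up and down projections
norms = 2 * d_model                   # pre-attention and pre-FFN RMSNorm

per_layer = attention + ffn + norms
total = embedding + n_layers * per_layer + d_model  # + final RMSNorm

print(f"per layer: {per_layer:,}")    # per layer: 164,096
print(f"total:     {total:,}")        # total:     840,320
print(f"fp32 size: {total * 4 / 1e6:.2f} MB")
```

Under these assumptions the embedding matrix alone accounts for 512,000 of the parameters, which is why weight tying matters at this scale. RoPE contributes no learned parameters.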

### Modern Techniques

| Technique | Description | Used By |
|---|---|---|
| **RoPE** | Rotary Position Embedding | LLaMA, Qwen |
| **RMSNorm** | Root Mean Square Normalization | LLaMA, Gemma |
| **SwiGLU** | Gated Linear Unit with Swish | LLaMA, Mistral |
| **Weight Tying** | Shared embedding & output weights | GPT-2, LLaMA |
| **Cosine LR** | Cosine schedule with warmup | Standard practice |
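To make two of the techniques in the table concrete, here is a dependency-free sketch of RMSNorm and RoPE on plain Python lists. It is illustrative only and is not taken from `model.py`: the function names, the `eps` and `base` defaults, and the choice to rotate consecutive element pairs (some implementations pair the two halves of the vector instead) are all assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-element weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of x by an angle proportional to the position."""
    out = []
    for i in range(0, len(x), 2):
        # Pair i/2 uses frequency base^(-i/d), so later pairs rotate more slowly.
        theta = position * base ** (-i / len(x))
        cos, sin = math.cos(theta), math.sin(theta)
        out += [x[i] * cos - x[i + 1] * sin, x[i] * sin + x[i + 1] * cos]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the relative offset (3 in both cases),
# not on the absolute positions of the query and key.
s1 = dot(rope(q, 5), rope(k, 2))
s2 = dot(rope(q, 13), rope(k, 10))
print(abs(s1 - s2) < 1e-9)

# After RMSNorm with unit weights, the root mean square is close to 1.0.
normed = rms_norm(q, [1.0] * len(q))
print(math.sqrt(sum(v * v for v in normed) / len(normed)))
```

The first check shows why RoPE needs no learned position table: shifting both positions by the same amount leaves the query-key score unchanged.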

---

## Quick Start (Local)

```bash
# Clone the repository
git clone https://huggingface.co/romizone/slm-bahasa-id
cd slm-bahasa-id

# Install dependencies
pip install torch safetensors
```

```python
import torch
from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Load model & tokenizer
model = SmallLM.from_pretrained("./")
tokenizer = BPETokenizer.from_pretrained("./")

# Generate text
ids = tokenizer.encode("indonesia adalah")
input_ids = torch.tensor([ids])
output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))
```

---

## Run on Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)

Create a new notebook in Google Colab, then run the following cells:

### Cell 1 - Setup & Download Model

```python
# Install dependencies
!pip install torch safetensors huggingface_hub -q

# Download the model from Hugging Face
from huggingface_hub import snapshot_download
model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
print(f"Model downloaded to: {model_dir}")
```

### Cell 2 - Load Model

```python
import sys, torch
sys.path.insert(0, model_dir)

from model import SmallLM
from bpe_tokenizer import BPETokenizer

model = SmallLM.from_pretrained(model_dir)
tokenizer = BPETokenizer.from_pretrained(model_dir)
print(f"Model loaded! Parameters: {model.count_parameters():,}")
```

### Cell 3 - Generate Text

```python
def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
    ids = tokenizer.encode(prompt.lower())
    input_ids = torch.tensor([ids])
    output = model.generate(input_ids, max_new_tokens=max_tokens,
                            temperature=temperature, top_k=top_k)
    return tokenizer.decode(output[0].tolist())

# Try various prompts
prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
           "ekonomi", "kebudayaan", "demokrasi", "hutan"]

for p in prompts:
    result = generate_text(p)
    print(f"Prompt: \"{p}\"")
    print(f"Output: {result[:100]}")
    print("-" * 60)
```

### Cell 4 - Interactive Mode (Optional)

```python
# Interactive: type your own prompt
while True:
    prompt = input("\nEnter a prompt (type 'quit' to exit): ")
    if prompt.lower() in ['quit', 'exit', 'q']:
        break
    result = generate_text(prompt, max_tokens=50)
    print(f"Output: {result}")
```
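The `temperature` and `top_k` arguments passed to `model.generate` control how the next token is drawn from the model's output scores. The following is a minimal, dependency-free sketch of that idea, assuming a plain list of logits; `sample_next_token` is a hypothetical helper for illustration and is not the code inside `SmallLM.generate`.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=40, rng=random):
    """Sample a token id from raw logits using temperature and top-k filtering."""
    # Keep only the top_k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]

logits = [2.0, 0.5, -1.0, 3.0, 0.0]
random.seed(0)
picks = [sample_next_token(logits, temperature=0.8, top_k=2) for _ in range(1000)]
# With top_k=2, only the two highest-scoring ids (3 and 0) can ever be drawn.
print(sorted(set(picks)))
```

Setting `top_k=1` reduces this to greedy decoding, while a larger `top_k` with a higher temperature gives more varied output.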

---

## Run on Kaggle

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/)

Create a new notebook on Kaggle, then run the following cells:

### Cell 1 - Setup & Download Model

```python
# Install huggingface_hub (torch & safetensors are pre-installed on Kaggle)
!pip install huggingface_hub -q

# Download the model
from huggingface_hub import snapshot_download
model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
print(f"Model downloaded to: {model_dir}")
```

### Cell 2 - Load Model

```python
import sys, torch
sys.path.insert(0, model_dir)

from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Use the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model = SmallLM.from_pretrained(model_dir, device=device)
tokenizer = BPETokenizer.from_pretrained(model_dir)
print(f"Model loaded! Parameters: {model.count_parameters():,}")
```

### Cell 3 - Generate Text

```python
def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
    ids = tokenizer.encode(prompt.lower())
    input_ids = torch.tensor([ids]).to(device)
    output = model.generate(input_ids, max_new_tokens=max_tokens,
                            temperature=temperature, top_k=top_k)
    return tokenizer.decode(output[0].tolist())

# Try various prompts
prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
           "ekonomi", "kebudayaan", "demokrasi", "hutan"]

for p in prompts:
    result = generate_text(p)
    print(f"Prompt: \"{p}\"")
    print(f"Output: {result[:100]}")
    print("-" * 60)
```

### Cell 4 - Retrain the Model on Kaggle (Optional)

```python
# To retrain with your own data:
import shutil, os

# Copy the files to the working directory
work_dir = "/kaggle/working/slm"
os.makedirs(work_dir, exist_ok=True)
for f in os.listdir(model_dir):
    src = os.path.join(model_dir, f)
    if os.path.isfile(src):  # skip any subdirectories
        shutil.copy2(src, os.path.join(work_dir, f))

os.chdir(work_dir)

# Edit train.py as needed, then run:
# !python train.py
```

> **Kaggle tips:**
> - Use the **GPU P100** (free) for faster training
> - Enable the GPU: *Settings > Accelerator > GPU*
> - Kaggle comes with PyTorch pre-installed, so there is no need to reinstall it

---

## Training Details

| Setting | Detail |
|---|---|
| **Data** | KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian corpus |
| **Tokenizer** | Custom BPE trained on KBBI (4,000 vocab) |
| **Optimizer** | AdamW (lr=1e-3, weight_decay=0.1) |
| **Objective** | Next-token prediction (causal language modeling) |
| **Gradient clipping** | Max norm 1.0 |
| **Schedule** | Cosine decay with 30-step warmup |
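The schedule row can be read as a function from training step to learning rate. Below is a minimal sketch using the peak rate (1e-3) and warmup length (30 steps) from the table. The linear warmup shape, the decay floor `min_lr=0.0`, and the `total_steps` value are assumptions for illustration, since the table does not state them and `train.py` may differ.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-3, warmup_steps=30, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps = 1000  # hypothetical; the actual step count is not stated here
for step in (0, 29, 30, 515, 999):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

The rate peaks at the end of warmup, passes through half the peak at the midpoint of the decay phase, and approaches `min_lr` at the final step.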

---

## Project Structure

```
slm-bahasa-id/
    model.py              # Transformer architecture (from scratch)
    model.safetensors     # Trained weights (~3.5 MB)
    config.json           # Model configuration
    bpe_tokenizer.py      # Custom BPE tokenizer implementation
    vocab.json            # Tokenizer vocabulary (4,000 tokens)
    merges.txt            # BPE merge rules
    tokenizer.json        # HF-compatible tokenizer config
    generate.py           # Text generation & demo script
    train.py              # Full training pipeline
    README.md             # This file
```

---

## Limitations

> This is a **proof-of-concept / educational model**:

- **840K params**: can continue sentences but doesn't "understand" them
- **Limited data**: trained on KBBI definitions, so outputs may be incoherent
- **Not for production**: for educational purposes only
- **Short context**: 64-token context window

---

## What This Demonstrates

Building this project from scratch demonstrates an understanding of:

| # | Topic | Details |
|---|---|---|
| 1 | **Tokenization** | BPE algorithm, subword encoding, vocabulary construction |
| 2 | **Transformer** | Multi-head attention, FFN, normalization, residual connections |
| 3 | **Modern Techniques** | RoPE, RMSNorm, SwiGLU: the same techniques used in production LLMs |
| 4 | **Training Pipeline** | Data loading, loss computation, gradient clipping, LR scheduling |
| 5 | **Text Generation** | Autoregressive decoding, top-k, top-p, temperature sampling |
| 6 | **Deployment** | Model serialization, HuggingFace Hub integration |
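As an illustration of the tokenization topic, the core of BPE training is a loop that repeatedly merges the most frequent adjacent symbol pair. The toy version below works on characters of whole words and is only a sketch of the general algorithm. It is not the implementation in `bpe_tokenizer.py`, and `train_bpe` and the sample words are hypothetical.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Represent every word as a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges

merges = train_bpe(["makan", "makanan", "minum", "minuman"], num_merges=3)
print(merges[0])  # ('a', 'n') is the most frequent pair, so it is merged first
```

Each learned merge adds one entry to the vocabulary, so a 4,000-token vocabulary corresponds to the base symbols plus a few thousand merges of this kind.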

---

## Contributing

Contributions are welcome! Feel free to:

- Open issues for bugs or feature requests
- Submit pull requests with improvements
- Share your experiments and results

---

## Author

Built by **Jekardah AI Lab**

---

## License

This project is licensed under the **MIT License**; see the [LICENSE](LICENSE) file for details.