@@ -8,45 +8,89 @@ tags:
   - slm
   - from-scratch
   - kbbi
 license: mit
 pipeline_tag: text-generation
 ---
-# SLM Bahasa Indonesia 🇮🇩
-A **Small Language Model** built entirely from scratch using PyTorch — trained on KBBI (Kamus Besar Bahasa Indonesia).
-## Overview
-This is a decoder-only Transformer (GPT-style) built from the ground up, demonstrating the full pipeline: custom tokenizer → model architecture → training → inference.
-### Architecture
-| Component | Detail |
-|---|---|
-| Type | Decoder-only Transformer |
-| Parameters | **840K** (~3.5 MB) |
-| Embedding dim | 128 |
-| Layers | 2 |
-| Attention heads | 4 |
-| FFN dim | 256 |
-| Context length | 64 tokens |
-| Vocab size | 4,000 (BPE, KBBI-trained) |
-### Modern Techniques Used
-- **RoPE** (Rotary Position Embedding) — same as LLaMA/Qwen
-- **RMSNorm** — more efficient than LayerNorm
-- **SwiGLU** activation — same as LLaMA/Mistral
-- **Weight tying** — embedding weights shared with output head
-- **Cosine LR schedule** with warmup
-## Quick Start
 ```python
 import torch
 from model import SmallLM
 from bpe_tokenizer import BPETokenizer
 model = SmallLM.from_pretrained("./")
 tokenizer = BPETokenizer.from_pretrained("./")
@@ -57,48 +101,233 @@ output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
 print(tokenizer.decode(output[0].tolist()))
 ```
-## Training Details
-- **Data**: KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian text corpus
-- **Tokenizer**: Custom BPE trained on KBBI (4,000 vocab)
-- **Optimizer**: AdamW (lr=1e-3, weight_decay=0.1)
-- **Training**: Next-token prediction (causal language modeling)
-## Limitations
-This is a **proof-of-concept / educational model**:
-- 840K params — can continue sentences but doesn't "understand"
-- Trained on limited data — outputs may be incoherent
-- Not suitable for production use
-- Value is in the **architecture and pipeline**, not output quality
-## Files
-| File | Description |
 |---|---|
-| `model.py` | Transformer architecture (from scratch) |
-| `model.safetensors` | Trained weights |
-| `config.json` | Model configuration |
-| `bpe_tokenizer.py` | Custom BPE tokenizer code |
-| `vocab.json` | Tokenizer vocabulary |
-| `merges.txt` | BPE merge rules |
-| `generate.py` | Text generation script |
-| `train.py` | Training script |
-## What This Demonstrates
-Building this project from scratch shows understanding of:
-1. **Tokenization** — BPE algorithm, subword encoding
-2. **Transformer architecture** — attention, FFN, normalization
-3. **Modern techniques** — RoPE, RMSNorm, SwiGLU
-4. **Training pipeline** — data loading, loss computation, optimization
-5. **Text generation** — autoregressive decoding, sampling strategies
-6. **Model deployment** — saving, loading, HuggingFace compatibility
-## Author
-Built by **Jekardah AI Lab** 🇮🇩
-## License
-MIT License

   - slm
   - from-scratch
   - kbbi
+  - pytorch
+  - educational
 license: mit
 pipeline_tag: text-generation
 ---
+<div align="center">
+# <img src="https://em-content.zobj.net/source/twitter/376/flag-indonesia_1f1ee-1f1e9.png" width="36"/> SLM Bahasa Indonesia
+**Small Language Model | Built from Scratch | Powered by KBBI**
+[![Python](https://img.shields.io/badge/Python-3.8+-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org)
+[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)](https://pytorch.org)
+[![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge&logo=opensourceinitiative&logoColor=white)](LICENSE)
+[![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/romizone/slm-bahasa-id)
+<img src="https://img.shields.io/badge/Parameters-840K-blue?style=flat-square"/> <img src="https://img.shields.io/badge/Model_Size-3.5_MB-blue?style=flat-square"/> <img src="https://img.shields.io/badge/Vocab-4,000_BPE-blue?style=flat-square"/> <img src="https://img.shields.io/badge/Data-KBBI_1,844_pages-blue?style=flat-square"/>
+---
+*A decoder-only Transformer (GPT-style) built entirely from the ground up using PyTorch,
+trained on Kamus Besar Bahasa Indonesia (KBBI).*
+</div>
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/rocket_1f680.png" width="24"/> Overview
+This project demonstrates the **complete pipeline** of building a language model from scratch:
+```
+Custom BPE Tokenizer --> Transformer Architecture --> Training --> Inference --> Deployment
+```
+> **Note:** This is an educational/proof-of-concept model. The value is in the **architecture and pipeline**, not output quality.
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/building-construction_1f3d7-fe0f.png" width="24"/> Architecture
+<table>
+<tr><td><b>Component</b></td><td><b>Detail</b></td></tr>
+<tr><td><img src="https://em-content.zobj.net/source/twitter/376/brain_1f9e0.png" width="16"/> Type</td><td>Decoder-only Transformer (GPT-style)</td></tr>
+<tr><td><img src="https://em-content.zobj.net/source/twitter/376/bar-chart_1f4ca.png" width="16"/> Parameters</td><td><b>840K</b> (~3.5 MB)</td></tr>
+<tr><td><img src="https://em-content.zobj.net/source/twitter/376/gear_2699-fe0f.png" width="16"/> Embedding dim</td><td>128</td></tr>
+<tr><td><img src="https://em-content.zobj.net/source/twitter/376/bricks_1f9f1.png" width="16"/> Layers</td><td>2</td></tr>
+<tr><td><img src="https://em-content.zobj.net/source/twitter/376/eyes_1f440.png" width="16"/> Attention heads</td><td>4</td></tr>
+<tr><td><img src="https://em-content.zobj.net/source/twitter/376/zap_26a1.png" width="16"/> FFN dim</td><td>256</td></tr>
+<tr><td><img src="https://em-content.zobj.net/source/twitter/376/straight-ruler_1f4cf.png" width="16"/> Context length</td><td>64 tokens</td></tr>
+<tr><td><img src="https://em-content.zobj.net/source/twitter/376/books_1f4da.png" width="16"/> Vocab size</td><td>4,000 (BPE, KBBI-trained)</td></tr>
+</table>
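As a rough sanity check on the table above, the parameter count can be rebuilt from the listed dimensions. This is a minimal sketch under an assumed layout that the README does not confirm: bias-free linear layers, three SwiGLU matrices per block, two RMSNorm gains per block plus a final one, and an output head tied to the embedding.

```python
# Assumed layout (not confirmed by the source): bias-free linear layers,
# three SwiGLU matrices per block, two RMSNorm gains per block plus a final
# one, and an output head tied to the token embedding. RoPE has no parameters.
vocab_size, d_model, n_layers, d_ffn = 4000, 128, 2, 256

embedding = vocab_size * d_model   # shared with the output head (weight tying)
attention = 4 * d_model * d_model  # Q, K, V and output projections
ffn = 3 * d_model * d_ffn          # gate, up and down projections
norms = 2 * d_model                # pre-attention and pre-FFN gains
per_layer = attention + ffn + norms

total = embedding + n_layers * per_layer + d_model  # + final norm gain
size_mb = total * 4 / 1e6                           # float32 = 4 bytes each

print(f"{total:,} parameters, about {size_mb:.1f} MB in float32")
```

Under these assumptions the total comes to 840,320, which is consistent with the 840K figure, and the float32 weights land close to the stated ~3.5 MB.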
+### <img src="https://em-content.zobj.net/source/twitter/376/sparkles_2728.png" width="20"/> Modern Techniques
+| Technique | Description | Used By |
+|---|---|---|
+| <img src="https://em-content.zobj.net/source/twitter/376/cyclone_1f300.png" width="16"/> **RoPE** | Rotary Position Embedding | LLaMA, Qwen |
+| <img src="https://em-content.zobj.net/source/twitter/376/high-voltage_26a1.png" width="16"/> **RMSNorm** | Root Mean Square Normalization | LLaMA, Gemma |
+| <img src="https://em-content.zobj.net/source/twitter/376/fire_1f525.png" width="16"/> **SwiGLU** | Gated Linear Unit with Swish | LLaMA, Mistral |
+| <img src="https://em-content.zobj.net/source/twitter/376/link_1f517.png" width="16"/> **Weight Tying** | Shared embedding & output weights | GPT-2, LLaMA |
+| <img src="https://em-content.zobj.net/source/twitter/376/chart-decreasing_1f4c9.png" width="16"/> **Cosine LR** | Cosine schedule with warmup | Standard practice |
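Two of the techniques in the table, RMSNorm and SwiGLU, are small enough to sketch in plain Python. The functions below are illustrative only and operate on lists; the names `rms_norm`, `silu`, `matvec`, and `swiglu_ffn` are hypothetical and are not taken from `model.py`, which presumably uses PyTorch tensors.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """Swish / SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x)), without biases."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
out = swiglu_ffn([1.0, 0.0], w_gate=[[1.0, 0.0]], w_up=[[2.0, 0.0]],
                 w_down=[[1.0], [1.0]])
print(normed, out)
```

Unlike LayerNorm, RMSNorm skips mean subtraction and the bias term, so it needs only one gain vector per normalization.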
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/laptop_1f4bb.png" width="24"/> Quick Start (Local)
+```bash
+# Clone the repository
+git clone https://huggingface.co/romizone/slm-bahasa-id
+cd slm-bahasa-id
+# Install dependencies
+pip install torch safetensors
+```
 ```python
 import torch
 from model import SmallLM
 from bpe_tokenizer import BPETokenizer
+# Load model & tokenizer
 model = SmallLM.from_pretrained("./")
 tokenizer = BPETokenizer.from_pretrained("./")
 print(tokenizer.decode(output[0].tolist()))
 ```
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/test-tube_1f9ea.png" width="24"/> Run on Google Colab
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)
+Create a new notebook in Google Colab, then run the following cells:
+### Cell 1 - Setup & Download Model
+```python
+# Install dependencies
+!pip install torch safetensors huggingface_hub -q
+# Download the model from Hugging Face
+from huggingface_hub import snapshot_download
+model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
+print(f"Model downloaded to: {model_dir}")
+```
+### Cell 2 - Load Model
+```python
+import sys, torch
+sys.path.insert(0, model_dir)
+from model import SmallLM
+from bpe_tokenizer import BPETokenizer
+model = SmallLM.from_pretrained(model_dir)
+tokenizer = BPETokenizer.from_pretrained(model_dir)
+print(f"Model loaded! Parameters: {model.count_parameters():,}")
+```
+### Cell 3 - Generate Text
+```python
+def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
+    ids = tokenizer.encode(prompt.lower())
+    input_ids = torch.tensor([ids])
+    output = model.generate(input_ids, max_new_tokens=max_tokens,
+                            temperature=temperature, top_k=top_k)
+    return tokenizer.decode(output[0].tolist())
+# Try a variety of prompts
+prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
+           "ekonomi", "kebudayaan", "demokrasi", "hutan"]
+for p in prompts:
+    result = generate_text(p)
+    print(f"Prompt: \"{p}\"")
+    print(f"Output: {result[:100]}")
+    print("-" * 60)
+```
+### Cell 4 - Interactive Mode (Optional)
+```python
+# Interactive: type your own prompts
+while True:
+    prompt = input("\nEnter a prompt (type 'quit' to exit): ")
+    if prompt.lower() in ['quit', 'exit', 'q']:
+        break
+    result = generate_text(prompt, max_tokens=50)
+    print(f"Output: {result}")
+```
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/gem-stone_1f48e.png" width="24"/> Run on Kaggle
+[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/)
+Create a new notebook on Kaggle, then run the following cells:
+### Cell 1 - Setup & Download Model
+```python
+# Install huggingface_hub (torch and safetensors are pre-installed on Kaggle)
+!pip install huggingface_hub -q
+# Download model
+from huggingface_hub import snapshot_download
+model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
+print(f"Model downloaded to: {model_dir}")
+```
+### Cell 2 - Load Model
+```python
+import sys, torch
+sys.path.insert(0, model_dir)
+from model import SmallLM
+from bpe_tokenizer import BPETokenizer
+# Use the GPU if one is available
+device = "cuda" if torch.cuda.is_available() else "cpu"
+print(f"Using device: {device}")
+model = SmallLM.from_pretrained(model_dir, device=device)
+tokenizer = BPETokenizer.from_pretrained(model_dir)
+print(f"Model loaded! Parameters: {model.count_parameters():,}")
+```
+### Cell 3 - Generate Text
+```python
+def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
+    ids = tokenizer.encode(prompt.lower())
+    input_ids = torch.tensor([ids]).to(device)
+    output = model.generate(input_ids, max_new_tokens=max_tokens,
+                            temperature=temperature, top_k=top_k)
+    return tokenizer.decode(output[0].tolist())
+# Try a variety of prompts
+prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
+           "ekonomi", "kebudayaan", "demokrasi", "hutan"]
+for p in prompts:
+    result = generate_text(p)
+    print(f"Prompt: \"{p}\"")
+    print(f"Output: {result[:100]}")
+    print("-" * 60)
+```
+### Cell 4 - Retrain the Model on Kaggle (Optional)
+```python
+# To retrain with your own data:
+import shutil, os
+# Copy the files to the working directory
+work_dir = "/kaggle/working/slm"
+os.makedirs(work_dir, exist_ok=True)
+for f in os.listdir(model_dir):
+    shutil.copy2(os.path.join(model_dir, f), os.path.join(work_dir, f))
+os.chdir(work_dir)
+# Edit train.py as needed, then:
+# !python train.py
+```
+> **Kaggle tips:**
+> - Use the **GPU P100** (free) for faster training
+> - Enable the GPU: *Settings > Accelerator > GPU*
+> - Kaggle comes with PyTorch pre-installed, so there is no need to reinstall it
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/graduation-cap_1f393.png" width="24"/> Training Details
+| | Detail |
 |---|---|
+| <img src="https://em-content.zobj.net/source/twitter/376/books_1f4da.png" width="16"/> **Data** | KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian corpus |
+| <img src="https://em-content.zobj.net/source/twitter/376/abacus_1f9ee.png" width="16"/> **Tokenizer** | Custom BPE trained on KBBI (4,000 vocab) |
+| <img src="https://em-content.zobj.net/source/twitter/376/wrench_1f527.png" width="16"/> **Optimizer** | AdamW (lr=1e-3, weight_decay=0.1) |
+| <img src="https://em-content.zobj.net/source/twitter/376/bullseye_1f3af.png" width="16"/> **Objective** | Next-token prediction (causal language modeling) |
+| <img src="https://em-content.zobj.net/source/twitter/376/shield_1f6e1-fe0f.png" width="16"/> **Gradient** | Clipping at norm 1.0 |
+| <img src="https://em-content.zobj.net/source/twitter/376/chart-decreasing_1f4c9.png" width="16"/> **Schedule** | Cosine decay with 30-step warmup |
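The schedule row above (cosine decay with a 30-step warmup, peak learning rate 1e-3) can be sketched as a plain function of the step number. The total step count, the linear warmup shape, and the floor learning rate are assumptions for illustration; the README does not state them, and `lr_at` is a hypothetical name rather than a function from `train.py`.

```python
import math

def lr_at(step, peak_lr=1e-3, warmup_steps=30, total_steps=1000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    total_steps and min_lr are illustrative assumptions, not values
    taken from the training script.
    """
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 29, 30, 515, 1000):
    print(step, f"{lr_at(step):.6f}")
```

In PyTorch, a function like this is typically wired in through `torch.optim.lr_scheduler.LambdaLR`, with `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` called before each optimizer step for the gradient clipping listed in the table.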
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/open-file-folder_1f4c2.png" width="24"/> Project Structure
+```
+slm-bahasa-id/
+  model.py              # Transformer architecture (from scratch)
+  model.safetensors     # Trained weights (~3.5 MB)
+  config.json           # Model configuration
+  bpe_tokenizer.py      # Custom BPE tokenizer implementation
+  vocab.json            # Tokenizer vocabulary (4,000 tokens)
+  merges.txt            # BPE merge rules
+  tokenizer.json        # HF-compatible tokenizer config
+  generate.py           # Text generation & demo script
+  train.py              # Full training pipeline
+  README.md             # This file
+```
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/warning_26a0-fe0f.png" width="24"/> Limitations
+> This is a **proof-of-concept / educational model**:
+- <img src="https://em-content.zobj.net/source/twitter/376/small-blue-diamond_1f539.png" width="14"/> **840K params** — can continue sentences but doesn't "understand"
+- <img src="https://em-content.zobj.net/source/twitter/376/small-blue-diamond_1f539.png" width="14"/> **Limited data** — trained on KBBI definitions, outputs may be incoherent
+- <img src="https://em-content.zobj.net/source/twitter/376/small-blue-diamond_1f539.png" width="14"/> **Not for production** — educational purpose only
+- <img src="https://em-content.zobj.net/source/twitter/376/small-blue-diamond_1f539.png" width="14"/> **Short context** — 64 token context window
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/light-bulb_1f4a1.png" width="24"/> What This Demonstrates
+Building this project from scratch demonstrates understanding of:
+| # | Topic | Details |
+|---|---|---|
+| 1 | <img src="https://em-content.zobj.net/source/twitter/376/puzzle-piece_1f9e9.png" width="16"/> **Tokenization** | BPE algorithm, subword encoding, vocabulary construction |
+| 2 | <img src="https://em-content.zobj.net/source/twitter/376/brain_1f9e0.png" width="16"/> **Transformer** | Multi-head attention, FFN, normalization, residual connections |
+| 3 | <img src="https://em-content.zobj.net/source/twitter/376/sparkles_2728.png" width="16"/> **Modern Techniques** | RoPE, RMSNorm, SwiGLU — same as production LLMs |
+| 4 | <img src="https://em-content.zobj.net/source/twitter/376/weight-lifting_1f3cb-fe0f.png" width="16"/> **Training Pipeline** | Data loading, loss computation, gradient clipping, LR scheduling |
+| 5 | <img src="https://em-content.zobj.net/source/twitter/376/speech-balloon_1f4ac.png" width="16"/> **Text Generation** | Autoregressive decoding, top-k, top-p, temperature sampling |
+| 6 | <img src="https://em-content.zobj.net/source/twitter/376/package_1f4e6.png" width="16"/> **Deployment** | Model serialization, HuggingFace Hub integration |
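The text-generation row mentions top-k and temperature sampling, the two knobs exposed by `model.generate` in the cells above. A minimal standard-library sketch of one sampling step follows; `sample_top_k` is a hypothetical helper written for illustration and is not the repository's implementation.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token id from the top_k highest logits after temperature scaling."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits; everything else gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax numerators, shifted by the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0, 1.5]
print(sample_top_k(logits, top_k=2, rng=random.Random(0)))
```

Autoregressive decoding repeats this step: append the sampled id to the input, run the model again, and sample from the new last-position logits until the token budget is used up.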
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/handshake_1f91d.png" width="24"/> Contributing
+Contributions are welcome! Feel free to:
+- Open issues for bugs or feature requests
+- Submit pull requests with improvements
+- Share your experiments and results
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/bust-in-silhouette_1f464.png" width="24"/> Author
+<div align="center">
+Built with <img src="https://em-content.zobj.net/source/twitter/376/red-heart_2764-fe0f.png" width="16"/> by **Jekardah AI Lab** <img src="https://em-content.zobj.net/source/twitter/376/flag-indonesia_1f1ee-1f1e9.png" width="20"/>
+</div>
+---
+## <img src="https://em-content.zobj.net/source/twitter/376/scroll_1f4dc.png" width="24"/> License
+This project is licensed under the **MIT License** — see the [LICENSE](LICENSE) file for details.