romizone committed on
Commit
2cff0a8
·
verified ·
1 Parent(s): 9815efc

Upload SLM Bahasa Indonesia

Files changed (1)
  1. README.md +286 -57
README.md CHANGED
@@ -8,45 +8,89 @@ tags:
8
  - slm
9
  - from-scratch
10
  - kbbi
 
 
11
  license: mit
12
  pipeline_tag: text-generation
13
  ---
14
 
15
- # SLM Bahasa Indonesia 🇮🇩
16
 
17
- A **Small Language Model** built entirely from scratch using PyTorch — trained on KBBI (Kamus Besar Bahasa Indonesia).
18
 
19
- ## Overview
20
 
21
- This is a decoder-only Transformer (GPT-style) built from the ground up, demonstrating the full pipeline: custom tokenizer → model architecture → training → inference.
 
 
 
22
 
23
- ### Architecture
24
 
25
- | Component | Detail |
26
- |---|---|
27
- | Type | Decoder-only Transformer |
28
- | Parameters | **840K** (~3.5 MB) |
29
- | Embedding dim | 128 |
30
- | Layers | 2 |
31
- | Attention heads | 4 |
32
- | FFN dim | 256 |
33
- | Context length | 64 tokens |
34
- | Vocab size | 4,000 (BPE, KBBI-trained) |
35
-
36
- ### Modern Techniques Used
37
- - **RoPE** (Rotary Position Embedding) — same as LLaMA/Qwen
38
- - **RMSNorm** — more efficient than LayerNorm
39
- - **SwiGLU** activation same as LLaMA/Mistral
40
- - **Weight tying** — embedding weights shared with output head
41
- - **Cosine LR schedule** with warmup
42
-
43
- ## Quick Start
44
 
45
  ```python
46
  import torch
47
  from model import SmallLM
48
  from bpe_tokenizer import BPETokenizer
49
 
 
50
  model = SmallLM.from_pretrained("./")
51
  tokenizer = BPETokenizer.from_pretrained("./")
52
 
@@ -57,48 +101,233 @@ output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
57
  print(tokenizer.decode(output[0].tolist()))
58
  ```
59
 
60
- ## Training Details
61
 
62
- - **Data**: KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian text corpus
63
- - **Tokenizer**: Custom BPE trained on KBBI (4,000 vocab)
64
- - **Optimizer**: AdamW (lr=1e-3, weight_decay=0.1)
65
- - **Training**: Next-token prediction (causal language modeling)
66
 
67
- ## Limitations
 
68
 
69
- This is a **proof-of-concept / educational model**:
70
- - 840K params — can continue sentences but doesn't "understand"
71
- - Trained on limited data — outputs may be incoherent
72
- - Not suitable for production use
73
- - Value is in the **architecture and pipeline**, not output quality
74
 
75
- ## Files
 
 
76
 
77
- | File | Description |
78
  |---|---|
79
- | `model.py` | Transformer architecture (from scratch) |
80
- | `model.safetensors` | Trained weights |
81
- | `config.json` | Model configuration |
82
- | `bpe_tokenizer.py` | Custom BPE tokenizer code |
83
- | `vocab.json` | Tokenizer vocabulary |
84
- | `merges.txt` | BPE merge rules |
85
- | `generate.py` | Text generation script |
86
- | `train.py` | Training script |
87
 
88
- ## What This Demonstrates
 
 
89
 
90
- Building this project from scratch shows understanding of:
91
- 1. **Tokenization** — BPE algorithm, subword encoding
92
- 2. **Transformer architecture** — attention, FFN, normalization
93
- 3. **Modern techniques** — RoPE, RMSNorm, SwiGLU
94
- 4. **Training pipeline** — data loading, loss computation, optimization
95
- 5. **Text generation** — autoregressive decoding, sampling strategies
96
- 6. **Model deployment** — saving, loading, HuggingFace compatibility
97
 
98
- ## Author
99
 
100
- Built by **Jekardah AI Lab** 🇮🇩
101
 
102
- ## License
103
 
104
- MIT License
 
8
  - slm
9
  - from-scratch
10
  - kbbi
11
+ - pytorch
12
+ - educational
13
  license: mit
14
  pipeline_tag: text-generation
15
  ---
16
 
17
+ <div align="center">
18
 
19
+ # <img src="https://em-content.zobj.net/source/twitter/376/flag-indonesia_1f1ee-1f1e9.png" width="36"/> SLM Bahasa Indonesia
20
 
21
+ **Small Language Model | Built from Scratch | Powered by KBBI**
22
 
23
+ [![Python](https://img.shields.io/badge/Python-3.8+-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org)
24
+ [![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)](https://pytorch.org)
25
+ [![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge&logo=opensourceinitiative&logoColor=white)](LICENSE)
26
+ [![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/romizone/slm-bahasa-id)
27
 
28
+ <img src="https://img.shields.io/badge/Parameters-840K-blue?style=flat-square"/> <img src="https://img.shields.io/badge/Model_Size-3.5_MB-blue?style=flat-square"/> <img src="https://img.shields.io/badge/Vocab-4,000_BPE-blue?style=flat-square"/> <img src="https://img.shields.io/badge/Data-KBBI_1,844_pages-blue?style=flat-square"/>
29
 
30
+ ---
31
+
32
+ *A decoder-only Transformer (GPT-style) built entirely from the ground up using PyTorch,
33
+ trained on Kamus Besar Bahasa Indonesia (KBBI).*
34
+
35
+ </div>
36
+
37
+ ---
38
+
39
+ ## <img src="https://em-content.zobj.net/source/twitter/376/rocket_1f680.png" width="24"/> Overview
40
+
41
+ This project demonstrates the **complete pipeline** of building a language model from scratch:
42
+
43
+ ```
44
+ Custom BPE Tokenizer --> Transformer Architecture --> Training --> Inference --> Deployment
45
+ ```
46
+
47
+ > **Note:** This is an educational/proof-of-concept model. The value is in the **architecture and pipeline**, not output quality.
48
+
49
+ ---
50
+
51
+ ## <img src="https://em-content.zobj.net/source/twitter/376/building-construction_1f3d7-fe0f.png" width="24"/> Architecture
52
+
53
+ <table>
54
+ <tr><td><b>Component</b></td><td><b>Detail</b></td></tr>
55
+ <tr><td><img src="https://em-content.zobj.net/source/twitter/376/brain_1f9e0.png" width="16"/> Type</td><td>Decoder-only Transformer (GPT-style)</td></tr>
56
+ <tr><td><img src="https://em-content.zobj.net/source/twitter/376/bar-chart_1f4ca.png" width="16"/> Parameters</td><td><b>840K</b> (~3.5 MB)</td></tr>
57
+ <tr><td><img src="https://em-content.zobj.net/source/twitter/376/gear_2699-fe0f.png" width="16"/> Embedding dim</td><td>128</td></tr>
58
+ <tr><td><img src="https://em-content.zobj.net/source/twitter/376/bricks_1f9f1.png" width="16"/> Layers</td><td>2</td></tr>
59
+ <tr><td><img src="https://em-content.zobj.net/source/twitter/376/eyes_1f440.png" width="16"/> Attention heads</td><td>4</td></tr>
60
+ <tr><td><img src="https://em-content.zobj.net/source/twitter/376/zap_26a1.png" width="16"/> FFN dim</td><td>256</td></tr>
61
+ <tr><td><img src="https://em-content.zobj.net/source/twitter/376/straight-ruler_1f4cf.png" width="16"/> Context length</td><td>64 tokens</td></tr>
62
+ <tr><td><img src="https://em-content.zobj.net/source/twitter/376/books_1f4da.png" width="16"/> Vocab size</td><td>4,000 (BPE, KBBI-trained)</td></tr>
63
+ </table>
64
+
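The 840K figure can be roughly reproduced from the table above. The sketch below is an estimate, not the repository's `count_parameters()`. It assumes bias-free linear layers, tied input/output embeddings, a three-matrix SwiGLU feed-forward block, two RMSNorm weight vectors per layer plus one final norm, and no learned parameters for RoPE. `estimate_params` is a hypothetical helper name.

```python
def estimate_params(vocab=4000, d_model=128, n_layers=2, d_ff=256):
    """Rough parameter estimate under the assumptions stated above."""
    embedding = vocab * d_model          # shared with the output head (weight tying)
    attention = 4 * d_model * d_model    # Q, K, V and output projections, no biases
    ffn = 3 * d_model * d_ff             # SwiGLU: gate, up and down projections
    norms = 2 * d_model                  # one RMSNorm before attention, one before the FFN
    per_layer = attention + ffn + norms
    final_norm = d_model                 # RMSNorm applied before the output head
    return embedding + n_layers * per_layer + final_norm


total = estimate_params()
print(f"{total:,} parameters, about {total * 4 / 1e6:.2f} MB in float32")
```

Under these assumptions the estimate comes to 840,320 parameters (about 3.36 MB as float32), which is in line with the 840K listed in the table. The number of attention heads does not change the count, because the heads split the same 128-dimensional projections.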
65
+ ### <img src="https://em-content.zobj.net/source/twitter/376/sparkles_2728.png" width="20"/> Modern Techniques
66
+
67
+ | Technique | Description | Used By |
68
+ |---|---|---|
69
+ | <img src="https://em-content.zobj.net/source/twitter/376/cyclone_1f300.png" width="16"/> **RoPE** | Rotary Position Embedding | LLaMA, Qwen |
70
+ | <img src="https://em-content.zobj.net/source/twitter/376/high-voltage_26a1.png" width="16"/> **RMSNorm** | Root Mean Square Normalization | LLaMA, Gemma |
71
+ | <img src="https://em-content.zobj.net/source/twitter/376/fire_1f525.png" width="16"/> **SwiGLU** | Gated Linear Unit with Swish | LLaMA, Mistral |
72
+ | <img src="https://em-content.zobj.net/source/twitter/376/link_1f517.png" width="16"/> **Weight Tying** | Shared embedding & output weights | GPT-2, LLaMA |
73
+ | <img src="https://em-content.zobj.net/source/twitter/376/chart-decreasing_1f4c9.png" width="16"/> **Cosine LR** | Cosine schedule with warmup | Standard practice |
74
+
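Two of the techniques in the table can be illustrated without any framework. The following is a minimal pure-Python sketch, not the code in `model.py`: the function names, the pairing of adjacent dimensions in `rope`, the base of 10000, and the `eps` value are all assumptions made for illustration. It shows that RMSNorm rescales a vector to unit root mean square, and that RoPE makes the query-key dot product depend only on the distance between positions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The same offset (3) at different absolute positions gives the same score.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 13), rope(k, 10))
print(abs(score_near - score_far) < 1e-9)

# After RMSNorm with unit weights, the root mean square is close to 1.
normed = rms_norm([3.0, -4.0, 0.0, 0.0], weight=[1.0] * 4)
normed_rms = math.sqrt(sum(v * v for v in normed) / len(normed))
print(round(normed_rms, 4))
```

Unlike LayerNorm, RMSNorm does not subtract the mean and has no bias term, which is where its lower cost comes from.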
75
+ ---
76
+
77
+ ## <img src="https://em-content.zobj.net/source/twitter/376/laptop_1f4bb.png" width="24"/> Quick Start (Local)
78
+
79
+ ```bash
80
+ # Clone the repository
81
+ git clone https://huggingface.co/romizone/slm-bahasa-id
82
+ cd slm-bahasa-id
83
+
84
+ # Install dependencies
85
+ pip install torch safetensors
86
+ ```
87
 
88
  ```python
89
  import torch
90
  from model import SmallLM
91
  from bpe_tokenizer import BPETokenizer
92
 
93
+ # Load model & tokenizer
94
  model = SmallLM.from_pretrained("./")
95
  tokenizer = BPETokenizer.from_pretrained("./")
96
 
 
101
  print(tokenizer.decode(output[0].tolist()))
102
  ```
103
 
104
+ ---
105
+
106
+ ## <img src="https://em-content.zobj.net/source/twitter/376/test-tube_1f9ea.png" width="24"/> Run on Google Colab
107
+
108
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)
109
+
110
+ Create a new notebook in Google Colab, then run the following cells:
111
+
112
+ ### Cell 1 - Setup & Download Model
113
+
114
+ ```python
115
+ # Install dependencies
116
+ !pip install torch safetensors huggingface_hub -q
117
+
118
+ # Download the model from Hugging Face
119
+ from huggingface_hub import snapshot_download
120
+ model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
121
+ print(f"Model downloaded to: {model_dir}")
122
+ ```
123
+
124
+ ### Cell 2 - Load Model
125
 
126
+ ```python
127
+ import sys, torch
128
+ sys.path.insert(0, model_dir)
 
129
 
130
+ from model import SmallLM
131
+ from bpe_tokenizer import BPETokenizer
132
 
133
+ model = SmallLM.from_pretrained(model_dir)
134
+ tokenizer = BPETokenizer.from_pretrained(model_dir)
135
+ print(f"Model loaded! Parameters: {model.count_parameters():,}")
136
+ ```
137
+
138
+ ### Cell 3 - Generate Text
139
+
140
+ ```python
+ def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
+     """Encode a prompt, sample a continuation, and decode it back to text."""
+     ids = tokenizer.encode(prompt.lower())
+     input_ids = torch.tensor([ids])
+     output = model.generate(input_ids, max_new_tokens=max_tokens,
+                             temperature=temperature, top_k=top_k)
+     return tokenizer.decode(output[0].tolist())

+ # Try a variety of prompts
+ prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
+            "ekonomi", "kebudayaan", "demokrasi", "hutan"]

+ for p in prompts:
+     result = generate_text(p)
+     print(f"Prompt: \"{p}\"")
+     print(f"Output: {result[:100]}")
+     print("-" * 60)
+ ```
158
+
159
+ ### Cell 4 - Interactive Mode (Optional)
160
+
161
+ ```python
+ # Interactive: type your own prompt
+ while True:
+     prompt = input("\nEnter a prompt (type 'quit' to exit): ")
+     if prompt.lower() in ['quit', 'exit', 'q']:
+         break
+     result = generate_text(prompt, max_tokens=50)
+     print(f"Output: {result}")
+ ```
170
+
171
+ ---
172
+
173
+ ## <img src="https://em-content.zobj.net/source/twitter/376/gem-stone_1f48e.png" width="24"/> Run on Kaggle
174
+
175
+ [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/)
176
+
177
+ Create a new notebook on Kaggle, then run the following cells:
178
+
179
+ ### Cell 1 - Setup & Download Model
180
+
181
+ ```python
182
+ # Install huggingface_hub (torch and safetensors are already pre-installed on Kaggle)
183
+ !pip install huggingface_hub -q
184
+
185
+ # Download model
186
+ from huggingface_hub import snapshot_download
187
+ model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
188
+ print(f"Model downloaded to: {model_dir}")
189
+ ```
190
+
191
+ ### Cell 2 - Load Model
192
+
193
+ ```python
194
+ import sys, torch
195
+ sys.path.insert(0, model_dir)
196
+
197
+ from model import SmallLM
198
+ from bpe_tokenizer import BPETokenizer
199
+
200
+ # Use the GPU if one is available
201
+ device = "cuda" if torch.cuda.is_available() else "cpu"
202
+ print(f"Using device: {device}")
203
+
204
+ model = SmallLM.from_pretrained(model_dir, device=device)
205
+ tokenizer = BPETokenizer.from_pretrained(model_dir)
206
+ print(f"Model loaded! Parameters: {model.count_parameters():,}")
207
+ ```
208
+
209
+ ### Cell 3 - Generate Text
210
+
211
+ ```python
+ def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
+     """Encode a prompt, sample a continuation on `device`, and decode it."""
+     ids = tokenizer.encode(prompt.lower())
+     input_ids = torch.tensor([ids]).to(device)
+     output = model.generate(input_ids, max_new_tokens=max_tokens,
+                             temperature=temperature, top_k=top_k)
+     return tokenizer.decode(output[0].tolist())
+
+ # Try a variety of prompts
+ prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
+            "ekonomi", "kebudayaan", "demokrasi", "hutan"]
+
+ for p in prompts:
+     result = generate_text(p)
+     print(f"Prompt: \"{p}\"")
+     print(f"Output: {result[:100]}")
+     print("-" * 60)
+ ```
229
+
230
+ ### Cell 4 - Retrain the Model on Kaggle (Optional)
231
+
232
+ ```python
+ # To retrain with your own data:
+ import shutil, os
+
+ # Copy the files to the working directory
+ work_dir = "/kaggle/working/slm"
+ os.makedirs(work_dir, exist_ok=True)
+ for f in os.listdir(model_dir):
+     src = os.path.join(model_dir, f)
+     if os.path.isfile(src):  # skip subdirectories, which copy2 cannot handle
+         shutil.copy2(src, os.path.join(work_dir, f))
+
+ os.chdir(work_dir)
+
+ # Edit train.py as needed, then:
+ # !python train.py
+ ```
247
+
248
+ > **Kaggle tips:**
249
+ > - Use the **P100 GPU** (free) for faster training
250
+ > - Enable the GPU under *Settings > Accelerator > GPU*
251
+ > - Kaggle comes with PyTorch pre-installed, so there is no need to reinstall it
252
+
253
+ ---
254
+
255
+ ## <img src="https://em-content.zobj.net/source/twitter/376/graduation-cap_1f393.png" width="24"/> Training Details
256
+
257
+ | | Detail |
258
  |---|---|
259
+ | <img src="https://em-content.zobj.net/source/twitter/376/books_1f4da.png" width="16"/> **Data** | KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian corpus |
260
+ | <img src="https://em-content.zobj.net/source/twitter/376/abacus_1f9ee.png" width="16"/> **Tokenizer** | Custom BPE trained on KBBI (4,000 vocab) |
261
+ | <img src="https://em-content.zobj.net/source/twitter/376/wrench_1f527.png" width="16"/> **Optimizer** | AdamW (lr=1e-3, weight_decay=0.1) |
262
+ | <img src="https://em-content.zobj.net/source/twitter/376/bullseye_1f3af.png" width="16"/> **Objective** | Next-token prediction (causal language modeling) |
263
+ | <img src="https://em-content.zobj.net/source/twitter/376/shield_1f6e1-fe0f.png" width="16"/> **Gradient** | Clipping at norm 1.0 |
264
+ | <img src="https://em-content.zobj.net/source/twitter/376/chart-decreasing_1f4c9.png" width="16"/> **Schedule** | Cosine decay with 30-step warmup |
265
+
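The schedule and clipping rows above can be sketched in plain Python. This is an illustration under stated assumptions rather than the contents of `train.py`: the peak rate of 1e-3, the 30 warmup steps, and the clipping norm of 1.0 come from the table, while `total_steps=1000`, `min_lr=0.0`, and both function names are hypothetical.

```python
import math


def lr_at(step, max_lr=1e-3, warmup_steps=30, total_steps=1000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


# Learning rate at a few points: start of warmup, end of warmup, mid-decay, end.
for step in (0, 29, 30, 500, 1000):
    print(step, f"{lr_at(step):.6f}")

# A gradient with norm 5 is scaled down to norm 1; the direction is unchanged.
clipped = clip_by_global_norm([3.0, 4.0])
print(clipped)
```

In a PyTorch training loop the same two steps are usually handled by `torch.nn.utils.clip_grad_norm_` and a scheduler such as `torch.optim.lr_scheduler.LambdaLR`. Whether `train.py` uses those exact utilities is not stated here.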
266
+ ---
267
+
268
+ ## <img src="https://em-content.zobj.net/source/twitter/376/open-file-folder_1f4c2.png" width="24"/> Project Structure
269
+
270
+ ```
271
+ slm-bahasa-id/
272
+ model.py # Transformer architecture (from scratch)
273
+ model.safetensors # Trained weights (~3.5 MB)
274
+ config.json # Model configuration
275
+ bpe_tokenizer.py # Custom BPE tokenizer implementation
276
+ vocab.json # Tokenizer vocabulary (4,000 tokens)
277
+ merges.txt # BPE merge rules
278
+ tokenizer.json # HF-compatible tokenizer config
279
+ generate.py # Text generation & demo script
280
+ train.py # Full training pipeline
281
+ README.md # This file
282
+ ```
283
+
284
+ ---
285
+
286
+ ## <img src="https://em-content.zobj.net/source/twitter/376/warning_26a0-fe0f.png" width="24"/> Limitations
287
+
288
+ > This is a **proof-of-concept / educational model**:
289
+
290
+ - <img src="https://em-content.zobj.net/source/twitter/376/small-blue-diamond_1f539.png" width="14"/> **840K params** — can continue sentences but doesn't "understand"
291
+ - <img src="https://em-content.zobj.net/source/twitter/376/small-blue-diamond_1f539.png" width="14"/> **Limited data** — trained on KBBI definitions, outputs may be incoherent
292
+ - <img src="https://em-content.zobj.net/source/twitter/376/small-blue-diamond_1f539.png" width="14"/> **Not for production** — educational purpose only
293
+ - <img src="https://em-content.zobj.net/source/twitter/376/small-blue-diamond_1f539.png" width="14"/> **Short context** — 64 token context window
294
 
295
+ ---
296
+
297
+ ## <img src="https://em-content.zobj.net/source/twitter/376/light-bulb_1f4a1.png" width="24"/> What This Demonstrates
298
 
299
+ Building this project from scratch demonstrates understanding of:
300
 
301
+ | # | Topic | Details |
302
+ |---|---|---|
303
+ | 1 | <img src="https://em-content.zobj.net/source/twitter/376/puzzle-piece_1f9e9.png" width="16"/> **Tokenization** | BPE algorithm, subword encoding, vocabulary construction |
304
+ | 2 | <img src="https://em-content.zobj.net/source/twitter/376/brain_1f9e0.png" width="16"/> **Transformer** | Multi-head attention, FFN, normalization, residual connections |
305
+ | 3 | <img src="https://em-content.zobj.net/source/twitter/376/sparkles_2728.png" width="16"/> **Modern Techniques** | RoPE, RMSNorm, SwiGLU — same as production LLMs |
306
+ | 4 | <img src="https://em-content.zobj.net/source/twitter/376/weight-lifting_1f3cb-fe0f.png" width="16"/> **Training Pipeline** | Data loading, loss computation, gradient clipping, LR scheduling |
307
+ | 5 | <img src="https://em-content.zobj.net/source/twitter/376/speech-balloon_1f4ac.png" width="16"/> **Text Generation** | Autoregressive decoding, top-k, top-p, temperature sampling |
308
+ | 6 | <img src="https://em-content.zobj.net/source/twitter/376/package_1f4e6.png" width="16"/> **Deployment** | Model serialization, HuggingFace Hub integration |
309
+
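The text-generation row combines temperature scaling with top-k filtering, the same two knobs exposed as `temperature` and `top_k` in the notebook cells. The sketch below shows one sampling step over a toy logit vector in plain Python. It is not the repository's `generate` method: `sample_next` and `toy_logits` are hypothetical names, and top-p filtering is left out for brevity.

```python
import math
import random


def sample_next(logits, temperature=0.8, top_k=40, rng=random):
    """Sample a token id from the top_k highest logits after temperature scaling."""
    scaled = [x / temperature for x in logits]
    k = min(top_k, len(scaled))
    # Indices of the k largest scaled logits, best first.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Unnormalized softmax over the kept logits, shifted by the max for stability.
    m = scaled[top[0]]
    weights = [math.exp(scaled[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [0.1, 5.0, 4.0, -3.0]
rng = random.Random(0)  # seeded so repeated runs draw the same sequence
draws = [sample_next(toy_logits, top_k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With `top_k=2` only token ids 1 and 2 can ever be drawn, since every other logit is discarded before sampling. Lowering `temperature` concentrates the draws on the highest logit, and raising it flattens the distribution.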
310
+ ---
311
 
312
+ ## <img src="https://em-content.zobj.net/source/twitter/376/handshake_1f91d.png" width="24"/> Contributing
313
+
314
+ Contributions are welcome! Feel free to:
315
+ - Open issues for bugs or feature requests
316
+ - Submit pull requests with improvements
317
+ - Share your experiments and results
318
+
319
+ ---
320
+
321
+ ## <img src="https://em-content.zobj.net/source/twitter/376/bust-in-silhouette_1f464.png" width="24"/> Author
322
+
323
+ <div align="center">
324
+
325
+ Built with <img src="https://em-content.zobj.net/source/twitter/376/red-heart_2764-fe0f.png" width="16"/> by **Jekardah AI Lab** <img src="https://em-content.zobj.net/source/twitter/376/flag-indonesia_1f1ee-1f1e9.png" width="20"/>
326
+
327
+ </div>
328
+
329
+ ---
330
 
331
+ ## <img src="https://em-content.zobj.net/source/twitter/376/scroll_1f4dc.png" width="24"/> License
332
 
333
+ This project is licensed under the **MIT License** — see the [LICENSE](LICENSE) file for details.