metadata
language: id
tags:
  - text-generation
  - transformer
  - bahasa-indonesia
  - indonesian
  - slm
  - from-scratch
  - kbbi
license: mit
pipeline_tag: text-generation

SLM Bahasa Indonesia 🇮🇩

A Small Language Model built entirely from scratch in PyTorch and trained on KBBI (Kamus Besar Bahasa Indonesia, the official dictionary of the Indonesian language).

Overview

This is a decoder-only Transformer (GPT-style) built from the ground up, demonstrating the full pipeline: custom tokenizer → model architecture → training → inference.

Architecture

| Component | Detail |
|---|---|
| Type | Decoder-only Transformer |
| Parameters | 840K (~3.5 MB) |
| Embedding dim | 128 |
| Layers | 2 |
| Attention heads | 4 |
| FFN dim | 256 |
| Context length | 64 tokens |
| Vocab size | 4,000 (BPE, KBBI-trained) |
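
As a sanity check, the 840K figure can be reproduced from the other numbers above. The sketch below is an estimate under stated assumptions, not the author's accounting: it assumes no bias terms, a SwiGLU feed-forward block with three weight matrices, two RMSNorm gain vectors per layer plus a final one, and an output head that reuses the embedding matrix. RoPE adds no learned parameters.

```python
# Rough parameter count for the configuration listed above.
# Assumptions (not confirmed by the source): no bias terms, SwiGLU FFN with
# three matrices, two RMSNorm gains per layer, one final RMSNorm, tied head.
vocab_size, d_model, n_layers, d_ffn = 4000, 128, 2, 256

embedding = vocab_size * d_model      # shared with the output head (weight tying)
attention = 4 * d_model * d_model     # Q, K, V and output projections
ffn = 3 * d_model * d_ffn             # gate, up and down projections
norms = 2 * d_model                   # two RMSNorm gain vectors per layer
per_layer = attention + ffn + norms

total = embedding + n_layers * per_layer + d_model  # + final RMSNorm gain
size_mb = total * 4 / 1e6             # float32 uses 4 bytes per parameter

print(f"{total:,} parameters, about {size_mb:.2f} MB in float32")
```

Under these assumptions the count comes to 840,320, which matches the stated 840K. The raw float32 weights come to roughly 3.4 MB, slightly under the stated ~3.5 MB. Note that the 4,000 x 128 embedding accounts for more than half of all parameters.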

Modern Techniques Used

  • RoPE (Rotary Position Embedding): as used in LLaMA and Qwen
  • RMSNorm: cheaper than LayerNorm because it skips mean subtraction and the bias term
  • SwiGLU activation: as used in LLaMA and Mistral
  • Weight tying: embedding weights shared with the output head
  • Cosine LR schedule with warmup
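
For readers unfamiliar with two of the items in this list, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block. It is illustrative only: the class names, the `eps` value, and the bias-free linear layers are assumptions, and the repository's `model.py` may be organized differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Scale by the root mean square of the features; no mean-centering, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class SwiGLU(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


x = torch.randn(1, 64, 128)               # (batch, context length, embedding dim)
normed = RMSNorm(128)(x)
y = SwiGLU(128, 256)(normed)
print(y.shape)
```

The shapes use the embedding dim (128), FFN dim (256), and context length (64) from the architecture figures, so the output keeps the input shape.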

Quick Start

```python
import torch
from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Load the model and tokenizer from the current directory
model = SmallLM.from_pretrained("./")
tokenizer = BPETokenizer.from_pretrained("./")

# Generate text
ids = tokenizer.encode("indonesia adalah")  # prompt: "indonesia is"
input_ids = torch.tensor([ids])  # add a batch dimension
output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))
```
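
The `temperature=0.8` argument above controls how sharply the next-token distribution is peaked. How `SmallLM.generate` applies it internally is not shown here, so the following is a generic sketch of temperature sampling in plain Python; the function name and the example logits are hypothetical.

```python
import math
import random


def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Sample an index from softmax(logits / temperature); also return the probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                   # subtract the max for numerical stability
    weights = [math.exp(z - m) for z in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


logits = [2.0, 1.0, 0.1]                              # made-up scores for three tokens
_, sharp = sample_with_temperature(logits, temperature=0.5)
_, flat = sample_with_temperature(logits, temperature=2.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and make the output more varied.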

Training Details

  • Data: KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian text corpus
  • Tokenizer: Custom BPE trained on KBBI (4,000 vocab)
  • Optimizer: AdamW (lr=1e-3, weight_decay=0.1)
  • Training: Next-token prediction (causal language modeling)
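
The cosine schedule with warmup mentioned under Modern Techniques can be sketched as a plain function of the step count. Only the peak learning rate (1e-3) comes from this README; `warmup_steps`, `total_steps`, and `min_lr` below are placeholder values, and `train.py` may implement the schedule differently.

```python
import math


def lr_at(step, max_lr=1e-3, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    warmup_steps, total_steps and min_lr are illustrative assumptions.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)                     # hold at min_lr past the end
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 99, 100, 550, 1000):
    print(step, f"{lr_at(step):.2e}")
```

With these placeholder values the rate climbs linearly to 1e-3 over the first 100 steps, is halfway down at step 550, and reaches zero at step 1000.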

Limitations

This is a proof-of-concept / educational model:

  • Only 840K parameters: it can continue sentences but doesn't "understand" them
  • Trained on limited data โ€” outputs may be incoherent
  • Not suitable for production use
  • Value is in the architecture and pipeline, not output quality

Files

| File | Description |
|---|---|
| model.py | Transformer architecture (from scratch) |
| model.safetensors | Trained weights |
| config.json | Model configuration |
| bpe_tokenizer.py | Custom BPE tokenizer code |
| vocab.json | Tokenizer vocabulary |
| merges.txt | BPE merge rules |
| generate.py | Text generation script |
| train.py | Training script |

What This Demonstrates

Building this project from scratch shows understanding of:

  1. Tokenization: BPE algorithm, subword encoding
  2. Transformer architecture: attention, FFN, normalization
  3. Modern techniques: RoPE, RMSNorm, SwiGLU
  4. Training pipeline: data loading, loss computation, optimization
  5. Text generation: autoregressive decoding, sampling strategies
  6. Model deployment: saving, loading, HuggingFace compatibility
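
To make the first item concrete, here is a toy version of BPE merge learning in plain Python. It is a textbook-style sketch, not the code in `bpe_tokenizer.py`: the function name, the character-level starting symbols, and the four-word sample corpus are all assumptions chosen for illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words; returns the ordered list of merged pairs."""
    # Each word starts as a tuple of characters; the Counter tracks word frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])   # fuse the pair into one symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges


merges = learn_bpe(["makan", "makanan", "minum", "minuman"], num_merges=3)
print(merges)
```

On this tiny corpus the first merge is `('a', 'n')`, the most frequent adjacent pair. A real tokenizer repeats the same loop until the vocabulary reaches its target size (4,000 here) and writes the ordered merges to a file such as `merges.txt`.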

Author

Built by Jekardah AI Lab 🇮🇩

License

MIT License