Upload folder using huggingface_hub

Files changed (8)

.gitattributes +1 -0
README.md +118 -0
config.json +18 -0
layer_types.json +22 -0
loss_curves.png +3 -0
lr_schedule.png +0 -0
pytorch_model.bin +3 -0
training_config.json +19 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+loss_curves.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,118 @@

+---
+license: apache-2.0
+tags:
+  - gemma3
+  - language-model
+  - pre-training
+  - from-scratch
+  - tinystories
+  - transformer
+  - multi-query-attention
+  - sliding-window-attention
+  - rope
+language:
+  - en
+datasets:
+  - roneneldan/TinyStories
+metrics:
+  - perplexity
+pipeline_tag: text-generation
+---
+# Gemma 3 270M — Pre-trained from Scratch on TinyStories
+A custom implementation of the **Gemma 3 architecture** (scaled to 164.6M parameters), pre-trained from scratch on the TinyStories dataset.
+## 📊 Results
+| Metric | Value |
+|--------|-------|
+| **Best Val Loss** | 1.7845 |
+| **Perplexity** | 5.96 |
+| **Best Iteration** | 13,000 |
+| **Parameters** | 164.6M |
+![Training Loss Curves](loss_curves.png)
+## 🏗️ Architecture
+This model follows the **Gemma 3 architecture**, with the components listed below:
+| Component | Specification |
+|-----------|--------------|
+| Layers | 18 (15 sliding + 3 full attention) |
+| Embedding Dim | 640 |
+| Attention Heads | 4 (Multi-Query, 1 KV group) |
+| Head Dimension | 256 |
+| FFN Hidden | 2,048 (GeGLU activation) |
+| Context Length | 32,768 tokens (architectural maximum; training used 128-token blocks) |
+| Vocabulary | 50,257 (GPT-2 BPE) |
+### Key Features
+- **Sliding Window Attention** (w=512): O(n×w) instead of O(n²), roughly 64× cheaper at the full 32,768-token context
+- **Multi-Query Attention**: All 4 query heads share 1 K,V head, giving a 4× smaller KV cache than standard multi-head attention
+- **RoPE with Dual Bases**: 10K (local patterns) + 1M (long-range dependencies)
+- **QK Normalization**: RMSNorm on Q,K vectors before attention
+- **Gemma-style RMSNorm**: (1 + weight) scaling for stable initialization
+- **GeGLU Feed-Forward**: Gated GELU activation with 3.2× expansion
+### Layer Type Pattern
+```
+Layers 1-5:   Sliding Attention (local, base=10K)
+Layer 6:      Full Attention (global, base=1M)
+Layers 7-11:  Sliding Attention (local, base=10K)
+Layer 12:     Full Attention (global, base=1M)
+Layers 13-17: Sliding Attention (local, base=10K)
+Layer 18:     Full Attention (global, base=1M)
+```
+## 📖 Training
+- **Dataset**: TinyStories (2.1M stories, 471M tokens)
+- **Tokenizer**: GPT-2 BPE via tiktoken (50,257 vocab)
+- **Optimizer**: AdamW (β1=0.9, β2=0.95, ε=1e-9, weight_decay=0.1)
+- **Learning Rate**: 1e-4 → 5e-5 (cosine decay after a 1K-step warmup)
+- **Precision**: bfloat16 mixed precision
+- **Hardware**: NVIDIA A100 40GB (Google Colab Pro)
+- **Gradient Clipping**: max_norm=0.5
+## 💻 Usage
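+```python
+import json
+import tiktoken
+import torch
+# Gemma3Model is not included in this repo: define or import the model class first.
+# Passing the parsed config.json as a dict is an assumption about its constructor.
+with open("config.json") as f:
+    config = json.load(f)
+model = Gemma3Model(config)
+# Load the weights on CPU and switch to inference mode
+state_dict = torch.load("pytorch_model.bin", map_location="cpu")
+model.load_state_dict(state_dict)
+model.eval()
+# Tokenize with the GPT-2 BPE vocabulary used in training
+enc = tiktoken.get_encoding("gpt2")
+prompt = "Once upon a time"
+input_ids = torch.tensor([enc.encode_ordinary(prompt)])
+# Generate (assumes the model class provides a generate() method)
+with torch.no_grad():
+    output = model.generate(input_ids, max_new_tokens=200, temperature=0.7)
+print(enc.decode(output[0].tolist()))
+```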
+## 📝 Sample Outputs
+**Prompt**: "Once upon a time, there was a little cat named Mittens"
+**Temperature 0.7**: *Mittens was very hungry and wanted to eat some food. She went outside
+to find some grass to eat. Mittens saw a big tree and decided to climb it. She climbed up and
+up until she reached the top. As she was in the tree, she saw a small bird with a broken wing.
+Mittens knew just what to do. She took the bird to her mom and asked for help.*
+## 🙏 Credits
+- **Architecture Reference**: Vizuara Team - Raj ([Tutorial](https://youtu.be/bLDlwcl6hbA))
+- **Dataset**: TinyStories by Ronen Eldan & Yuanzhi Li
+- **Tokenizer**: OpenAI tiktoken (GPT-2 BPE)
+## 📄 License
+Apache 2.0
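
The savings quoted under Key Features in the README can be recomputed from the values in config.json. The following is a minimal sketch, not part of the commit; the variable names are illustrative, and the 64× figure assumes a sequence as long as the full 32,768-token context (shorter sequences save less).

```python
# Values copied from config.json in this commit.
context_length = 32768
sliding_window = 512
n_heads = 4
n_kv_groups = 1
emb_dim = 640
hidden_dim = 2048

# Each query attends to at most `sliding_window` keys instead of up to
# `context_length` keys, so attention scores shrink by about this factor
# at full length.
attention_ratio = context_length / sliding_window

# Multi-query attention caches one K/V head instead of one per query head.
kv_cache_ratio = n_heads / n_kv_groups

# Width of the GeGLU hidden layer relative to the embedding width.
ffn_expansion = hidden_dim / emb_dim

print(attention_ratio, kv_cache_ratio, ffn_expansion)  # 64.0 4.0 3.2
```

Only the 15 sliding-window layers get the attention saving; the 3 full-attention layers still scale quadratically with sequence length.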

config.json ADDED

	@@ -0,0 +1,18 @@

+{
+  "architecture": "Gemma3Custom",
+  "vocab_size": 50257,
+  "context_length": 32768,
+  "emb_dim": 640,
+  "n_layers": 18,
+  "n_heads": 4,
+  "head_dim": 256,
+  "hidden_dim": 2048,
+  "n_kv_groups": 1,
+  "qk_norm": true,
+  "query_pre_attn_scalar": 256,
+  "rope_base": 1000000.0,
+  "rope_local_base": 10000.0,
+  "sliding_window": 512,
+  "dtype": "bfloat16",
+  "total_parameters": 164631936
+}
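
The `total_parameters` value can be cross-checked against the other fields. The commit does not include the model class, so the layout below is an assumption: no bias terms, untied input and output embeddings, two RMSNorm weights per block, per-head Q and K norms, and one final norm. It is one decomposition that happens to reproduce the reported total, not a confirmed description of the implementation.

```python
# Values copied from config.json.
vocab_size, emb_dim, n_layers = 50257, 640, 18
n_heads, head_dim, hidden_dim, n_kv_groups = 4, 256, 2048, 1

# ASSUMPTIONS (not stated in this commit): no biases, untied embeddings,
# two RMSNorm weights of size emb_dim per block, one RMSNorm of size
# head_dim each for Q and K, and a final RMSNorm of size emb_dim.
attn = (
    emb_dim * n_heads * head_dim            # query projection
    + 2 * emb_dim * n_kv_groups * head_dim  # key and value projections
    + n_heads * head_dim * emb_dim          # output projection
)
ffn = 3 * emb_dim * hidden_dim              # gate, up and down projections
norms = 2 * emb_dim + 2 * head_dim          # block norms plus Q/K norms
per_block = attn + ffn + norms

embeddings = 2 * vocab_size * emb_dim       # input table plus output head
total = embeddings + n_layers * per_block + emb_dim

print(f"{total:,}")  # 164,631,936
```

Under these assumptions the two embedding matrices account for about 64.3M parameters and the 18 blocks for about 100.3M.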

layer_types.json ADDED

	@@ -0,0 +1,22 @@

+{
+  "layer_types": [
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention"
+  ]
+}
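
The list above repeats a block of five sliding-attention layers followed by one full-attention layer. As a sketch, it can be generated from a single interval and paired with the two RoPE bases from config.json; `full_every` is a hypothetical name inferred from the pattern, not a key in any file of this commit.

```python
n_layers = 18
full_every = 6  # assumption: inferred from the list, not a config key

# Every sixth layer (1-based) uses full attention; the rest use a sliding window.
layer_types = [
    "full_attention" if (i + 1) % full_every == 0 else "sliding_attention"
    for i in range(n_layers)
]

# RoPE base per layer type: rope_base and rope_local_base from config.json.
rope_base_by_type = {"full_attention": 1_000_000.0, "sliding_attention": 10_000.0}
bases = [rope_base_by_type[t] for t in layer_types]

print(layer_types.count("sliding_attention"), layer_types.count("full_attention"))  # 15 3
```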

loss_curves.png ADDED

Git LFS Details

SHA256: d316ec76582a2a1da379c9d24a174cb2eabb7b602333a49e237f09aea84b77b2
Pointer size: 131 Bytes
Size of remote file: 103 kB

lr_schedule.png ADDED
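
The image is not rendered here, but the schedule it plots can be sketched from training_config.json and the README ("cosine decay with 1K step warmup"). The function below assumes a linear warmup and a cosine decay that reaches `min_lr` at `max_iters`; the training script is not part of this commit, so the exact shape may differ, and `get_lr` is an illustrative name.

```python
import math

# Values from training_config.json.
learning_rate, min_lr = 1e-4, 5e-5
warmup_steps, max_iters = 1000, 60000


def get_lr(step: int) -> float:
    """Linear warmup to learning_rate, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return learning_rate * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (max_iters - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (learning_rate - min_lr)


for step in (0, 999, 13000, 60000):
    print(step, get_lr(step))
```

With this shape the rate peaks at 1e-4 when warmup ends and falls to 5e-5 at step 60,000.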

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f93bc6fbab3eb74705e5a412b0b44eb031fdd1af4203bb1f6fbd8f92878c5f2c
+size 329341959

training_config.json ADDED

	@@ -0,0 +1,19 @@

+{
+  "max_iters": 60000,
+  "batch_size": 32,
+  "block_size": 128,
+  "gradient_accumulation_steps": 4,
+  "learning_rate": 0.0001,
+  "min_lr": 5e-05,
+  "warmup_steps": 1000,
+  "beta1": 0.9,
+  "beta2": 0.95,
+  "weight_decay": 0.1,
+  "gradient_clip_norm": 0.5,
+  "best_val_loss": 1.7845207452774048,
+  "best_iteration": 13000,
+  "perplexity": 5.96,
+  "dataset": "roneneldan/TinyStories",
+  "tokenizer": "gpt2 (tiktoken)",
+  "hardware": "NVIDIA A100 40GB"
+}
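
Two figures in this file are linked by simple arithmetic. The sketch below assumes `best_val_loss` is a mean cross-entropy in nats per token, so that perplexity is its exponential, and that one optimizer update consumes `batch_size * gradient_accumulation_steps` sequences of `block_size` tokens. Whether an "iteration" counts micro-batches or optimizer updates is not stated in the commit.

```python
import math

# Values from training_config.json.
best_val_loss = 1.7845207452774048
batch_size, block_size, grad_accum = 32, 128, 4

# Perplexity is exp(mean cross-entropy) when the loss is measured in nats.
perplexity = math.exp(best_val_loss)
print(round(perplexity, 2))  # 5.96

# Tokens processed per optimizer update, under the assumption above.
tokens_per_update = batch_size * block_size * grad_accum
print(tokens_per_update)  # 16384
```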