--- license: apache-2.0 language: - en - code library_name: pytorch pipeline_tag: text-generation tags: - text-generation - transformer - gpt - llama - 1b - pretraining - from-scratch datasets: - HuggingFaceTB/smollm-corpus - bigcode/starcoderdata --- # auto-g-nano-1b A from-scratch ~**1.05B** parameter decoder-only Transformer, pre-trained on a mixed web + synthetic + code corpus. This is a research / educational model trained on a deliberately small token budget (~13B tokens, roughly 60% of Chinchilla-optimal for 1B params), so it should be evaluated as a "hello world" 1B base model, not a production assistant. Source code: https://github.com/geoffsee/auto-g-nano (branch `claude/build-billion-param-model-cOPdo`). ## Architecture Llama-style deeper-narrower decoder. RMSNorm + RoPE + Grouped-Query Attention + SwiGLU FFN. Untied embeddings. | field | value | |-----------------|-------| | total params | **1,050,002,688 (1.050B)** | | layers | 24 | | embedding dim | 1792 | | query heads | 14 | | KV heads (GQA) | 2 (7× key/value sharing) | | head dim | 128 | | FFN hidden | 5376 (3×d, SwiGLU) | | context length | 1024 | | vocab | 50,257 (tiktoken `gpt2` BPE) | | RoPE θ | 500,000 | | precision | bf16 (training) | ## Training data A 3-way interleaved stream (HF `interleave_datasets` with weights): | weight | source | |-------:|-----------------------------------------------------| | 0.40 | `HuggingFaceTB/smollm-corpus` / `fineweb-edu-dedup` | | 0.25 | `HuggingFaceTB/smollm-corpus` / `cosmopedia-v2` | | 0.35 | `bigcode/starcoderdata` / `python` (gated) | The code share is intentionally aggressive (~35%) compared to SmolLM's natural ratio (~2%), so a 1B model trained on a small token budget actually picks up code patterns instead of treating them as noise. ## Variants in this repo | file | size | notes | |----------------------------|------:|-------| | `model_latest.pt` | 4.2 GB | base, fp32 (original training output) | | `model_bf16.pt` | 2.0 GB | base, bf16 PyTorch | | `model.safetensors` | 2.0 GB | base, bf16 safetensors (recommended for inference) | | `model.onnx` + `.onnx.data`| 2.0 GB | base, fp16 ONNX (CPU/MPS-friendly) | | **`model_sft.safetensors`** | **2.0 GB** | **SFT'd on `databricks/databricks-dolly-15k`** — instruction-following | ### SFT variant Brief supervised fine-tuning on the [Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) instruction dataset (15,011 instruction/response pairs, Alpaca-style prompt template, loss masked to response tokens only). 3 epochs on 1× RTX PRO 6000 Blackwell Workstation Edition, AdamW with peak LR 5e-5 + cosine decay, ~17 min wall-clock. Use **Alpaca-style prompts** with this checkpoint: ``` Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What is photosynthesis? ### Response: ``` The base model produces incoherent output on instructions; the SFT variant *attempts* to answer them. It still has the underlying limitations of a 13B-token-pretrained 1B model — broken grammar, repetition, factual errors — but it stays on topic and follows the format. See the source repo's `scripts/chat.py --format alpaca` for a REPL. ## Training procedure | field | value | |------------------|-------| | hardware | 2× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (PCIe, no NVLink) on RunPod | | framework | PyTorch 2.9 + HF `accelerate launch`, bf16 mixed precision | | optimizer | AdamW, β=(0.9, 0.95), wd=0.1 | | LR schedule | linear warmup → cosine decay, peak ≈ 3e-4 | | per-proc batch | 8 | | grad accumulation| 16 | | global tokens / step | 262,144 (8 × 16 × 1024 × 2 GPUs) | | total iters | 50,000 | | total tokens | ~13.1B | | wall-clock | ~73 hours | | NCCL | `NCCL_P2P_DISABLE=1` (PCIe-only Blackwell) | ## How to load The published checkpoint is a `state_dict` saved with `torch.save` — it needs the `GPT` model class from the source repo to reconstruct. ```python import torch from huggingface_hub import hf_hub_download from model import GPT # from the source repo ckpt_path = hf_hub_download("geoffsee/auto-g-nano-1b", "model_latest.pt") sd = torch.load(ckpt_path, map_location="cpu", weights_only=True) model = GPT( vocab_size=50257, n_embd=1792, n_layer=24, n_head=14, n_kv_head=2, ffn_hidden=5376, block_size=1024, dropout=0.0, ) model.load_state_dict(sd) model.eval() ``` Or just use the source repo's `generate.py` / `scripts/test_inference.py`. ## Sample generations Greedy / top-k sampling, T=0.8, top_k=40, 120 new tokens: > **Once upon a time, in a small village near the mountains,** lived two best > friends named Timmy the Turtle and Sally the Squirrel. They loved exploring > the forest together and learning new things! One sunny day, while walking > through the forest, they stumbled upon a magical garden filled with colorful > flowers. Timmy was excited to try something new and was curious about how > his plants adapted to different types of rocks and soil… > **def fibonacci(n):** > ```python > ''' > Return the number of times n can be Fibonacci numbers with a > given number of factors. > ''' > return int(n/2) > ``` > **Question: Why is the sky blue?** > **Answer:** Blue is an invisible colour which makes the sky look blue. The > human eye makes very little light when it is in the middle of the spectrum. > Some parts of the spectrum are red while others are yellow… The 12-prompt smoke test (6 prompts × 2 temperatures) scores **9 OK / 3 WEAK / 0 FAIL** under the source repo's heuristic verdict checks. ## Limitations - **Undertrained.** ~13B tokens is well below the Chinchilla-optimal ~21B for a 1B model, and orders of magnitude below modern best practice (Llama-3 / Mistral / SmolLM-1.7B all use 1T+). - **Greedy / low-temperature sampling produces repetition loops** — a classic undertrained-model failure. Use T ≥ 0.7 with top_k. - **Hallucinates confidently.** Will invent technical-sounding terms ("phototenoids") in factual contexts. - **No instruction tuning.** This is a base model only; it doesn't follow instructions, refuse harmful requests, or hold a chat. - **Code generation is shallow.** It produces syntactically valid Python but the semantics are often wrong. - **English + Python only.** Other natural languages and other programming languages are out-of-distribution. ## Intended use Research, educational reference for "what does a 1B Transformer trained from scratch on a modest budget actually look like", and as a starting point for fine-tuning experiments. **Not** suitable for any production or user-facing application. ## License Apache-2.0 for the weights and code. Data subsets each retain their own licenses — see `HuggingFaceTB/smollm-corpus` and `bigcode/starcoderdata` on the Hugging Face Hub.