---
license: apache-2.0
language:
- en
- code
library_name: pytorch
pipeline_tag: text-generation
tags:
- text-generation
- transformer
- gpt
- llama
- 1b
- pretraining
- from-scratch
datasets:
- HuggingFaceTB/smollm-corpus
- bigcode/starcoderdata
---

# auto-g-nano-1b

A from-scratch ~**1.05B** parameter decoder-only Transformer, pre-trained on a
mixed web + synthetic + code corpus. This is a research / educational model
trained on a deliberately small token budget (~13B tokens, roughly 60% of
Chinchilla-optimal for 1B params), so it should be evaluated as a "hello
world" 1B base model, not a production assistant.

Source code: https://github.com/geoffsee/auto-g-nano (branch
`claude/build-billion-param-model-cOPdo`).

## Architecture

Llama-style deeper-narrower decoder. RMSNorm + RoPE + Grouped-Query Attention
+ SwiGLU FFN. Untied embeddings.

| field           | value |
|-----------------|-------|
| total params    | **1,050,002,688 (1.050B)** |
| layers          | 24 |
| embedding dim   | 1792 |
| query heads     | 14 |
| KV heads (GQA)  | 2 (7× key/value sharing) |
| head dim        | 128 |
| FFN hidden      | 5376 (3×d, SwiGLU) |
| context length  | 1024 |
| vocab           | 50,257 (tiktoken `gpt2` BPE) |
| RoPE θ          | 500,000 |
| precision       | bf16 (training) |
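
The total in the table can be reproduced from the other rows. The sketch below assumes bias-free linear layers, RMSNorm layers that carry only a weight vector, two norms per block plus one final norm, and no learned parameters for RoPE; the function and argument names are illustrative and are not taken from the source repo.

```python
def count_params(vocab=50257, d=1792, n_layer=24, n_head=14,
                 n_kv_head=2, head_dim=128, ffn=5376):
    """Parameter count for a Llama-style decoder under the assumptions above."""
    embed = vocab * d                      # token embedding
    lm_head = vocab * d                    # untied output projection
    q_proj = d * n_head * head_dim         # query projection
    kv_proj = 2 * d * n_kv_head * head_dim # key + value projections (GQA)
    o_proj = n_head * head_dim * d         # attention output projection
    mlp = 3 * d * ffn                      # SwiGLU: gate, up, and down matrices
    norms = 2 * d                          # two RMSNorm weight vectors per block
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    final_norm = d
    return embed + lm_head + n_layer * per_layer + final_norm


total = count_params()
print(f"{total:,}")  # 1,050,002,688
```

About 180M of the total (17%) sits in the two untied embedding matrices; the 24 blocks account for roughly 870M.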

## Training data

A 3-way interleaved stream (HF `interleave_datasets` with weights):

| weight | source                                              |
|-------:|-----------------------------------------------------|
| 0.40   | `HuggingFaceTB/smollm-corpus` / `fineweb-edu-dedup` |
| 0.25   | `HuggingFaceTB/smollm-corpus` / `cosmopedia-v2`     |
| 0.35   | `bigcode/starcoderdata` / `python` (gated)          |

The code share is intentionally aggressive (~35%) compared to SmolLM's
natural ratio (~2%), so a 1B model trained on a small token budget actually
picks up code patterns instead of treating them as noise.
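
A minimal sketch of how such a weighted stream could be built with the `datasets` library. It is not the source repo's data loader: the `MIXTURE` and `build_stream` names, the seed, and the choice to align every source on a single `text` column are assumptions made for illustration, as is the column layout of each dataset.

```python
# Mixture weights copied from the table above.
MIXTURE = {
    "fineweb-edu-dedup": 0.40,
    "cosmopedia-v2": 0.25,
    "starcoderdata-python": 0.35,
}


def build_stream(seed: int = 0):
    """Return one streaming dataset that samples the three sources by weight."""
    # Imported lazily so the weights can be inspected without `datasets` installed.
    from datasets import interleave_datasets, load_dataset

    web = load_dataset("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup",
                       split="train", streaming=True)
    synth = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2",
                         split="train", streaming=True)
    # Gated dataset: accept the terms on the Hub and log in before streaming it.
    code = load_dataset("bigcode/starcoderdata", data_dir="python",
                        split="train", streaming=True)

    # Assumption: prose lives in "text" and code in "content"; keep one shared column.
    web = web.select_columns(["text"])
    synth = synth.select_columns(["text"])
    code = code.rename_column("content", "text").select_columns(["text"])

    # Probabilities follow the insertion order of MIXTURE: web, synthetic, code.
    return interleave_datasets(
        [web, synth, code],
        probabilities=list(MIXTURE.values()),
        seed=seed,
    )
```

Note that the weights apply per sampled document, so the token-level share of each source will differ if average document lengths differ.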

## Variants in this repo

| file                       | size  | notes |
|----------------------------|------:|-------|
| `model_latest.pt`          | 4.2 GB | base, fp32 (original training output) |
| `model_bf16.pt`            | 2.0 GB | base, bf16 PyTorch |
| `model.safetensors`        | 2.0 GB | base, bf16 safetensors (recommended for inference) |
| `model.onnx` + `.onnx.data`| 2.0 GB | base, fp16 ONNX (CPU/MPS-friendly) |
| **`model_sft.safetensors`** | **2.0 GB** | **SFT'd on `databricks/databricks-dolly-15k`** — instruction-following |

### SFT variant

Brief supervised fine-tuning on the [Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
instruction dataset (15,011 instruction/response pairs, Alpaca-style prompt template,
loss masked to response tokens only). 3 epochs on 1× RTX PRO 6000 Blackwell Workstation
Edition, AdamW with peak LR 5e-5 + cosine decay, ~17 min wall-clock.

Use **Alpaca-style prompts** with this checkpoint:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is photosynthesis?

### Response:
```

The base model produces incoherent output on instructions; the SFT variant *attempts*
to answer them. It still has the underlying limitations of a 13B-token-pretrained 1B
model — broken grammar, repetition, factual errors — but it stays on topic and follows
the format. See the source repo's `scripts/chat.py --format alpaca` for a REPL.
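
The prompt template and the response-only loss masking described above can be sketched as follows. This is an illustration, not the repo's SFT code: the helper names are hypothetical, the trailing newline after `### Response:` is an assumption, and `-100` is used because it is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def build_prompt(instruction: str) -> str:
    """Wrap a bare instruction in the Alpaca-style template."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())


def mask_prompt_labels(prompt_ids, response_ids):
    """Concatenate prompt and response tokens; only response tokens get real labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


prompt = build_prompt("What is photosynthesis?")
# Toy token ids stand in for real tokenizer output.
input_ids, labels = mask_prompt_labels([11, 12, 13], [21, 22])
print(labels)  # [-100, -100, -100, 21, 22]
```

The usual one-position shift between inputs and targets for next-token prediction is left to the training loop.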

## Training procedure

| field            | value |
|------------------|-------|
| hardware         | 2× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (PCIe, no NVLink) on RunPod |
| framework        | PyTorch 2.9 + HF `accelerate launch`, bf16 mixed precision |
| optimizer        | AdamW, β=(0.9, 0.95), wd=0.1 |
| LR schedule      | linear warmup → cosine decay, peak ≈ 3e-4 |
| per-proc batch   | 8 |
| grad accumulation| 16 |
| global tokens / step | 262,144 (8 × 16 × 1024 × 2 GPUs) |
| total iters      | 50,000 |
| total tokens     | ~13.1B |
| wall-clock       | ~73 hours |
| NCCL             | `NCCL_P2P_DISABLE=1` (PCIe-only Blackwell) |
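
The token budget follows directly from the table. The short calculation below reproduces it; the 20-tokens-per-parameter figure is the common Chinchilla rule of thumb and is an assumption here, and the throughput is only as precise as the "~73 hours" estimate.

```python
per_proc_batch = 8      # sequences per GPU per micro-step
grad_accum = 16         # micro-steps per optimizer step
block_size = 1024       # tokens per sequence
n_gpus = 2
total_iters = 50_000
n_params = 1_050_002_688

tokens_per_step = per_proc_batch * grad_accum * block_size * n_gpus
total_tokens = tokens_per_step * total_iters

# Assumption: Chinchilla-optimal is roughly 20 training tokens per parameter.
chinchilla_tokens = 20 * n_params
fraction_of_optimal = total_tokens / chinchilla_tokens

# Approximate average throughput over the ~73 hour run.
tokens_per_second = total_tokens / (73 * 3600)

print(tokens_per_step)                 # 262144
print(total_tokens)                    # 13107200000
print(round(fraction_of_optimal, 2))   # 0.62
```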

## How to load

`model_latest.pt` is a plain `state_dict` saved with `torch.save`, so it
needs the `GPT` model class from the source repo to reconstruct the model.

```python
import torch
from huggingface_hub import hf_hub_download
from model import GPT  # model definition from the source repo (must be importable)

# Download the fp32 base checkpoint (about 4.2 GB) and load the raw state_dict.
ckpt_path = hf_hub_download("geoffsee/auto-g-nano-1b", "model_latest.pt")
sd = torch.load(ckpt_path, map_location="cpu", weights_only=True)

# Hyperparameters must match the Architecture table above.
model = GPT(
    vocab_size=50257, n_embd=1792, n_layer=24, n_head=14, n_kv_head=2,
    ffn_hidden=5376, block_size=1024, dropout=0.0,
)
model.load_state_dict(sd)
model.eval()  # switch to inference mode
```

Or just use the source repo's `generate.py` / `scripts/test_inference.py`.

## Sample generations

Top-k sampling, T=0.8, top_k=40, 120 new tokens (prompts in bold):

> **Once upon a time, in a small village near the mountains,** lived two best
> friends named Timmy the Turtle and Sally the Squirrel. They loved exploring
> the forest together and learning new things! One sunny day, while walking
> through the forest, they stumbled upon a magical garden filled with colorful
> flowers. Timmy was excited to try something new and was curious about how
> his plants adapted to different types of rocks and soil…

> **def fibonacci(n):**
> ```python
>     '''
>     Return the number of times n can be Fibonacci numbers with a
>     given number of factors.
>     '''
>     return int(n/2)
> ```

> **Question: Why is the sky blue?**
> **Answer:** Blue is an invisible colour which makes the sky look blue. The
> human eye makes very little light when it is in the middle of the spectrum.
> Some parts of the spectrum are red while others are yellow…

The 12-prompt smoke test (6 prompts × 2 temperatures) scores **9 OK / 3 WEAK /
0 FAIL** under the source repo's heuristic verdict checks.

## Limitations

- **Undertrained.** ~13B tokens is well below the Chinchilla-optimal ~21B
  for a 1B model, and orders of magnitude below modern practice
  (Llama-3, Mistral, and SmolLM-1.7B are all reported to train on 1T+ tokens).
- **Greedy / low-temperature sampling produces repetition loops** — a
  classic undertrained-model failure. Use T ≥ 0.7 with top_k.
- **Hallucinates confidently.** Will invent technical-sounding terms
  ("phototenoids") in factual contexts.
- **No instruction tuning in the base checkpoints.** The base model doesn't
  follow instructions, refuse harmful requests, or hold a chat. Only
  `model_sft.safetensors` has had (brief) supervised fine-tuning.
- **Code generation is shallow.** It produces syntactically valid Python but
  the semantics are often wrong.
- **English + Python only.** Other natural languages and other programming
  languages are out-of-distribution.
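
The "T ≥ 0.7 with top_k" recommendation above can be illustrated with a small, dependency-free sampler. This is a sketch of temperature plus top-k sampling over a plain list of logits, not the source repo's `generate.py`; the function name and the `rng` parameter are hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token id from the top_k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    k = min(top_k, len(logits))
    # Indices of the k largest logits; everything else gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy 4-token vocabulary: with top_k=2 only ids 1 and 3 can ever be drawn.
toy_logits = [0.1, 2.0, -1.0, 1.5]
token_id = sample_top_k(toy_logits, temperature=0.8, top_k=2, rng=random.Random(0))
```

Lower temperatures concentrate probability on the largest logit, which is the regime where the repetition loops described above appear.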

## Intended use

Research, educational reference for "what does a 1B Transformer trained
from scratch on a modest budget actually look like", and as a starting
point for fine-tuning experiments. **Not** suitable for any production
or user-facing application.

## License

Apache-2.0 for the weights and code. Data subsets each retain their own
licenses — see `HuggingFaceTB/smollm-corpus` and `bigcode/starcoderdata`
on the Hugging Face Hub.