SVEN-10M

A ~11.5M-parameter language model trained from scratch on real data for about $1.

SVEN-10M is the first checkpoint in the SVEN model family. Everything is built from scratch: custom tokenizer, custom architecture, custom training loop. No fine-tuning. No LoRA. It was trained on real public datasets on a single RTX 3090 GPU.

Model Details


Property	Value
Architecture	Decoder-only transformer (LLaMA-style)
Parameters	11,537,664 (~11.5M)
Context length	256 tokens
Vocabulary	32,000 (BPE, trained on this corpus)
Layers	6
Hidden size	256
Attention heads	8 (2 KV heads, GQA)
Activation	SwiGLU
Positional encoding	RoPE
Normalization	RMSNorm
Training steps	1,000
Training loss	10.41 → 6.90
Precision	bfloat16
GPU	1× NVIDIA RTX 3090 (24GB)
Training cost	~$1
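
The parameter total in the table can be reproduced from the listed dimensions. The sketch below is a back-of-the-envelope count, not the actual model code. The feed-forward width `ffn_dim=512` is an assumption (this card does not state it), chosen because it reproduces the listed total when combined with tied embeddings and bias-free linear layers.

```python
def count_params(vocab=32_000, d_model=256, n_layers=6,
                 n_heads=8, n_kv_heads=2, ffn_dim=512):
    """Count parameters of a LLaMA-style decoder with tied embeddings and no biases.

    ffn_dim is an assumption: it is not listed in the model card.
    """
    head_dim = d_model // n_heads                         # 32
    kv_dim = n_kv_heads * head_dim                        # 64
    embed = vocab * d_model                               # shared with the output projection
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim   # Wq and Wo, plus Wk and Wv
    mlp = 3 * d_model * ffn_dim                           # SwiGLU: gate, up, down
    norms = 2 * d_model                                   # two RMSNorm weight vectors per layer
    final_norm = d_model
    return embed + n_layers * (attn + mlp + norms) + final_norm


total = count_params()
print(f"{total:,}")  # 11,537,664
```

Under these assumptions, the embedding table alone accounts for 8,192,000 parameters, about 71% of the total.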

Training Data

Trained on a curated English-only mix of 116M tokens from 5 public sources:

Source	Documents	Content
FineWeb-Edu	49,993	High-quality educational web text
Wikipedia (EN)	19,981	English Wikipedia articles
OpenWebMath	14,918	Mathematical reasoning and problems
Python code	18,390	Python programming instructions
JavaScript code	15,000	JavaScript code samples
Total	118,282	116M tokens

All data was filtered for English (ASCII ratio plus a common-word check), quality-filtered, and deduplicated before training.
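
As a rough illustration of this kind of filtering, here is a minimal sketch of an ASCII-ratio check, a common-word check, and exact-match deduplication. The word list, both thresholds, and the function names are hypothetical; the actual values and the quality filter used for SVEN-10M are not documented here.

```python
import hashlib
import re

# Hypothetical word list and thresholds; the real ones are not given in this card.
COMMON_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "it"}


def looks_english(text, min_ascii_ratio=0.95, min_common_ratio=0.05):
    """Return True if text passes an ASCII-ratio check and a common-word check."""
    if not text:
        return False
    ascii_ratio = sum(ch.isascii() for ch in text) / len(text)
    if ascii_ratio < min_ascii_ratio:
        return False
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    common_ratio = sum(w in COMMON_WORDS for w in words) / len(words)
    return common_ratio >= min_common_ratio


def dedupe(docs):
    """Drop exact duplicates (after whitespace normalization), keeping first occurrences."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


docs = ["The cat sat on the mat and it is happy.", "これは日本語の文章です。",
        "The cat sat on  the mat and it is happy."]
clean = dedupe([d for d in docs if looks_english(d)])
print(len(clean))  # 1
```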

Tokenizer: a custom BPE tokenizer with a 32,000-token vocabulary, trained on this exact corpus using SentencePiece.
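
A tokenizer of this kind can be trained with the SentencePiece Python package roughly as follows. This is a sketch, not the exact command used for SVEN-10M: the function name, the `corpus.txt` path, and the `tokenizer` prefix are placeholders, and any trainer options beyond vocabulary size and model type are left at their defaults.

```python
def train_tokenizer(corpus_path, model_prefix="tokenizer", vocab_size=32_000):
    """Train a BPE tokenizer with SentencePiece.

    Writes <model_prefix>.model and <model_prefix>.vocab and returns both paths.
    """
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.train(
        input=corpus_path,        # plain-text file, one sentence or document per line
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
    )
    return f"{model_prefix}.model", f"{model_prefix}.vocab"


# Usage (placeholder path):
# model_path, vocab_path = train_tokenizer("corpus.txt")
```

The resulting `.model` file can then be loaded with `spm.SentencePieceProcessor(model_file=...)` and used via its `encode` and `decode` methods.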

Architecture Notes

SVEN-10M uses a modern LLaMA-style architecture rather than the original GPT-2 design:

RoPE instead of learned positional embeddings
RMSNorm instead of LayerNorm (faster, no mean subtraction)
SwiGLU instead of a GELU MLP (gated feed-forward, as in LLaMA)
Grouped Query Attention (2 KV heads shared by 8 query heads, 4× smaller KV cache)
Weight-tied embeddings (input and output projection share weights)
No bias in linear layers
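
To make the grouped-query attention figure concrete, the sketch below maps each of the 8 query heads to one of the 2 KV heads and compares the KV cache size against standard multi-head attention. The contiguous grouping (query heads 0-3 share KV head 0) is the common convention but is an assumption here, as are the helper names. The head dimension of 32 follows from 256 / 8.

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=2):
    """Map a query head index to the KV head it shares under grouped-query attention."""
    group_size = n_heads // n_kv_heads   # 4 query heads per KV head
    return query_head // group_size


def kv_cache_bytes(n_kv_heads, n_layers=6, head_dim=32, seq_len=256, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (bfloat16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


mapping = [kv_head_for(h) for h in range(8)]
mha = kv_cache_bytes(n_kv_heads=8)   # every query head has its own K/V
gqa = kv_cache_bytes(n_kv_heads=2)   # SVEN-10M configuration

print(mapping)      # [0, 0, 0, 0, 1, 1, 1, 1]
print(mha, gqa)     # 1572864 393216
print(mha // gqa)   # 4
```

The 4× figure applies to the key/value cache and the K/V projection weights, not to the model's total memory footprint.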

Training Details

Optimizer:           AdamW
Learning rate:       3e-4 (cosine decay to 3e-5)
Warmup steps:        100
Weight decay:        0.1
Gradient clip:       1.0
Batch size:          8
Sequence length:     256
Training steps:      1,000
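
The learning-rate settings above correspond to a warmup-then-cosine schedule. Here is a minimal sketch of such a schedule. The linear warmup shape and the function name `lr_at` are assumptions; the actual training loop may differ in detail.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup (assumed shape): reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)                        # clamp past the final step
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + (max_lr - min_lr) * cosine


for step in (0, 99, 100, 550, 1000):
    print(step, f"{lr_at(step):.2e}")
```

With these settings the rate peaks at 3e-4 at the end of warmup, passes through the midpoint 1.65e-4 at step 550, and ends at 3e-5 at step 1,000.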

Loss curve:

step 0:    10.41  (random initialization, expected ≈ ln(32000) ≈ 10.37)
step 100:   8.04  (fast early learning)
step 500:   7.00  (solid convergence)
step 1000:  6.90  (final)
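
For reference, the initialization baseline and the perplexities implied by these losses can be checked with a few lines of arithmetic. This assumes the reported loss is mean cross-entropy in nats (natural log), which is consistent with the step-0 value.

```python
import math

vocab_size = 32_000
uniform_loss = math.log(vocab_size)   # cross-entropy of a uniform guess over the vocabulary
print(f"uniform baseline: {uniform_loss:.2f}")


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


for step, loss in [(0, 10.41), (100, 8.04), (500, 7.00), (1000, 6.90)]:
    print(f"step {step:>4}: loss {loss:.2f}  perplexity {perplexity(loss):,.0f}")
```

A final loss of 6.90 corresponds to a perplexity of roughly 990, down from roughly 33,000 at initialization.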

Intended Use

SVEN-10M is a proof-of-concept research model and the smoke-test checkpoint for the SVEN model family.

It is intended for:

Verifying the full from-scratch training pipeline works end to end
Educational reference for how to train a small LLM from scratch
Experimentation and architecture ablations at low cost

It is not intended for:

Production use
Any task requiring factual accuracy
Code generation in real projects
Replacing larger, properly trained models

At ~11.5M parameters and 1,000 training steps over a 116M-token corpus, this model has seen far too little data to be reliable for any real task. It generates English-like text and has learned basic language patterns, but should be treated as a research artifact.
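
A quick calculation shows how little data that is. The sketch assumes the batch size of 8 counts sequences of 256 tokens and that no gradient accumulation was used; neither detail is stated explicitly in this card.

```python
steps, batch_size, seq_len = 1_000, 8, 256
corpus_tokens = 116_000_000
n_params = 11_537_664

# Assumes 8 sequences of 256 tokens per step and no gradient accumulation.
tokens_seen = steps * batch_size * seq_len

print(f"tokens processed:     {tokens_seen:,}")                   # 2,048,000
print(f"fraction of corpus:   {tokens_seen / corpus_tokens:.1%}")  # 1.8%
print(f"tokens per parameter: {tokens_seen / n_params:.2f}")       # 0.18
```

Under those assumptions the run covers about 2M tokens, under 2% of the corpus and well below one token per parameter.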

Limitations

~11.5M parameters is very small, so the model cannot hold much world knowledge
Only 1,000 training steps, so the model is significantly undertrained
116M tokens is a small corpus for pretraining
Context window of 256 tokens limits long-form understanding
No instruction tuning, RLHF, or alignment of any kind
Not evaluated on standard benchmarks (ARC, HellaSwag, PIQA)

What's Next

SVEN-175M is the full-scale model in this family:

175M parameters
Same architecture, scaled up
Trained on the same corpus for 150,000+ steps
Target: sriksven/sven-175m

Files

File	Description
`model.pt`	Full model checkpoint (weights + optimizer state)
`tokenizer.model`	SentencePiece BPE tokenizer model
`tokenizer.vocab`	Tokenizer vocabulary
`config.yaml`	Model architecture config

Citation

@misc{sven-10m,
  author    = {Sri Krishna Venkatesh},
  title     = {SVEN-10M: A 10M Parameter LLM Trained from Scratch},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/sriksven/sven-10m}
}

About

SVEN stands for Sri Krishna Venkatesh, hidden in plain sight.

Built from scratch. No shortcuts.
