SVEN-10M

A ~11.5M-parameter language model trained from scratch on real data for ~$1.

SVEN-10M is the first checkpoint in the SVEN model family, built entirely from scratch: custom tokenizer, custom architecture, custom training loop. No fine-tuning. No LoRA. Trained on real public datasets on a single RTX 3090 GPU.


Model Details

Architecture:         Decoder-only transformer (LLaMA-style)
Parameters:           11,537,664 (~11.5M)
Context length:       256 tokens
Vocabulary:           32,000 (BPE, trained on this corpus)
Layers:               6
Hidden size:          256
Attention heads:      8 (2 KV heads, GQA)
Activation:           SwiGLU
Positional encoding:  RoPE
Normalization:        RMSNorm
Training steps:       1,000
Training loss:        10.41 → 6.90
Precision:            bfloat16
GPU:                  1× NVIDIA RTX 3090 (24 GB)
Training cost:        ~$1
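
As a sanity check, the parameter count can be rebuilt from the figures listed above. This is a rough sketch rather than the author's code: the feed-forward width `d_ff = 512` is not stated anywhere in this card and is an assumption, chosen because it makes the total match. The rest of the layout (tied embeddings, no biases, two RMSNorm weights per layer plus a final norm) follows the Architecture Notes.

```python
# Rough parameter count for a LLaMA-style decoder using the figures listed above.
# d_ff is NOT given in this card; 512 is an assumed value that reproduces the total.
vocab, d_model, n_layers = 32_000, 256, 6
n_heads, n_kv_heads = 8, 2
d_ff = 512  # assumption

head_dim = d_model // n_heads        # 32
kv_dim = n_kv_heads * head_dim       # 64

embedding = vocab * d_model          # shared with the output projection (weight tying)
attention = 2 * d_model * d_model + 2 * d_model * kv_dim  # Wq and Wo, plus Wk and Wv (no bias)
mlp = 3 * d_model * d_ff             # SwiGLU: gate, up, and down projections
norms = 2 * d_model                  # two RMSNorm weight vectors per layer

total = embedding + n_layers * (attention + mlp + norms) + d_model  # + final RMSNorm
print(f"{total:,}")  # 11,537,664
```

Under these assumptions the embedding table alone accounts for 8,192,000 of the parameters, so roughly 71% of the model is vocabulary rather than transformer layers.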

Training Data

Trained on a curated English-only mix of 116M tokens from 5 public sources:

Source            Documents   Content
FineWeb-Edu       49,993      High-quality educational web text
Wikipedia (EN)    19,981      English Wikipedia articles
OpenWebMath       14,918      Mathematical reasoning and problems
Python code       18,390      Python programming instructions
JavaScript code   15,000      JavaScript code samples
Total             118,282     116M tokens

All data was filtered for English (ASCII ratio + common word check), quality-filtered, and deduplicated before training.
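
The filtering step could look roughly like the sketch below. It is a minimal illustration of an ASCII-ratio check, a common-word check, and exact deduplication. The word list, the thresholds, and the function names `looks_english` and `dedupe` are hypothetical; the card does not publish the actual filter or say whether deduplication was exact or fuzzy.

```python
import hashlib

# Hypothetical word list and thresholds; the card does not publish the real ones.
COMMON_WORDS = {"the", "and", "of", "to", "in", "is", "that", "for", "with", "as"}

def looks_english(text, min_ascii_ratio=0.95, min_common_ratio=0.05):
    """Heuristic English check: mostly ASCII and enough common English words."""
    words = text.lower().split()
    if not words:
        return False
    ascii_ratio = sum(ch.isascii() for ch in text) / len(text)
    common = sum(w.strip(".,;:!?\"'()") in COMMON_WORDS for w in words)
    return ascii_ratio >= min_ascii_ratio and common / len(words) >= min_common_ratio

def dedupe(docs):
    """Exact deduplication on whitespace-normalized, lowercased text."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = [
    "The cat sat on the mat and looked at the dog.",
    "The cat sat on the mat  and looked at the dog.",  # duplicate after normalization
    "Der schnelle braune Fuchs springt über den Zaun.",
]
kept = [d for d in dedupe(docs) if looks_english(d)]
print(kept)
```

In this toy run the German sentence passes the ASCII check but fails the common-word check, which is why the two heuristics are combined.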

Tokenizer: a custom BPE tokenizer with a 32,000-token vocabulary, trained on this exact corpus using SentencePiece.


Architecture Notes

SVEN-10M uses a modern LLaMA-style architecture rather than the original GPT-2 design:

  • RoPE instead of learned positional embeddings
  • RMSNorm instead of LayerNorm (faster, no mean subtraction)
  • SwiGLU instead of GELU (better gradient flow)
  • Grouped Query Attention (2 KV heads vs 8 Q heads, 4× smaller KV cache)
  • Weight-tied embeddings (input and output projection share weights)
  • No bias in linear layers
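
To make the grouped-query attention saving concrete, the arithmetic below compares the per-sequence KV cache with 8 KV heads (standard multi-head attention) against the 2 KV heads used here. It is a back-of-the-envelope sketch that assumes a bfloat16 cache and a full 256-token context; the helper name `kv_cache_bytes` is made up for illustration.

```python
# KV-cache size per sequence for the listed configuration, in bfloat16 (2 bytes).
n_layers, d_model, n_heads, n_kv_heads = 6, 256, 8, 2
context, bytes_per_value = 256, 2
head_dim = d_model // n_heads  # 32

def kv_cache_bytes(kv_heads):
    # Keys and values (factor 2) for every layer, position, and KV head.
    return 2 * n_layers * context * kv_heads * head_dim * bytes_per_value

mha = kv_cache_bytes(n_heads)     # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads)  # 2 KV heads shared across 8 query heads
print(mha, gqa, mha // gqa)
```

The 4× factor applies to the cached keys and values (and to the K and V projection weights), not to the model's total memory footprint.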

Training Details

Optimizer:           AdamW
Learning rate:       3e-4 (cosine decay to 3e-5)
Warmup steps:        100
Weight decay:        0.1
Gradient clip:       1.0
Batch size:          8
Sequence length:     256
Training steps:      1,000
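
The learning-rate settings above can be sketched as a schedule function. The card gives the peak rate, the floor, and the warmup length, but not the warmup shape, so the linear warmup here is an assumption, and `lr_at` is a hypothetical helper rather than the training loop's actual code.

```python
import math

PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 100, 1_000

def lr_at(step):
    """Linear warmup (assumed shape), then cosine decay from PEAK_LR to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

for step in (0, 99, 100, 550, 1000):
    print(step, f"{lr_at(step):.2e}")
```

With these values the rate reaches 3e-4 at the end of warmup, passes through the midpoint 1.65e-4 at step 550, and ends at 3e-5.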

Loss curve:

step 0:    10.41  (random initialization, expected ≈ log(32000) = 10.37)
step 100:   8.04  (fast early learning)
step 500:   7.00  (solid convergence)
step 1000:  6.90  (final)
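
The step-0 baseline and the meaning of the later values can be checked with a few lines of arithmetic. The sketch assumes the reported loss is mean per-token cross-entropy in nats, which is consistent with the log(32000) comparison above but is not stated explicitly in the card.

```python
import math

vocab_size = 32_000
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess, in nats
print(f"uniform baseline: {uniform_loss:.2f}")

for step, loss in [(0, 10.41), (100, 8.04), (500, 7.00), (1000, 6.90)]:
    # Perplexity is exp(loss) when the loss is measured in nats.
    print(f"step {step:>4}: loss {loss:.2f}  perplexity {math.exp(loss):,.0f}")
```

Read this way, the final loss of 6.90 corresponds to a perplexity of roughly 990, down from about 32,000 for a uniform guess over the vocabulary.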

Intended Use

SVEN-10M is a proof-of-concept research model and the smoke-test checkpoint for the SVEN model family.

It is intended for:

  • Verifying the full from-scratch training pipeline works end to end
  • Educational reference for how to train a small LLM from scratch
  • Experimentation and architecture ablations at low cost

It is not intended for:

  • Production use
  • Any task requiring factual accuracy
  • Code generation in real projects
  • Replacing larger, properly trained models

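At 11.5M parameters and 1,000 training steps, this model has seen far too little data to be reliable for any real task. If the batch size of 8 counts sequences of 256 tokens with no gradient accumulation, training covered about 2M tokens, a small fraction of the 116M-token corpus. It can generate English-like text and shows it has learned basic language patterns, but should be treated as a research artifact.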


Limitations

  • 11M parameters is very small - the model cannot hold much world knowledge
  • Only 1,000 training steps - significantly undertrained
  • 116M tokens is a small corpus for pretraining
  • Context window of 256 tokens limits long-form understanding
  • No instruction tuning, RLHF, or alignment of any kind
  • Not evaluated on standard benchmarks (ARC, HellaSwag, PIQA)
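
The undertraining point can be quantified from the hyperparameters in Training Details. The calculation below assumes that batch size 8 means 8 sequences per optimizer step with no gradient accumulation, which the card does not confirm.

```python
# Token budget implied by the listed hyperparameters.
# Assumes batch size 8 means 8 sequences per step with no gradient accumulation.
steps, batch_size, seq_len = 1_000, 8, 256
corpus_tokens = 116_000_000  # "116M tokens", rounded as in the card
n_params = 11_537_664

tokens_seen = steps * batch_size * seq_len
print(f"tokens seen:      {tokens_seen:,}")
print(f"share of corpus:  {tokens_seen / corpus_tokens:.2%}")
print(f"tokens per param: {tokens_seen / n_params:.2f}")
```

Under that assumption the run processes 2,048,000 tokens, which is less than 2% of the corpus and well under one token per parameter.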

What's Next

SVEN-175M is the full-scale model in this family:

  • 175M parameters
  • Same architecture, scaled up
  • Trained on the same corpus for 150,000+ steps
  • Target: sriksven/sven-175m

Files

File              Description
model.pt          Full model checkpoint (weights + optimizer state)
tokenizer.model   SentencePiece BPE tokenizer model
tokenizer.vocab   Tokenizer vocabulary
config.yaml       Model architecture config

Citation

@misc{sven-10m,
  author    = {Sri Krishna Venkatesh},
  title     = {SVEN-10M: A 10M Parameter LLM Trained from Scratch},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/sriksven/sven-10m}
}

About

SVEN stands for Sri Krishna Venkatesh - hidden in plain sight.

Built from scratch. No shortcuts.
