---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceTB/smollm-corpus
- epfml/FineWeb-HQ
- openbmb/Ultra-FineWeb
language:
- en
library_name: transformers
tags:
- language-model
- transformer
- rope
- swiglu
- custom-architecture
- custom-tokenizer
- xgqa
pipeline_tag: text-generation
---

![axiomic banner](AxiomicBanner.png)

# GPT-S-1.4M

GPT-S-1.4M is a first-generation model in the GPT-S small-model family: 1.4M parameters, 6B training tokens, a custom 4K tokenizer, 5 layers, and the all-new Exclusive Grouped-Query Attention (XGQA), trained from scratch on a 5-source corpus.

See how it compares to similar models here: [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard)

## Benchmarks

All evaluations use zero-shot multiple-choice scoring. Normalized accuracy is reported where available.

| Benchmark | Score |
|---|---:|
| HellaSwag | 26.89% |
| ARC-Easy | 31.57% |
| ARC-Challenge | 21.93% |
| ARC Average | 26.75% |
| PIQA | 55.17% |
| ArithMark 2 | 25.16% |

## Architecture

| Component | Details |
|---|---|
| Position encoding | RoPE, theta=2,500 |
| Normalization | RMSNorm |
| Feed-forward | SwiGLU |
| Attention | Exclusive Grouped-Query Attention (XGQA), 4 query heads / 2 KV heads |
| Embeddings | Weight tied |
| Context length | 384 tokens |

### Config

```text
vocab_size   = 4,096
hidden_size  = 128
num_layers   = 5
num_heads    = 4
num_kv_heads = 2
head_dim     = 32
intermediate = 341
block_size   = 384
rope_theta   = 2,500
```

## Training

GPT-S-1.4M was trained from scratch for 6B tokens on a mixed English corpus built around educational web text, synthetic textbook-style material, and higher-quality web text.

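As a rough sanity check on the 1.4M figure and the 6B-token budget, the sketch below counts parameters from the config values above. It assumes bias-free linear layers, one RMSNorm weight vector before each attention and feed-forward block plus a final one, and that XGQA uses the same projection shapes as standard grouped-query attention. None of these assumptions is confirmed by this card, and `estimate_params` is a hypothetical helper, not part of the model's code.

```python
def estimate_params(vocab_size=4096, hidden_size=128, num_layers=5,
                    num_heads=4, num_kv_heads=2, head_dim=32,
                    intermediate=341):
    """Rough parameter count under the assumptions stated in the lead-in."""
    # Tied input/output embedding, counted once.
    embed = vocab_size * hidden_size
    # Attention projections (assumed: standard GQA shapes, no biases).
    q_proj = hidden_size * num_heads * head_dim
    kv_proj = 2 * hidden_size * num_kv_heads * head_dim
    o_proj = num_heads * head_dim * hidden_size
    attn = q_proj + kv_proj + o_proj
    # SwiGLU feed-forward: gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate
    # Assumed: two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size
    per_layer = attn + mlp + norms
    # Assumed: one final RMSNorm after the last layer.
    return embed + num_layers * per_layer + hidden_size


total = estimate_params()
tokens_per_param = 6e9 / total
print(f"{total:,} parameters, ~{tokens_per_param:,.0f} training tokens per parameter")
```

Under these assumptions the total comes to 1,426,176 parameters, of which the tied embedding accounts for 524,288. That is consistent with the 1.4M in the model name, though the actual XGQA implementation may differ.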

| Source | Dataset | Mix | Purpose |
|---|---|---:|---|
| FineWeb-Edu | [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | 15% | Primary educational web text |
| Cosmopedia v2 | [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) | 30% | Synthetic textbook-style coverage |
| FineWeb-HQ | [epfml/FineWeb-HQ](https://huggingface.co/datasets/epfml/FineWeb-HQ) | 20% | Higher-quality general web text |
| Ultra-FineWeb QA | [openbmb/Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) L3 English QA slice | 20% | Question-answer style web text |
| Ultra-FineWeb Multi-style | [openbmb/Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) L3 English multi-style slice | 15% | Broader writing-style coverage |

### Hyperparameters

| Hyperparameter | Value |
|---|---:|
| Optimizer | AdamW |
| Adam betas | 0.9 / 0.95 |
| Weight decay | 0.01 |
| Peak learning rate | 3.5e-3 |
| Minimum learning rate | 0 |
| LR schedule | Warmup-stable-decay |
| Warmup steps | 2,000 |
| Decay start | 80% of configured training run |
| Training tokens | 6B |
| Total batch size | 294,912 tokens |
| Microbatch | 256 x 384 tokens |
| Gradient accumulation steps | 3 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 autocast |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "AxiomicLabs/GPT-S-1.4M"

# trust_remote_code=True is required because the architecture and tokenizer are custom.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a short continuation; prompt plus new tokens must fit in the 384-token context.
with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        no_repeat_ngram_size=4,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Limitations

This is a very small base language model. It is not instruction-tuned, has limited factual capacity, and uses a 384-token context window.
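
Because the prompt and the generated tokens share the 384-token window, long prompts need trimming before generation. The helper below is a minimal sketch of one way to do that on a plain list of token IDs. `fit_to_context` is a hypothetical name, not part of the model's code, and keeping the most recent tokens is an assumed policy rather than anything this card prescribes.

```python
BLOCK_SIZE = 384  # context length from the config in this card


def fit_to_context(token_ids, max_new_tokens, block_size=BLOCK_SIZE):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if not 0 < max_new_tokens < block_size:
        raise ValueError("max_new_tokens must be between 1 and block_size - 1")
    # Tokens left for the prompt once the generation budget is reserved.
    budget = block_size - max_new_tokens
    # Slicing from the end drops the oldest tokens; short prompts pass through unchanged.
    return token_ids[-budget:]


prompt_ids = list(range(500))  # stand-in for real tokenizer output
trimmed = fit_to_context(prompt_ids, max_new_tokens=80)
print(len(trimmed))
```

With the `max_new_tokens=80` setting from the Usage example, this leaves 304 tokens for the prompt.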