A newer version of this model is available: veyra-ai/veyra-30m-base-5b-tokens

Veyra 30M Base 2.5B Checkpoint

This is an early Veyra-30M base checkpoint trained for approximately 2.5B pretraining tokens.

It is not instruction tuned and should not be evaluated like a finished chat assistant. Expect it to hallucinate, repeat itself, fail simple factual and math prompts, and continue text in odd ways. This checkpoint is uploaded for transparency, reproducibility, and milestone tracking before further continuation training. The model is also not yet optimized for inference and will be very slow; see modeling_veyra.py for details.

Training summary

Approximate training stages:

  • 1B tokens: Cosmopedia v2 bootstrap pretraining.
  • +1.5B tokens: mixed continuation using Cosmopedia-v2 repository configs including cosmopedia-v2, fineweb-edu-dedup, and python-edu.
  • Total: about 2.5B pretraining tokens.
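As a rough sanity check on the stage totals, the sketch below adds the two stages and relates the result to the parameter count and context length listed in the Architecture section. The function name `training_budget` is hypothetical, the token counts are the approximate figures from this card, and the sequence count assumes every training sequence was packed to the full 512 tokens, which the card does not state.

```python
def training_budget(stage_tokens, parameters, context_length):
    """Summarize an approximate pretraining budget.

    stage_tokens: token counts per training stage (approximate, from the card).
    parameters: model parameter count.
    context_length: tokens per sequence (assumes fully packed sequences).
    """
    total = sum(stage_tokens)
    return {
        "total_tokens": total,
        "tokens_per_parameter": total / parameters,
        "full_context_sequences": total // context_length,
    }


# Stage 1: ~1B bootstrap tokens, stage 2: ~1.5B continuation tokens.
budget = training_budget([1_000_000_000, 1_500_000_000], 31_988_224, 512)
for key, value in budget.items():
    print(f"{key}: {value:,.2f}" if isinstance(value, float) else f"{key}: {value:,}")
```

Under these figures the checkpoint has seen roughly 78 tokens per parameter.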

Architecture

Veyra-30M is a small attention-sparse decoder-only language model.

Key details:

  • Exact parameters: 31,988,224 (31.99M)
  • Vocabulary: 8,192 tokens
  • Hidden size: 512
  • Layers: 8
  • Layer pattern: AMAMAMAM
    • A = attention + MLP block
    • M = MLP-only block
  • Attention heads: 8 query heads, 2 KV heads
  • MLP intermediate size: 2048
  • Activation: SwiGLU
  • Normalization: RMSNorm
  • Position encoding: RoPE
  • Tied token embeddings / LM head
  • Context in this checkpoint: 512 tokens
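The details above are enough to reconstruct the parameter count by hand. The sketch below does so under several assumptions that are not stated in this card: no bias terms anywhere, a head dimension of hidden size divided by query heads (64), three SwiGLU projection matrices per MLP (gate, up, down), one RMSNorm weight vector per sub-layer plus one final norm, and the tied embedding counted once. The function name `count_parameters` is hypothetical and is not part of modeling_veyra.py.

```python
def count_parameters(
    vocab=8192,
    hidden=512,
    pattern="AMAMAMAM",
    q_heads=8,
    kv_heads=2,
    mlp_intermediate=2048,
):
    """Estimate the parameter count of an A/M-patterned decoder (assumptions in the lead-in)."""
    head_dim = hidden // q_heads  # assumed: 512 / 8 = 64

    # Tied token embedding / LM head, counted once.
    embedding = vocab * hidden

    # Grouped-query attention: full-width Q and O, narrower K and V.
    q_and_o = 2 * hidden * (q_heads * head_dim)
    k_and_v = 2 * hidden * (kv_heads * head_dim)
    attention = q_and_o + k_and_v

    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * hidden * mlp_intermediate

    # RMSNorm has one weight vector of size `hidden`.
    norm = hidden

    total = embedding + norm  # final norm before the LM head
    for kind in pattern:
        if kind == "A":
            total += attention + mlp + 2 * norm  # attention + MLP block
        elif kind == "M":
            total += mlp + norm  # MLP-only block
        else:
            raise ValueError(f"unknown block type: {kind!r}")
    return total


print(f"{count_parameters():,}")
```

With these assumptions the total comes to 31,988,224, which agrees with the exact figure listed above.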

Loading

This repository uses custom Transformers code.

Minimal usage:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "veyra-ai/veyra-30m-base-2.5b-tokens"

# trust_remote_code=True is needed because the repository ships custom model code.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, dtype=torch.float32)
model.eval()

# Base model: give it text to continue, not a chat-style instruction.
prompt = "Photosynthesis is the process by which"
input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

# Sample a continuation; the repetition controls reduce looping output.
with torch.no_grad():
    out = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.5,
        top_k=30,
        repetition_penalty=1.15,
        no_repeat_ngram_size=2,
        max_new_tokens=80,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))

For raw completion prompts, pass add_special_tokens=False so that the tokenizer does not add special tokens around the text to be continued.
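Because this checkpoint lists a 512-token context, a long prompt plus the requested new tokens can exceed it. The sketch below is one way to trim a list of token ids before generation, keeping the most recent tokens. The helper name `fit_prompt_to_context` is hypothetical, and how the custom model code behaves beyond 512 tokens is not documented here, so treat this as a precaution and not a requirement.

```python
def fit_prompt_to_context(prompt_ids, max_new_tokens, context_length=512):
    """Keep only the right-most prompt tokens that leave room for generation.

    prompt_ids: list of token ids for the prompt.
    max_new_tokens: number of tokens that will be generated.
    context_length: total window size (512 for this checkpoint, per the card).
    """
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for a prompt")
    # Slicing from the right keeps the text closest to the continuation point.
    return prompt_ids[-budget:]


# Example with dummy ids: a 600-token prompt and 80 new tokens.
trimmed = fit_prompt_to_context(list(range(600)), max_new_tokens=80)
print(len(trimmed))
```

With a real tokenizer, the trimmed list can be wrapped as `torch.tensor([trimmed])` and passed to `model.generate` in place of `input_ids`.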

Optimizer

Training used:

  • CosineGatedAdam / CGA-v0 on 2D projection matrices
  • AdamW on embeddings, norms, tied head, and auxiliary parameters
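The split described above can be pictured as a routing rule over parameter names and shapes. The sketch below shows only that routing; it does not implement CosineGatedAdam, whose details are not given in this card. The function name `split_parameters`, the group labels, and the example parameter names are all hypothetical and may not match the names used in modeling_veyra.py.

```python
def split_parameters(named_shapes):
    """Route parameters to optimizer groups by shape and name.

    named_shapes: dict mapping parameter name -> shape tuple.
    Returns a dict with two name lists: "cga" for 2D projection matrices,
    "adamw" for embeddings, norms, the tied head, and everything else.
    """
    groups = {"cga": [], "adamw": []}
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        # Embedding and head weights are 2D but are kept on AdamW.
        is_embedding_or_head = name.startswith(("embed", "lm_head"))
        if is_matrix and not is_embedding_or_head:
            groups["cga"].append(name)
        else:
            groups["adamw"].append(name)
    return groups


# Hypothetical parameter names for one attention block.
example = {
    "embed_tokens.weight": (8192, 512),
    "layers.0.attn.q_proj.weight": (512, 512),
    "layers.0.attn.k_proj.weight": (128, 512),
    "layers.0.mlp.gate_proj.weight": (2048, 512),
    "layers.0.attn_norm.weight": (512,),
    "final_norm.weight": (512,),
}
print(split_parameters(example))
```

In a real training script, each group would hold the parameter tensors themselves and be handed to its own optimizer instance.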

Intended use

This checkpoint is primarily for:

  • continued pretraining
  • research / ablations
  • tracking Veyra training milestones
  • testing tiny model behavior

It is not intended for production use or reliable factual answering.

Known limitations

This model can:

  • hallucinate confidently
  • repeat phrases
  • fail arithmetic
  • fail simple factual questions
  • produce fake code
  • continue in textbook-like or tutorial-like styles

Further continuation pretraining and post-training are planned.
