Libraries

How to use veyra-ai/veyra-30m-base-2.5b-tokens with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="veyra-ai/veyra-30m-base-2.5b-tokens", trust_remote_code=True)
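# This is a base model that is not instruction tuned, so pass a plain-text prompt to continue
# rather than a list of chat messages.
pipe("Photosynthesis is the process by which", max_new_tokens=80)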

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("veyra-ai/veyra-30m-base-2.5b-tokens", trust_remote_code=True, dtype="auto")

vLLM

How to use veyra-ai/veyra-30m-base-2.5b-tokens with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server (the repository ships custom code, hence --trust-remote-code):
vllm serve "veyra-ai/veyra-30m-base-2.5b-tokens" --trust-remote-code
# Call the server using curl (OpenAI-compatible API).
# This is a base model, so use the plain completions endpoint rather than the chat endpoint:
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "veyra-ai/veyra-30m-base-2.5b-tokens",
		"prompt": "Photosynthesis is the process by which",
		"max_tokens": 80
	}'

SGLang

How to use veyra-ai/veyra-30m-base-2.5b-tokens with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "veyra-ai/veyra-30m-base-2.5b-tokens" \
    --host 0.0.0.0 \
    --port 30000 \
    --trust-remote-code
# Call the server using curl (OpenAI-compatible API).
# This is a base model, so use the plain completions endpoint rather than the chat endpoint:
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "veyra-ai/veyra-30m-base-2.5b-tokens",
		"prompt": "Photosynthesis is the process by which",
		"max_tokens": 80
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "veyra-ai/veyra-30m-base-2.5b-tokens" \
        --host 0.0.0.0 \
        --port 30000 \
        --trust-remote-code
# Call the server using curl (OpenAI-compatible API).
# This is a base model, so use the plain completions endpoint rather than the chat endpoint:
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "veyra-ai/veyra-30m-base-2.5b-tokens",
		"prompt": "Photosynthesis is the process by which",
		"max_tokens": 80
	}'

Docker Model Runner
How to use veyra-ai/veyra-30m-base-2.5b-tokens with Docker Model Runner:
```
docker model run hf.co/veyra-ai/veyra-30m-base-2.5b-tokens
```

A newer version of this model is available: veyra-ai/veyra-30m-base-5b-tokens

Veyra 30M Base 2.5B Checkpoint

This is an early Veyra-30M base checkpoint trained for approximately 2.5B pretraining tokens.

It is not instruction tuned and should not be evaluated like a finished chat assistant. Expect it to hallucinate, repeat itself, fail simple factual and math prompts, and continue text in odd ways. This checkpoint is uploaded for transparency, reproducibility, and milestone tracking before further continuation training. The model is also not yet optimized for inference and will be very slow; see modeling_veyra.py for details.

Training summary

Approximate training stages:

1B tokens: Cosmopedia v2 bootstrap pretraining.
+1.5B tokens: mixed continuation using Cosmopedia-v2 repository configs including cosmopedia-v2, fineweb-edu-dedup, and python-edu.
Total: about 2.5B pretraining tokens.

Architecture

Veyra-30M is a small attention-sparse decoder-only language model.

Key details:

Exact parameters: 31,988,224 / 31.99M
Vocabulary: 8,192 tokens
Hidden size: 512
Layers: 8
Layer pattern: AMAMAMAM
- A = attention + MLP block
- M = MLP-only block
Attention heads: 8 query heads, 2 KV heads
MLP intermediate size: 2048
Activation: SwiGLU
Normalization: RMSNorm
Position encoding: RoPE
Tied token embeddings / LM head
Context in this checkpoint: 512 tokens
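
As a sanity check on the figures above, the parameter count can be reproduced from the listed dimensions. The sketch below is not taken from modeling_veyra.py. It assumes no bias terms, a head dimension of hidden size divided by query heads, one RMSNorm weight vector per sub-layer (two in an A block, one in an M block), and one final norm. Under those assumptions the arithmetic lands exactly on the stated total.

```python
# Back-of-the-envelope parameter count from the "Key details" list.
# Assumptions (not confirmed by the model card): no biases, head_dim = hidden / query heads,
# one RMSNorm weight per sub-layer (2 per A block, 1 per M block), plus one final norm.

VOCAB = 8192
HIDDEN = 512
INTERMEDIATE = 2048
Q_HEADS = 8
KV_HEADS = 2
PATTERN = "AMAMAMAM"
HEAD_DIM = HIDDEN // Q_HEADS  # 64


def attention_params() -> int:
    """Grouped-query attention: full-width Q and O, narrower K and V."""
    q = HIDDEN * Q_HEADS * HEAD_DIM
    k = HIDDEN * KV_HEADS * HEAD_DIM
    v = HIDDEN * KV_HEADS * HEAD_DIM
    o = Q_HEADS * HEAD_DIM * HIDDEN
    return q + k + v + o


def mlp_params() -> int:
    """SwiGLU MLP: gate, up, and down projections."""
    return 3 * HIDDEN * INTERMEDIATE


def block_params(kind: str) -> int:
    if kind == "A":  # attention + MLP, each preceded by a norm
        return attention_params() + mlp_params() + 2 * HIDDEN
    if kind == "M":  # MLP only, preceded by a norm
        return mlp_params() + HIDDEN
    raise ValueError(f"unknown block kind: {kind!r}")


def total_params(tied_head: bool = True) -> int:
    embedding = VOCAB * HIDDEN
    blocks = sum(block_params(kind) for kind in PATTERN)
    final_norm = HIDDEN
    head = 0 if tied_head else VOCAB * HIDDEN
    return embedding + blocks + final_norm + head


print(total_params(tied_head=True))   # 31988224
print(total_params(tied_head=False))  # 36182528
```

Counting the LM head as a separate matrix gives 36,182,528, which rounds to the 36.2M figure reported in the Safetensors metadata. That would be consistent with the checkpoint file storing the tied head as its own tensor, though the card does not say so.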

Loading

This repository uses custom Transformers code.

Minimal usage:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "veyra-ai/veyra-30m-base-2.5b-tokens"

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, dtype=torch.float32)
model.eval()

prompt = "Photosynthesis is the process by which"
input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.5,
        top_k=30,
        repetition_penalty=1.15,
        no_repeat_ngram_size=2,
        max_new_tokens=80,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))

For raw completion prompts, use add_special_tokens=False.
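
Because this checkpoint has a 512-token context, it can help to make sure the prompt plus the requested new tokens fits inside that window. The helper below is a hypothetical sketch, not part of the repository. It keeps the most recent prompt tokens and drops the oldest ones, which is one reasonable choice for continuation-style prompts. How the model behaves past 512 tokens is not stated in this card.

```python
CONTEXT_LENGTH = 512  # context length stated for this checkpoint


def fit_prompt(token_ids, max_new_tokens=80, context_length=CONTEXT_LENGTH):
    """Trim a list of prompt token ids so prompt + generation fits the context.

    Hypothetical helper: keeps the last (context_length - max_new_tokens) tokens.
    """
    if max_new_tokens >= context_length:
        raise ValueError("max_new_tokens must be smaller than the context length")
    budget = context_length - max_new_tokens
    return list(token_ids[-budget:])


# A 1000-token prompt is cut down to the 432 most recent tokens.
long_prompt = list(range(1000))
print(len(fit_prompt(long_prompt)))  # 432

# Short prompts pass through unchanged.
print(fit_prompt([5, 6, 7]))  # [5, 6, 7]
```

With the usage above, the trimmed ids could be wrapped in a tensor (for example `torch.tensor([ids])`) and passed to `model.generate` in place of `input_ids`.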

Optimizer

Training used:

CosineGatedAdam / CGA-v0 on 2D projection matrices
AdamW on embeddings, norms, tied head, and auxiliary parameters
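
The split described above can be sketched as a simple partition over parameter names and shapes. This is an illustration only. CosineGatedAdam itself is not implemented here, and the parameter names and the keyword list are hypothetical rather than taken from the Veyra code. The embedding matrix is 2D but is routed to the AdamW group by name, since the card lists embeddings and the tied head under AdamW.

```python
def split_for_optimizers(named_shapes, adamw_keywords=("embed", "lm_head")):
    """Partition parameters into a 2D-projection group and an AdamW group.

    named_shapes: iterable of (name, shape) pairs.
    adamw_keywords: hypothetical name fragments that force a parameter into
    the AdamW group even when it is 2D (embeddings, tied head).
    """
    projection_group, adamw_group = [], []
    for name, shape in named_shapes:
        is_2d = len(shape) == 2
        forced_adamw = any(keyword in name for keyword in adamw_keywords)
        if is_2d and not forced_adamw:
            projection_group.append(name)
        else:
            adamw_group.append(name)
    return projection_group, adamw_group


# Hypothetical parameter names with shapes matching the architecture section.
example = [
    ("embed_tokens.weight", (8192, 512)),
    ("layers.0.attn.q_proj.weight", (512, 512)),
    ("layers.0.attn.k_proj.weight", (128, 512)),
    ("layers.0.mlp.gate_proj.weight", (2048, 512)),
    ("layers.0.attn_norm.weight", (512,)),
    ("final_norm.weight", (512,)),
]

projections, others = split_for_optimizers(example)
print(projections)
print(others)
```

With a real PyTorch model, the same partition could be driven by `model.named_parameters()`, using each parameter's `ndim` in place of the shape length.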

Intended use

This checkpoint is primarily for:

continued pretraining
research / ablations
tracking Veyra training milestones
testing tiny model behavior

It is not intended for production use or reliable factual answering.

Known limitations

This model can:

hallucinate confidently
repeat phrases
fail arithmetic
fail simple factual questions
produce fake code
continue in textbook-like or tutorial-like styles

Further continuation pretraining and post-training are planned.

Safetensors
Model size: 36.2M params
Tensor type: F32

Collection including veyra-ai/veyra-30m-base-2.5b-tokens

Veyra (Legacy): the first version of Veyra. These models are meant for local CPU inference.