gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-step10000
This repo stages the best saved checkpoint from the local NanoChat EN/IT GPT-2-small-like run 20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay.
What this is
- model family: GPT-2-small-like decoder-only LM
- parameters: ~136M
- languages: English + Italian
- context length: 2500
- selected checkpoint: step_10000.pt
- selection reason: lowest recorded validation loss among saved checkpoints in best_validation.json
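The selection rule above can be sketched as follows. This is a minimal illustration, not the run's actual selection code: the helper name `pick_best_checkpoint` and the record keys `step` and `val_loss` are hypothetical, and the real schema of the telemetry files may differ.

```python
import json


def pick_best_checkpoint(jsonl_lines, saved_steps, step_key="step", loss_key="val_loss"):
    """Return (step, loss) with the lowest validation loss among saved checkpoints.

    step_key and loss_key are assumed field names, not confirmed by the run.
    Returns None when no row matches a saved step.
    """
    best = None
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        step, loss = row.get(step_key), row.get(loss_key)
        # Only steps that were actually written to disk are candidates.
        if step not in saved_steps or loss is None:
            continue
        if best is None or loss < best[1]:
            best = (step, loss)
    return best
```

Restricting the search to `saved_steps` matters because evaluation may run more often than checkpointing, so the globally lowest logged loss need not correspond to a file on disk.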
Best validation
- step: 10000
- validation loss: 3.8945770748
- validation perplexity: 49.1352684243
- validation batches: 128
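The two reported numbers are consistent with perplexity being the exponential of the loss, which is the usual relationship when the loss is a mean per-token cross-entropy in nats (that unit is an assumption here, not something the run files state). A quick check:

```python
import math

val_loss = 3.8945770748       # validation loss reported above (assumed to be in nats per token)
reported_ppl = 49.1352684243  # validation perplexity reported above

# Perplexity as exp(mean cross-entropy).
ppl = math.exp(val_loss)

# The recomputed value agrees with the reported one to well within rounding.
assert math.isclose(ppl, reported_ppl, rel_tol=1e-6)
```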
Important caveat
This checkpoint is the best validation checkpoint within this run family. It is a useful intermediate bilingual pretraining artifact, not a polished factual assistant model.
Training/data provenance
- training config: training_config.yaml
- tokenizer: tokenizer.json + tokenizer_meta.json
- packed dataset root used by the run: /mnt/apps/llm-nanochat/datasets/202605011052_fresh_50_50_score100_2500_sourcebalanced
- tokenizer root used by the run: /mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch
Included files
- step_10000.pt
- step_10000.safetensors
- step_10000.safetensors.json
- training_config.yaml
- tokenizer.json
- tokenizer_meta.json
- best_validation.json
- eval_summary.json
- probe_step10000_summary.json
- full run telemetry snapshots: eval_metrics.jsonl, metrics.jsonl, probe_generations.jsonl
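To inspect the safetensors file without the NanoChat stack, the header can be read with the standard library alone. The sketch below relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names `read_safetensors_header` and `count_parameters` are illustrative, not part of this repo.

```python
import json
import math
import struct


def read_safetensors_header(path):
    """Return the tensor table of a .safetensors file: name -> dtype, shape, data_offsets."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def count_parameters(header):
    """Sum the element counts of all tensors listed in the header."""
    return sum(math.prod(info["shape"]) for info in header.values())
```

Calling `count_parameters(read_safetensors_header("step_10000.safetensors"))` gives a total that can be compared against the ~136M figure above, assuming the file holds only model weights. Tied or shared tensors may be stored once, so the total can differ from a count taken on the instantiated model.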
Probe reading at step 10000
The run includes probe telemetry, but the stored payload for this experiment uses a legacy, partial format: the probe_generations.jsonl entries at step 10000 retain the prompts and expected continuations, while the generated-text and target-rank fields are null. This release therefore makes no strong probe-quality claims based on those rows.
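A reader who wants to confirm which fields are empty can count null values per key across the JSONL rows. The following is a minimal sketch that assumes each line is a JSON object; it makes no assumption about the actual field names, and `null_field_counts` is an illustrative helper rather than part of the NanoChat tooling.

```python
import json
from collections import Counter


def null_field_counts(jsonl_lines):
    """Return (row_count, {key: number of rows where that key is null})."""
    total = 0
    nulls = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        total += 1
        for key, value in row.items():
            if value is None:  # JSON null decodes to Python None
                nulls[key] += 1
    return total, dict(nulls)
```

Because an open file iterates line by line, it can be passed directly: `with open("probe_generations.jsonl") as f: total, nulls = null_field_counts(f)`. Keys whose null count equals `total` are empty in every row.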
Usage
This project uses a custom NanoChat inference/training stack. The easiest local UI in the source repo is the Chainlit checkpoint tester documented in the repo README.
Limitations
- factual recall is still limited
- generations may become repetitive
- the model was selected by validation loss inside this run family, not by broad downstream benchmark performance
- dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus