gpt2small-en-it-nanochat-lr1e4-bs6-cosine-webwiki-step14000
This repo stages the best benchmark-selected checkpoint from the local NanoChat EN/IT GPT-2-small-like fresh cosine run 20260522_fresh-gpt2small-lr1e4-bs6-cosine-webwiki.
What this is
- model family: GPT-2-small-like decoder-only LM
- parameters: ~136M
- languages: English + Italian
- context length: 2500
- selected checkpoint: step_14000.pt
- selection reason: best full repo-native CPU benchmark result among the checked saved checkpoints from this collapsed run family
Best in-run validation
- step: 21000
- validation loss: 4.0772289262
- validation perplexity: 58.9818002652
- validation batches: 128
Important caveat: the best online validation checkpoint was not the release choice. A later repo-native benchmark plus probe review favored step_14000.pt over the online minimum at step_21000.pt.
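The loss and perplexity figures in this card are consistent with perplexity being the exponential of the mean cross-entropy loss in nats. The short sketch below recomputes perplexity from the reported losses; the function name `perplexity` is illustrative and is not taken from the NanoChat codebase.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    return math.exp(loss)


# Loss values copied from this card.
reported_losses = {
    "online validation, step 21000": 4.0772289262,
    "val_loss_mixed": 5.4492658957,
    "val_loss_en": 5.2534844740,
    "val_loss_it": 4.3484794742,
}

for name, loss in reported_losses.items():
    print(f"{name}: loss={loss:.4f} -> ppl={perplexity(loss):.4f}")
```

This also makes the gap between languages easier to read: a difference of about 0.9 nats between the English and Italian losses corresponds to roughly a 2.5x difference in perplexity.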
Benchmark summary
Repo-native benchmark suite: configs/eval/20260521_pretrain_minimal_en_it_webwiki_step11000.yaml
- val_loss_mixed: 5.4492658957
- ppl_mixed: 232.5873598502
- val_loss_en: 5.2534844740
- ppl_en: 191.2314499043
- val_loss_it: 4.3484794742
- ppl_it: 77.3607444413
- loop_rate: 0.65
- repeated_4gram_rate: 0.95
- cloze_en_contains: 0.02
- cloze_it_contains: 0.14
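This card does not define how repeated_4gram_rate is computed, so the following is only one plausible reading, not the benchmark's actual implementation: the fraction of 4-gram positions in a generation whose 4-gram occurs more than once. The function name `repeated_ngram_rate` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def repeated_ngram_rate(tokens, n=4):
    """Fraction of n-gram positions whose n-gram appears more than once.

    Assumed definition for illustration; the repo-native benchmark may
    aggregate differently (for example, per generation rather than per position).
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Count every position that belongs to an n-gram seen at least twice.
    repeated_positions = sum(c for c in counts.values() if c > 1)
    return repeated_positions / len(ngrams)


looping = "the model is the model is the model is the model is".split()
varied = "a small language model should be evaluated on held out text".split()
print(repeated_ngram_rate(looping))
print(repeated_ngram_rate(varied))
```

Under this reading, a value near 1.0 means almost every 4-gram in the output is a repeat, which matches the heavily repetitive generations described in the limitations.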
Why this checkpoint
The run stayed alive much longer than the earlier 2e-4 cosine attempt, but it still degraded late:
- step_14000 online validation: 4.1088
- step_21000 online validation: 4.0772 (best_validation.json)
- step_22000 online validation: 5.7309
- step_23000 online validation: 4.9233
That online curve alone would suggest 21000. The post-run repo-native CPU benchmark said otherwise:
- step_14000 benchmark val_loss_mixed: 5.4493
- step_21000 benchmark val_loss_mixed: 5.9986
- step_22000 benchmark val_loss_mixed: 7.0789
Probe behavior also backed the earlier checkpoint:
- step_14000 is repetitive but still closer to topic
- step_21000 already drifts more and looks less robust
- step_22000 is clearly broken
So this release intentionally publishes step_14000.pt rather than the raw online-validation winner.
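The disagreement between the two selection signals can be reproduced from the numbers quoted in this section. The sketch below is illustrative only: the dictionaries hold the values listed above, and `best_step` is a hypothetical helper, not part of the NanoChat tooling. It covers only the quantitative part of the decision; the probe review was a qualitative check on top.

```python
# Online validation loss per checkpoint step, as quoted in this card.
online_val_loss = {14000: 4.1088, 21000: 4.0772, 22000: 5.7309, 23000: 4.9233}

# Post-run repo-native CPU benchmark val_loss_mixed for the checked checkpoints.
benchmark_val_loss_mixed = {14000: 5.4493, 21000: 5.9986, 22000: 7.0789}


def best_step(losses):
    """Return the step with the lowest loss (hypothetical helper)."""
    return min(losses, key=losses.get)


online_best = best_step(online_val_loss)
benchmark_best = best_step(benchmark_val_loss_mixed)

print(f"online validation minimum: step {online_best}")
print(f"benchmark minimum: step {benchmark_best}")
```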
Training/data provenance
- training config: training_config.yaml
- tokenizer: tokenizer.json + tokenizer_meta.json
- packed dataset root used by the run: /mnt/apps/llm-nanochat/datasets/202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
- tokenizer root used by the run: /mnt/apps/llm-nanochat/tokenizers/tokenizer_20260515_en50it50_webwiki_stratified_500M
Included files
- step_14000.pt
- step_14000.safetensors
- step_14000.safetensors.json
- training_config.yaml
- tokenizer.json
- tokenizer_meta.json
- best_validation.json
- eval_summary.json
- comparison.json
- benchmark_report.md
- probe_step14000_summary.json
- full run telemetry snapshots: eval_metrics.jsonl, metrics.jsonl, probe_generations.jsonl
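To inspect the safetensors file without the custom NanoChat stack, the tensor names, dtypes, and shapes can be read from the file header using only the Python standard library. This relies on the published safetensors layout (an 8-byte little-endian header length followed by a UTF-8 JSON header); the helper names below are illustrative, and nothing here is assumed about the tensor names inside this particular checkpoint.

```python
import json
import struct


def read_safetensors_header(path):
    """Read the JSON header of a safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize_tensors(header):
    """Map tensor name -> (dtype, shape), skipping the optional metadata entry."""
    return {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }


# Example call, assuming the file has been downloaded next to this script:
# for name, (dtype, shape) in summarize_tensors(
#         read_safetensors_header("step_14000.safetensors")).items():
#     print(name, dtype, shape)
```

Reading only the header is cheap and avoids unpickling step_14000.pt, which is useful when all that is needed is a quick check of the parameter layout.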
Probe reading at step 14000
- EN factual prompt "The capital of Italy is" -> "Rome": very weak (rank=877, prob=0.0000968)
- EN simple continuation "A small language model should" -> "be": strong (rank=1, prob=0.5508)
- IT factual prompt "La capitale d'Italia è" -> "Roma": very weak (rank=665, prob=0.0001030)
- IT simple continuation "Un piccolo modello linguistico dovrebbe" -> "essere": strong (rank=1, prob=0.2412)
These probes are directional evidence only. The main selection rule here is the repo-native benchmark result plus the qualitative collapse review.
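The card does not spell out how the rank and prob values are derived. A natural reading, assumed here, is that they describe the target token under the model's next-token distribution after the prompt. The sketch below shows that computation on a plain list of logits; `rank_and_prob` is a hypothetical helper, and the toy logits stand in for a real model output.

```python
import math


def rank_and_prob(logits, target_id):
    """Return (rank, probability) of target_id under softmax(logits).

    Rank 1 means the target is the most likely next token.
    """
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    prob = exps[target_id] / sum(exps)
    # Rank = 1 + number of tokens with a strictly higher logit.
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    return rank, prob


# Toy vocabulary of three tokens; index 0 has the highest logit.
toy_logits = [2.0, 1.0, 0.0]
print(rank_and_prob(toy_logits, 0))
print(rank_and_prob(toy_logits, 2))
```

Read this way, a rank in the hundreds with a probability near 1e-4 means the correct answer is far from the top of the distribution, which is why the factual probes are labeled very weak.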
Usage
This project uses a custom NanoChat inference/training stack. The easiest local UI in the source repo is the Chainlit checkpoint tester documented in the repo README.
Limitations
- mixed quality is still in the weak/intermediate band
- generations remain heavily repetitive
- factual recall is weak in both languages
- this checkpoint is the best preserved artifact inside this collapsed run family, not a claim of broad downstream excellence
- dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus