gpt2small-en-it-nanochat-lr1e4-bs6-cosine-webwiki-step14000

This repo stages the benchmark-selected checkpoint, step_14000.pt, from the local NanoChat EN/IT GPT-2-small-like fresh cosine run 20260522_fresh-gpt2small-lr1e4-bs6-cosine-webwiki.

What this is

model family: GPT-2-small-like decoder-only LM
parameters: ~136M
languages: English + Italian
context length: 2500 tokens
selected checkpoint: step_14000.pt
selection reason: best result on the full repo-native CPU benchmark among the saved checkpoints that were evaluated from this run family, which collapsed late in training

Best in-run validation

step: 21000
validation loss: 4.0772289262
validation perplexity: 58.9818002652
validation batches: 128
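
As a quick consistency check, the reported perplexity should equal the exponential of the reported loss. The sketch below assumes the loss is a mean per-token cross-entropy in nats; that convention is an assumption, since the training stack's definition is not stated here.

```python
import math

# Values copied from the "Best in-run validation" block.
val_loss = 4.0772289262
reported_ppl = 58.9818002652

# Assumption: loss is mean cross-entropy in nats, so perplexity = exp(loss).
ppl = math.exp(val_loss)

print(f"recomputed perplexity: {ppl:.4f}")
print(f"matches reported value: {math.isclose(ppl, reported_ppl, rel_tol=1e-6)}")
```

The same relation can be applied to each val_loss / ppl pair in the benchmark summary.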

Important caveat: the best online validation checkpoint was not the release choice. A later repo-native benchmark plus probe review favored step_14000.pt over the online minimum at step_21000.pt.

Benchmark summary

Repo-native benchmark suite: configs/eval/20260521_pretrain_minimal_en_it_webwiki_step11000.yaml

val_loss_mixed: 5.4492658957
ppl_mixed: 232.5873598502
val_loss_en: 5.2534844740
ppl_en: 191.2314499043
val_loss_it: 4.3484794742
ppl_it: 77.3607444413
loop_rate: 0.65
repeated_4gram_rate: 0.95
cloze_en_contains: 0.02
cloze_it_contains: 0.14
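
For readers unfamiliar with repetition metrics, here is a minimal sketch of one common way to measure repeated 4-grams in a single generation. The benchmark's exact definition of repeated_4gram_rate is not given here, so the function name `repeated_ngram_rate` and the formula are illustrative assumptions, not the repo's implementation.

```python
def repeated_ngram_rate(tokens, n=4):
    """Fraction of n-grams that already appeared earlier in the same sequence.

    Hypothetical helper: one plausible reading of a "repeated n-gram rate".
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n: nothing to repeat
    seen = set()
    repeats = 0
    for gram in ngrams:
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / len(ngrams)


# Toy inputs, whitespace-tokenized for illustration only.
looping = "the cat sat on the cat sat on the cat sat on".split()
varied = "a small language model should be evaluated with care".split()

print(repeated_ngram_rate(looping))
print(repeated_ngram_rate(varied))
```

A looping generation scores well above zero under this definition, while text with no repeated 4-gram scores exactly zero.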

Why this checkpoint

The run stayed alive much longer than the earlier 2e-4 cosine attempt, but it still degraded late:

step_14000 online validation: 4.1088
step_21000 online validation: 4.0772 (best_validation.json)
step_22000 online validation: 5.7309
step_23000 online validation: 4.9233

That online curve alone would suggest 21000. The post-run repo-native CPU benchmark said otherwise:

step_14000 benchmark val_loss_mixed: 5.4493
step_21000 benchmark val_loss_mixed: 5.9986
step_22000 benchmark val_loss_mixed: 7.0789

Probe behavior also backed the earlier checkpoint:

step_14000 is repetitive but still closer to topic
step_21000 already drifts more and looks less robust
step_22000 is clearly broken

So this release intentionally publishes step_14000.pt rather than the raw online-validation winner.
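
The disagreement between the two signals can be reproduced from the numbers quoted in this section. This is only a sketch of the quantitative comparison; the qualitative probe review that also informed the choice is not captured by it.

```python
# Values copied from the lists in this section (step -> loss).
online_val = {14000: 4.1088, 21000: 4.0772, 22000: 5.7309, 23000: 4.9233}
benchmark_mixed = {14000: 5.4493, 21000: 5.9986, 22000: 7.0789}

# Each signal picks the step with the lowest loss.
best_online = min(online_val, key=online_val.get)
best_benchmark = min(benchmark_mixed, key=benchmark_mixed.get)

# How much each signal separates step_14000 from step_21000.
online_gap = online_val[14000] - online_val[21000]
benchmark_gap = benchmark_mixed[21000] - benchmark_mixed[14000]

print(f"online validation favors step_{best_online} by {online_gap:.4f}")
print(f"benchmark favors step_{best_benchmark} by {benchmark_gap:.4f}")
```

The online margin in favor of step_21000 is small, while the benchmark margin in favor of step_14000 is more than an order of magnitude larger.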

Training/data provenance

training config: training_config.yaml
tokenizer: tokenizer.json + tokenizer_meta.json
packed dataset root used by the run: /mnt/apps/llm-nanochat/datasets/202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
tokenizer root used by the run: /mnt/apps/llm-nanochat/tokenizers/tokenizer_20260515_en50it50_webwiki_stratified_500M

Included files

step_14000.pt
step_14000.safetensors
step_14000.safetensors.json
training_config.yaml
tokenizer.json
tokenizer_meta.json
best_validation.json
eval_summary.json
comparison.json
benchmark_report.md
probe_step14000_summary.json
full run telemetry snapshots: eval_metrics.jsonl, metrics.jsonl, probe_generations.jsonl

Probe reading at step 14000

EN factual prompt "The capital of Italy is" -> "Rome": very weak (rank=877, prob=0.0000968)
EN simple continuation "A small language model should" -> "be": strong (rank=1, prob=0.5508)
IT factual prompt "La capitale d'Italia è" -> "Roma": very weak (rank=665, prob=0.0001030)
IT simple continuation "Un piccolo modello linguistico dovrebbe" -> "essere": strong (rank=1, prob=0.2412)

These probes are directional evidence only. The main selection rule here is the repo-native benchmark result plus the qualitative collapse review.
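
To make the rank and prob figures concrete, the sketch below shows how such values are typically derived from a model's next-token logits: softmax for the probability, and a count of higher-scoring tokens for the rank. The helper `target_rank_and_prob` and the toy logits are hypothetical; whether the repo's probe computes them exactly this way is an assumption.

```python
import math


def target_rank_and_prob(logits, target_id):
    """Return (1-based rank, softmax probability) of target_id.

    Hypothetical helper operating on a plain list of next-token logits.
    """
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    prob = exps[target_id] / sum(exps)
    # Rank 1 means no other token has a strictly higher logit.
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    return rank, prob


# Toy 4-token vocabulary; a real probe would use the model's full vocabulary.
toy_logits = [2.0, 0.5, -1.0, 0.0]
rank, prob = target_rank_and_prob(toy_logits, target_id=0)
print(rank, round(prob, 4))
```

Under this reading, rank=877 with prob=0.0000968 means 876 vocabulary entries scored higher than the expected answer token.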

Usage

This project uses a custom NanoChat inference/training stack. The easiest local UI in the source repo is the Chainlit checkpoint tester documented in the repo README.
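
Because the weights are also shipped as step_14000.safetensors, the tensor names and shapes can be inspected without the NanoChat stack. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table) using the standard library. The helper names are hypothetical, and the parameter count it prints may differ from the ~136M figure if tied weights are stored once or counted differently.

```python
import json
import os
import struct


def read_safetensors_header(path):
    """Return the tensor table (name -> dtype/shape/offsets) of a safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def count_parameters(header):
    """Sum the element counts implied by each tensor's shape."""
    total = 0
    for info in header.values():
        n = 1
        for dim in info["shape"]:
            n *= dim
        total += n
    return total


if __name__ == "__main__":
    path = "step_14000.safetensors"  # file name from the "Included files" list
    if os.path.exists(path):
        table = read_safetensors_header(path)
        print(f"{len(table)} tensors, {count_parameters(table) / 1e6:.1f}M elements")
```

Running a forward pass still requires the model definition from the source repo; this only confirms what the file contains.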

Limitations

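mixed EN/IT quality is still in the weak/intermediate band (val_loss_mixed 5.4493, ppl_mixed 232.59)
generations remain heavily repetitive (loop_rate 0.65, repeated_4gram_rate 0.95)
factual recall is weak in both languages (cloze_en_contains 0.02, cloze_it_contains 0.14)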
this checkpoint is the best preserved artifact inside this collapsed run family, not a claim of broad downstream excellence
dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus
