gpt2small-en-it-nanochat-lr1e4-bs6-cosine-webwiki-step14000

This repo stages the checkpoint that scored best on the repo-native benchmark, taken from the local NanoChat EN/IT GPT-2-small-like fresh cosine run 20260522_fresh-gpt2small-lr1e4-bs6-cosine-webwiki.

What this is

  • model family: GPT-2-small-like decoder-only LM
  • parameters: ~136M
  • languages: English + Italian
  • context length: 2500
  • selected checkpoint: step_14000.pt
  • selection reason: best full repo-native CPU benchmark result among the checked saved checkpoints from this collapsed run family

Best in-run validation

  • step: 21000
  • validation loss: 4.0772289262
  • validation perplexity: 58.9818002652
  • validation batches: 128
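
The reported perplexity is consistent with the usual relation perplexity = exp(loss). A minimal check, assuming the validation loss is a mean token-level cross-entropy in nats (the variable names are illustrative and not taken from the repo):

```python
import math

# Values copied from the "Best in-run validation" list above.
val_loss = 4.0772289262
reported_ppl = 58.9818002652

# Assumption: the loss is mean cross-entropy in nats, so ppl = exp(loss).
computed_ppl = math.exp(val_loss)
matches = math.isclose(computed_ppl, reported_ppl, rel_tol=1e-6)

print(f"computed perplexity: {computed_ppl:.4f}")
print(f"matches reported value: {matches}")
```

The same relation holds for the benchmark loss/perplexity pairs listed under "Benchmark summary".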

Important caveat: the best online validation checkpoint was not the release choice. A later repo-native benchmark plus probe review favored step_14000.pt over the online minimum at step_21000.pt.

Benchmark summary

Repo-native benchmark suite: configs/eval/20260521_pretrain_minimal_en_it_webwiki_step11000.yaml

  • val_loss_mixed: 5.4492658957
  • ppl_mixed: 232.5873598502
  • val_loss_en: 5.2534844740
  • ppl_en: 191.2314499043
  • val_loss_it: 4.3484794742
  • ppl_it: 77.3607444413
  • loop_rate: 0.65
  • repeated_4gram_rate: 0.95
  • cloze_en_contains: 0.02
  • cloze_it_contains: 0.14
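
This card does not define how repeated_4gram_rate is computed. As a rough intuition only, one plausible reading is "the fraction of generations that contain at least one repeated 4-gram". The sketch below implements that reading under loud assumptions: whitespace tokenization, hypothetical function names, and made-up sample strings. The repo's benchmark code may define the metric differently.

```python
def has_repeated_ngram(tokens, n=4):
    """Return True if any n-gram occurs more than once in the token list."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


def repeated_ngram_rate(generations, n=4):
    """Fraction of generations containing at least one repeated n-gram.

    Assumption: whitespace tokenization; the real benchmark may use model tokens.
    """
    if not generations:
        return 0.0
    hits = sum(has_repeated_ngram(text.split(), n) for text in generations)
    return hits / len(generations)


# Made-up samples for illustration only.
samples = [
    "the cat sat on the mat and the cat sat on the mat",
    "a short sentence with no repetition at all",
]
rate = repeated_ngram_rate(samples)
print(rate)
```

Under this reading, a value of 0.95 would mean that nearly every sampled generation repeats itself, which matches the limitations noted later in this card.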

Why this checkpoint

The run stayed stable much longer than the earlier 2e-4 cosine attempt, but it still degraded late:

  • step_14000 online validation: 4.1088
  • step_21000 online validation: 4.0772 (best_validation.json)
  • step_22000 online validation: 5.7309
  • step_23000 online validation: 4.9233

That online curve alone would suggest 21000. The post-run repo-native CPU benchmark said otherwise:

  • step_14000 benchmark val_loss_mixed: 5.4493
  • step_21000 benchmark val_loss_mixed: 5.9986
  • step_22000 benchmark val_loss_mixed: 7.0789
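
The disagreement between the two selection signals can be reproduced from the numbers listed above. A small sketch (the dictionaries and the `best_step` helper are illustrative and not part of the repo):

```python
# Online validation losses reported for this run (step -> loss).
online_val = {14000: 4.1088, 21000: 4.0772, 22000: 5.7309, 23000: 4.9233}

# Post-run repo-native benchmark val_loss_mixed for the checkpoints that were checked.
benchmark_mixed = {14000: 5.4493, 21000: 5.9986, 22000: 7.0789}


def best_step(losses):
    """Return the step with the lowest loss."""
    return min(losses, key=losses.get)


online_best = best_step(online_val)
benchmark_best = best_step(benchmark_mixed)

# How much worse the online winner is on the benchmark metric.
gap = benchmark_mixed[online_best] - benchmark_mixed[benchmark_best]

print(f"online-validation winner: step_{online_best}")
print(f"benchmark winner: step_{benchmark_best}")
print(f"benchmark gap: {gap:.4f}")
```

The online minimum at step 21000 is only about 0.03 lower than step 14000 on online validation, while on the benchmark it is roughly 0.55 worse.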

Probe behavior also backed the earlier checkpoint:

  • step_14000 is repetitive but stays closer to the prompt topic
  • step_21000 already drifts more and looks less robust
  • step_22000 is clearly broken

So this release intentionally publishes step_14000.pt rather than the raw online-validation winner.

Training/data provenance

  • training config: training_config.yaml
  • tokenizer: tokenizer.json + tokenizer_meta.json
  • packed dataset root used by the run: /mnt/apps/llm-nanochat/datasets/202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
  • tokenizer root used by the run: /mnt/apps/llm-nanochat/tokenizers/tokenizer_20260515_en50it50_webwiki_stratified_500M

Included files

  • step_14000.pt
  • step_14000.safetensors
  • step_14000.safetensors.json
  • training_config.yaml
  • tokenizer.json
  • tokenizer_meta.json
  • best_validation.json
  • eval_summary.json
  • comparison.json
  • benchmark_report.md
  • probe_step14000_summary.json
  • full run telemetry snapshots: eval_metrics.jsonl, metrics.jsonl, probe_generations.jsonl

Probe reading at step 14000

  • EN factual prompt "The capital of Italy is" -> "Rome": very weak (rank=877, prob=0.0000968)
  • EN simple continuation "A small language model should" -> "be": strong (rank=1, prob=0.5508)
  • IT factual prompt "La capitale d'Italia è" -> "Roma": very weak (rank=665, prob=0.0001030)
  • IT simple continuation "Un piccolo modello linguistico dovrebbe" -> "essere": strong (rank=1, prob=0.2412)

These probes are directional evidence only. The main selection rule here is the repo-native benchmark result plus the qualitative collapse review.
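
For readers unfamiliar with rank/prob probes, the sketch below shows one common way to derive both numbers from next-token logits. It is an assumption about the general technique, not the repo's actual probe code: the function name, the toy vocabulary, and the logit values are all made up for illustration.

```python
import math


def target_rank_and_prob(logits, target_id):
    """Return (rank, probability) of target_id under softmax(logits).

    rank is 1-based: rank=1 means the target is the most likely next token.
    """
    # Numerically stable softmax: subtract the max logit before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    prob = exps[target_id] / sum(exps)
    # Rank = 1 + number of tokens with a strictly higher logit.
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    return rank, prob


# Toy 5-token vocabulary with made-up logits.
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
rank, prob = target_rank_and_prob(toy_logits, target_id=3)
print(rank, round(prob, 4))
```

Read this way, rank=877 with prob near 0.0001 means the model places hundreds of other tokens ahead of the expected answer, while rank=1 means the expected token is its top choice.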

Usage

This project uses a custom NanoChat inference/training stack. The easiest local UI in the source repo is the Chainlit checkpoint tester documented in the repo README.

Limitations

  • quality on the mixed EN/IT benchmark is still in the weak-to-intermediate band
  • generations remain heavily repetitive
  • factual recall is weak in both languages
  • this checkpoint is the best preserved artifact inside this collapsed run family, not a claim of broad downstream excellence
  • dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus