nazdef committed
Commit 5093d1b · verified · 1 Parent(s): 8777806

Upload best checkpoint for v5 bs6 WSD fast-decay run (step 10000)

README.md ADDED
@@ -0,0 +1,74 @@
1
+ ---
2
+ language:
3
+ - en
4
+ - it
5
+ license: other
6
+ library_name: custom
7
+ pipeline_tag: text-generation
8
+ tags:
9
+ - nanochat
10
+ - gpt2-small
11
+ - bilingual
12
+ - english
13
+ - italian
14
+ - pretraining
15
+ ---
16
+
17
+ # gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-step10000
18
+
19
+ This repo stages the best saved checkpoint from the local NanoChat EN/IT GPT-2-small-like run `20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay`.
20
+
21
+ ## What this is
22
+
23
+ - model family: GPT-2-small-like decoder-only LM
24
+ - parameters: ~136M
25
+ - languages: English + Italian
26
+ - context length: 2500 tokens
27
+ - selected checkpoint: `step_10000.pt`
28
+ - selection reason: lowest validation loss among the saved checkpoints, as recorded in `best_validation.json`
29
+
30
+ ## Best validation
31
+
32
+ - step: 10000
33
+ - validation loss: 3.8945770748
34
+ - validation perplexity: 49.1352684243
35
+ - validation batches: 128
36
+
37
+ ## Important caveat
38
+
39
+ This checkpoint has the lowest validation loss **within this run family** only. It is a useful intermediate bilingual pretraining artifact, not a polished factual assistant model.
40
+
41
+ ## Training/data provenance
42
+
43
+ - training config: `training_config.yaml`
44
+ - tokenizer: `tokenizer.json` + `tokenizer_meta.json`
45
+ - packed dataset root used by the run: `/mnt/apps/llm-nanochat/datasets/202605011052_fresh_50_50_score100_2500_sourcebalanced`
46
+ - tokenizer root used by the run: `/mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch`
47
+
48
+ ## Included files
49
+
50
+ - `step_10000.pt`
51
+ - `step_10000.safetensors`
52
+ - `step_10000.safetensors.json`
53
+ - `training_config.yaml`
54
+ - `tokenizer.json`
55
+ - `tokenizer_meta.json`
56
+ - `best_validation.json`
57
+ - `eval_summary.json`
58
+ - `probe_step10000_summary.json`
59
+ - full run telemetry snapshots: `eval_metrics.jsonl`, `metrics.jsonl`, `probe_generations.jsonl`
60
+
61
+ ## Probe reading at step 10000
62
+
63
+ The run includes probe telemetry, but the stored payload for this experiment is legacy/partial: the `probe_generations.jsonl` entries at step `10000` keep the prompts and expected continuations, while the generated-text and target-rank fields are null. This release therefore does **not** make strong probe-quality claims from those rows.
64
+
65
+ ## Usage
66
+
67
+ This checkpoint is meant to be loaded with the project's custom NanoChat inference/training stack. The easiest local UI is the Chainlit checkpoint tester documented in the source repo's README.
68
+
69
+ ## Limitations
70
+
71
+ - factual recall is still limited
72
+ - generations may become repetitive
73
+ - the model was selected by validation loss inside this run family, not by broad downstream benchmark performance
74
+ - redistribution of the full training corpus may be subject to separate licensing constraints; this repo contains model artifacts only, not the raw or prepared training corpus
best_validation.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "step": 10000,
3
+ "validation_loss": 3.8945770747959614,
4
+ "validation_perplexity": 49.1352684243327,
5
+ "validation_num_batches": 128,
6
+ "elapsed_sec": 82998.0124297142,
7
+ "checkpoint_path": "/mnt/apps/llm-nanochat/checkpoints/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay/step_10000.pt"
8
+ }
eval_metrics.jsonl ADDED
@@ -0,0 +1,10 @@
1
+ {"step": 1000, "validation_loss": 5.782403785735369, "validation_perplexity": 324.5383742611778, "validation_num_batches": 128, "elapsed_sec": 8294.764070749283}
2
+ {"step": 2000, "validation_loss": 4.983233451843262, "validation_perplexity": 145.94552736410432, "validation_num_batches": 128, "elapsed_sec": 16590.22207093239}
3
+ {"step": 3000, "validation_loss": 4.501660106703639, "validation_perplexity": 90.16669345385398, "validation_num_batches": 128, "elapsed_sec": 24894.24841451645}
4
+ {"step": 4000, "validation_loss": 4.2645480167120695, "validation_perplexity": 71.13276188688639, "validation_num_batches": 128, "elapsed_sec": 33190.52204680443}
5
+ {"step": 5000, "validation_loss": 4.109950916841626, "validation_perplexity": 60.94372618564526, "validation_num_batches": 128, "elapsed_sec": 41488.19192171097}
6
+ {"step": 6000, "validation_loss": 3.989347394555807, "validation_perplexity": 54.01962435656213, "validation_num_batches": 128, "elapsed_sec": 49786.84053468704}
7
+ {"step": 7000, "validation_loss": 4.209565173834562, "validation_perplexity": 67.32725779094922, "validation_num_batches": 128, "elapsed_sec": 58085.096895217896}
8
+ {"step": 8000, "validation_loss": 4.077932074666023, "validation_perplexity": 59.023287809598706, "validation_num_batches": 128, "elapsed_sec": 66384.57489657402}
9
+ {"step": 9000, "validation_loss": 3.966189209371805, "validation_perplexity": 52.78300212187296, "validation_num_batches": 128, "elapsed_sec": 74691.24132275581}
10
+ {"step": 10000, "validation_loss": 3.8945770747959614, "validation_perplexity": 49.1352684243327, "validation_num_batches": 128, "elapsed_sec": 82998.0124297142}
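The selection rule stated in the README (lowest validation loss among the evaluated checkpoints) can be re-derived from the records above. The following is a minimal sketch, not the project's own tooling: the helper names `load_records`, `best_record`, `perplexity_matches_loss`, and `loss_regressions` are hypothetical, and the records are embedded as a string so the example runs without the repo files. It also checks that each logged perplexity equals `exp(validation_loss)` within a loose tolerance.

```python
import json
import math

# Records copied from eval_metrics.jsonl (other fields omitted for brevity).
EVAL_METRICS_JSONL = """
{"step": 1000, "validation_loss": 5.782403785735369, "validation_perplexity": 324.5383742611778}
{"step": 2000, "validation_loss": 4.983233451843262, "validation_perplexity": 145.94552736410432}
{"step": 3000, "validation_loss": 4.501660106703639, "validation_perplexity": 90.16669345385398}
{"step": 4000, "validation_loss": 4.2645480167120695, "validation_perplexity": 71.13276188688639}
{"step": 5000, "validation_loss": 4.109950916841626, "validation_perplexity": 60.94372618564526}
{"step": 6000, "validation_loss": 3.989347394555807, "validation_perplexity": 54.01962435656213}
{"step": 7000, "validation_loss": 4.209565173834562, "validation_perplexity": 67.32725779094922}
{"step": 8000, "validation_loss": 4.077932074666023, "validation_perplexity": 59.023287809598706}
{"step": 9000, "validation_loss": 3.966189209371805, "validation_perplexity": 52.78300212187296}
{"step": 10000, "validation_loss": 3.8945770747959614, "validation_perplexity": 49.1352684243327}
"""


def load_records(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def best_record(records):
    """Return the record with the lowest validation loss."""
    return min(records, key=lambda r: r["validation_loss"])


def perplexity_matches_loss(record, rel_tol=1e-4):
    """Check that the logged perplexity is exp(loss), within a loose tolerance."""
    expected = math.exp(record["validation_loss"])
    return math.isclose(expected, record["validation_perplexity"], rel_tol=rel_tol)


def loss_regressions(records):
    """Return the steps whose validation loss is higher than the previous evaluation."""
    ordered = sorted(records, key=lambda r: r["step"])
    return [
        cur["step"]
        for prev, cur in zip(ordered, ordered[1:])
        if cur["validation_loss"] > prev["validation_loss"]
    ]


if __name__ == "__main__":
    records = load_records(EVAL_METRICS_JSONL)
    print("best step:", best_record(records)["step"])
    print("steps with a loss increase:", loss_regressions(records))
```

On these ten records the minimum falls at step 10000, and the only evaluation where the loss rose relative to the previous one is step 7000.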
eval_summary.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "model_name": "gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-step10000",
3
+ "selected_checkpoint": "step_10000.pt",
4
+ "selection_reason": "best_validation.json minimum validation loss for this run",
5
+ "best_validation": {
6
+ "step": 10000,
7
+ "validation_loss": 3.8945770747959614,
8
+ "validation_perplexity": 49.1352684243327,
9
+ "validation_num_batches": 128,
10
+ "elapsed_sec": 82998.0124297142
11
+ },
12
+ "final_validation_step_10000": {
13
+ "step": 10000,
14
+ "validation_loss": 3.8945770747959614,
15
+ "validation_perplexity": 49.1352684243327,
16
+ "validation_num_batches": 128,
17
+ "elapsed_sec": 82998.0124297142
18
+ },
19
+ "notes": [
20
+ "This is the best saved checkpoint of the stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay run.",
21
+ "For this run the final saved checkpoint step_10000.pt is also the best validation checkpoint.",
22
+ "Probe telemetry exists, but this run wrote legacy/null probe target fields, so probe quality claims are intentionally conservative."
23
+ ],
24
+ "tokenizer_dir": "/mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch",
25
+ "dataset_dir": "/mnt/apps/llm-nanochat/datasets/202605011052_fresh_50_50_score100_2500_sourcebalanced"
26
+ }
metrics.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
probe_generations.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
probe_step10000_summary.json ADDED
@@ -0,0 +1,38 @@
1
+ [
2
+ {
3
+ "language": "en",
4
+ "prompt": "The capital of Italy is",
5
+ "expected_next_text": " Rome",
6
+ "completion": null,
7
+ "correct_token_rank": null,
8
+ "correct_token_probability": null,
9
+ "note": "legacy/null probe payload in this run"
10
+ },
11
+ {
12
+ "language": "en",
13
+ "prompt": "A small language model should",
14
+ "expected_next_text": " be",
15
+ "completion": null,
16
+ "correct_token_rank": null,
17
+ "correct_token_probability": null,
18
+ "note": "legacy/null probe payload in this run"
19
+ },
20
+ {
21
+ "language": "it",
22
+ "prompt": "La capitale d'Italia è",
23
+ "expected_next_text": " Roma",
24
+ "completion": null,
25
+ "correct_token_rank": null,
26
+ "correct_token_probability": null,
27
+ "note": "legacy/null probe payload in this run"
28
+ },
29
+ {
30
+ "language": "it",
31
+ "prompt": "Un piccolo modello linguistico dovrebbe",
32
+ "expected_next_text": " essere",
33
+ "completion": null,
34
+ "correct_token_rank": null,
35
+ "correct_token_probability": null,
36
+ "note": "legacy/null probe payload in this run"
37
+ }
38
+ ]
step_10000.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9ff6ac107365b9e3820863926c8990299b4be6e52c9587ce7033fc165ad09e2c
3
+ size 1633717975
step_10000.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f8b68a5746314c08c2162f1e00bce5c26a70f71a71c4a4eadb1646b68568170f
3
+ size 544531144
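The two weight files are stored as Git LFS pointers, each giving a SHA-256 digest and a byte size. A downloaded file can be checked against its pointer with the standard library alone. This is a minimal sketch under stated assumptions: `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local path in the usage comment is a placeholder for wherever the file was downloaded.

```python
import hashlib
from pathlib import Path

# Pointer text copied from the step_10000.safetensors entry above.
SAFETENSORS_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f8b68a5746314c08c2162f1e00bce5c26a70f71a71c4a4eadb1646b68568170f
size 544531144
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its hash algorithm, hex digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and digest the pointer records."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Hash in chunks so large checkpoints are not read into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Usage (placeholder path):
# pointer = parse_lfs_pointer(SAFETENSORS_POINTER)
# print(matches_pointer("step_10000.safetensors", pointer))
```

The size check runs first because it is cheap and rejects truncated downloads before any hashing.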
step_10000.safetensors.json ADDED
@@ -0,0 +1,287 @@
1
+ {
2
+ "checkpoint_config": {
3
+ "actual_precision": "bf16",
4
+ "adamw_betas": [
5
+ 0.9,
6
+ 0.95
7
+ ],
8
+ "adamw_eps": 1e-08,
9
+ "attention_kernel_policy": "auto",
10
+ "batch_size": 6,
11
+ "benchmark": {
12
+ "enable_central_tensorboard": true,
13
+ "enable_local_tensorboard": true,
14
+ "enabled": false,
15
+ "output_path": "/mnt/apps/llm-nanochat/artifacts/runs/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay/throughput_benchmark.json",
16
+ "warmup_steps": 0
17
+ },
18
+ "checkpoint_dir": "/mnt/apps/llm-nanochat/checkpoints/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay",
19
+ "clip_grad_norm": 1.0,
20
+ "compile": {
21
+ "backend": null,
22
+ "compile_setup_sec": 0.0,
23
+ "diagnostic": null,
24
+ "dynamic": false,
25
+ "enabled": false,
26
+ "error_policy": "raise",
27
+ "fullgraph": false,
28
+ "mode": null,
29
+ "requested": false,
30
+ "status": "disabled"
31
+ },
32
+ "dataset": {
33
+ "storage_mode": "indexed_jsonl"
34
+ },
35
+ "decay_steps": 2500,
36
+ "deterministic_algorithms": false,
37
+ "device": "cuda",
38
+ "dim": 768,
39
+ "final_lr": 1e-05,
40
+ "fp8_backend": null,
41
+ "grad_accum_steps": 16,
42
+ "learning_rate": 0.0002,
43
+ "logging": {
44
+ "enable_central_tensorboard": true,
45
+ "enable_local_tensorboard": true,
46
+ "metrics_flush_every_steps": 1,
47
+ "metrics_writer": "persistent_jsonl_handle"
48
+ },
49
+ "lr": 0.0002,
50
+ "lr_schedule": "wsd",
51
+ "max_seq_len": 2500,
52
+ "max_steps": 10000,
53
+ "n_heads": 12,
54
+ "n_layers": 12,
55
+ "optimizer": {
56
+ "backend": "torch",
57
+ "betas": [
58
+ 0.9,
59
+ 0.95
60
+ ],
61
+ "eps": 1e-08,
62
+ "implementation": "torch.optim.AdamW",
63
+ "learning_rate": 0.0002,
64
+ "state_precision": "full_precision",
65
+ "type": "adamw",
66
+ "weight_decay": 0.1
67
+ },
68
+ "optimizer_backend": "torch",
69
+ "optimizer_implementation": "torch.optim.AdamW",
70
+ "optimizer_state_precision": "full_precision",
71
+ "optimizer_type": "adamw",
72
+ "peak_lr": 0.0002,
73
+ "repro": {
74
+ "attention_kernel_policy": "auto",
75
+ "cublas_workspace_config": null,
76
+ "cudnn_benchmark": true,
77
+ "cudnn_deterministic": false,
78
+ "deterministic_algorithms": false,
79
+ "flash_sdp_enabled": true,
80
+ "math_sdp_enabled": true,
81
+ "mem_efficient_sdp_enabled": true,
82
+ "pythonhashseed": "1337",
83
+ "seed": 1337
84
+ },
85
+ "requested_precision": "bf16",
86
+ "save_every_steps": 500,
87
+ "scheduler": {
88
+ "decay_steps": 2500,
89
+ "final_lr": 1e-05,
90
+ "peak_lr": 0.0002,
91
+ "schedule_type": "wsd",
92
+ "stable_steps": 7000,
93
+ "total_steps": 10000,
94
+ "warmup_steps": 500
95
+ },
96
+ "seed": 1337,
97
+ "stable_steps": 7000,
98
+ "train_cache_ram_bytes": 1073741824,
99
+ "train_cache_ram_mb": 1024,
100
+ "vocab_size": 32000,
101
+ "warmup_steps": 500,
102
+ "weight_decay": 0.1
103
+ },
104
+ "checkpoint_path": "/mnt/apps/llm-nanochat/checkpoints/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay/step_10000.pt",
105
+ "exported_at": "2026-05-19T13:49:55.443068+00:00",
106
+ "format": "llm-nanochat-safetensors-export",
107
+ "global_step": 10000,
108
+ "metadata_path": "/mnt/apps/llm-nanochat/hf_exports/gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-step10000/step_10000.safetensors.json",
109
+ "model_config": {
110
+ "dim": 768,
111
+ "max_seq_len": 2500,
112
+ "n_heads": 12,
113
+ "n_layers": 12,
114
+ "vocab_size": 32000
115
+ },
116
+ "num_parameters": 136128000,
117
+ "num_tensors": 149,
118
+ "provenance": {
119
+ "checkpoint_dir": "/mnt/apps/llm-nanochat/checkpoints/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay",
120
+ "checkpoint_name": "step_10000.pt",
121
+ "checkpoint_path": "/mnt/apps/llm-nanochat/checkpoints/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay/step_10000.pt",
122
+ "global_step": 10000,
123
+ "packed_dataset_config_path": null,
124
+ "run_dir": "/mnt/apps/llm-nanochat/artifacts/runs/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay",
125
+ "tokenizer_dir": "/mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch",
126
+ "training_config_path": "/home/descanso/.openclaw/workspace/python_project/llm-nanochat/configs/testing/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay.yaml"
127
+ },
128
+ "safetensors_path": "/mnt/apps/llm-nanochat/hf_exports/gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-step10000/step_10000.safetensors",
129
+ "source_checkpoint_path": "/mnt/apps/llm-nanochat/checkpoints/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay/step_10000.pt",
130
+ "source_global_step": 10000,
131
+ "tensor_names": [
132
+ "token_emb.weight",
133
+ "pos_emb.weight",
134
+ "blocks.layers.0.self_attn.in_proj_weight",
135
+ "blocks.layers.0.self_attn.in_proj_bias",
136
+ "blocks.layers.0.self_attn.out_proj.weight",
137
+ "blocks.layers.0.self_attn.out_proj.bias",
138
+ "blocks.layers.0.linear1.weight",
139
+ "blocks.layers.0.linear1.bias",
140
+ "blocks.layers.0.linear2.weight",
141
+ "blocks.layers.0.linear2.bias",
142
+ "blocks.layers.0.norm1.weight",
143
+ "blocks.layers.0.norm1.bias",
144
+ "blocks.layers.0.norm2.weight",
145
+ "blocks.layers.0.norm2.bias",
146
+ "blocks.layers.1.self_attn.in_proj_weight",
147
+ "blocks.layers.1.self_attn.in_proj_bias",
148
+ "blocks.layers.1.self_attn.out_proj.weight",
149
+ "blocks.layers.1.self_attn.out_proj.bias",
150
+ "blocks.layers.1.linear1.weight",
151
+ "blocks.layers.1.linear1.bias",
152
+ "blocks.layers.1.linear2.weight",
153
+ "blocks.layers.1.linear2.bias",
154
+ "blocks.layers.1.norm1.weight",
155
+ "blocks.layers.1.norm1.bias",
156
+ "blocks.layers.1.norm2.weight",
157
+ "blocks.layers.1.norm2.bias",
158
+ "blocks.layers.2.self_attn.in_proj_weight",
159
+ "blocks.layers.2.self_attn.in_proj_bias",
160
+ "blocks.layers.2.self_attn.out_proj.weight",
161
+ "blocks.layers.2.self_attn.out_proj.bias",
162
+ "blocks.layers.2.linear1.weight",
163
+ "blocks.layers.2.linear1.bias",
164
+ "blocks.layers.2.linear2.weight",
165
+ "blocks.layers.2.linear2.bias",
166
+ "blocks.layers.2.norm1.weight",
167
+ "blocks.layers.2.norm1.bias",
168
+ "blocks.layers.2.norm2.weight",
169
+ "blocks.layers.2.norm2.bias",
170
+ "blocks.layers.3.self_attn.in_proj_weight",
171
+ "blocks.layers.3.self_attn.in_proj_bias",
172
+ "blocks.layers.3.self_attn.out_proj.weight",
173
+ "blocks.layers.3.self_attn.out_proj.bias",
174
+ "blocks.layers.3.linear1.weight",
175
+ "blocks.layers.3.linear1.bias",
176
+ "blocks.layers.3.linear2.weight",
177
+ "blocks.layers.3.linear2.bias",
178
+ "blocks.layers.3.norm1.weight",
179
+ "blocks.layers.3.norm1.bias",
180
+ "blocks.layers.3.norm2.weight",
181
+ "blocks.layers.3.norm2.bias",
182
+ "blocks.layers.4.self_attn.in_proj_weight",
183
+ "blocks.layers.4.self_attn.in_proj_bias",
184
+ "blocks.layers.4.self_attn.out_proj.weight",
185
+ "blocks.layers.4.self_attn.out_proj.bias",
186
+ "blocks.layers.4.linear1.weight",
187
+ "blocks.layers.4.linear1.bias",
188
+ "blocks.layers.4.linear2.weight",
189
+ "blocks.layers.4.linear2.bias",
190
+ "blocks.layers.4.norm1.weight",
191
+ "blocks.layers.4.norm1.bias",
192
+ "blocks.layers.4.norm2.weight",
193
+ "blocks.layers.4.norm2.bias",
194
+ "blocks.layers.5.self_attn.in_proj_weight",
195
+ "blocks.layers.5.self_attn.in_proj_bias",
196
+ "blocks.layers.5.self_attn.out_proj.weight",
197
+ "blocks.layers.5.self_attn.out_proj.bias",
198
+ "blocks.layers.5.linear1.weight",
199
+ "blocks.layers.5.linear1.bias",
200
+ "blocks.layers.5.linear2.weight",
201
+ "blocks.layers.5.linear2.bias",
202
+ "blocks.layers.5.norm1.weight",
203
+ "blocks.layers.5.norm1.bias",
204
+ "blocks.layers.5.norm2.weight",
205
+ "blocks.layers.5.norm2.bias",
206
+ "blocks.layers.6.self_attn.in_proj_weight",
207
+ "blocks.layers.6.self_attn.in_proj_bias",
208
+ "blocks.layers.6.self_attn.out_proj.weight",
209
+ "blocks.layers.6.self_attn.out_proj.bias",
210
+ "blocks.layers.6.linear1.weight",
211
+ "blocks.layers.6.linear1.bias",
212
+ "blocks.layers.6.linear2.weight",
213
+ "blocks.layers.6.linear2.bias",
214
+ "blocks.layers.6.norm1.weight",
215
+ "blocks.layers.6.norm1.bias",
216
+ "blocks.layers.6.norm2.weight",
217
+ "blocks.layers.6.norm2.bias",
218
+ "blocks.layers.7.self_attn.in_proj_weight",
219
+ "blocks.layers.7.self_attn.in_proj_bias",
220
+ "blocks.layers.7.self_attn.out_proj.weight",
221
+ "blocks.layers.7.self_attn.out_proj.bias",
222
+ "blocks.layers.7.linear1.weight",
223
+ "blocks.layers.7.linear1.bias",
224
+ "blocks.layers.7.linear2.weight",
225
+ "blocks.layers.7.linear2.bias",
226
+ "blocks.layers.7.norm1.weight",
227
+ "blocks.layers.7.norm1.bias",
228
+ "blocks.layers.7.norm2.weight",
229
+ "blocks.layers.7.norm2.bias",
230
+ "blocks.layers.8.self_attn.in_proj_weight",
231
+ "blocks.layers.8.self_attn.in_proj_bias",
232
+ "blocks.layers.8.self_attn.out_proj.weight",
233
+ "blocks.layers.8.self_attn.out_proj.bias",
234
+ "blocks.layers.8.linear1.weight",
235
+ "blocks.layers.8.linear1.bias",
236
+ "blocks.layers.8.linear2.weight",
237
+ "blocks.layers.8.linear2.bias",
238
+ "blocks.layers.8.norm1.weight",
239
+ "blocks.layers.8.norm1.bias",
240
+ "blocks.layers.8.norm2.weight",
241
+ "blocks.layers.8.norm2.bias",
242
+ "blocks.layers.9.self_attn.in_proj_weight",
243
+ "blocks.layers.9.self_attn.in_proj_bias",
244
+ "blocks.layers.9.self_attn.out_proj.weight",
245
+ "blocks.layers.9.self_attn.out_proj.bias",
246
+ "blocks.layers.9.linear1.weight",
247
+ "blocks.layers.9.linear1.bias",
248
+ "blocks.layers.9.linear2.weight",
249
+ "blocks.layers.9.linear2.bias",
250
+ "blocks.layers.9.norm1.weight",
251
+ "blocks.layers.9.norm1.bias",
252
+ "blocks.layers.9.norm2.weight",
253
+ "blocks.layers.9.norm2.bias",
254
+ "blocks.layers.10.self_attn.in_proj_weight",
255
+ "blocks.layers.10.self_attn.in_proj_bias",
256
+ "blocks.layers.10.self_attn.out_proj.weight",
257
+ "blocks.layers.10.self_attn.out_proj.bias",
258
+ "blocks.layers.10.linear1.weight",
259
+ "blocks.layers.10.linear1.bias",
260
+ "blocks.layers.10.linear2.weight",
261
+ "blocks.layers.10.linear2.bias",
262
+ "blocks.layers.10.norm1.weight",
263
+ "blocks.layers.10.norm1.bias",
264
+ "blocks.layers.10.norm2.weight",
265
+ "blocks.layers.10.norm2.bias",
266
+ "blocks.layers.11.self_attn.in_proj_weight",
267
+ "blocks.layers.11.self_attn.in_proj_bias",
268
+ "blocks.layers.11.self_attn.out_proj.weight",
269
+ "blocks.layers.11.self_attn.out_proj.bias",
270
+ "blocks.layers.11.linear1.weight",
271
+ "blocks.layers.11.linear1.bias",
272
+ "blocks.layers.11.linear2.weight",
273
+ "blocks.layers.11.linear2.bias",
274
+ "blocks.layers.11.norm1.weight",
275
+ "blocks.layers.11.norm1.bias",
276
+ "blocks.layers.11.norm2.weight",
277
+ "blocks.layers.11.norm2.bias",
278
+ "ln_f.weight",
279
+ "ln_f.bias",
280
+ "head.weight"
281
+ ],
282
+ "tokenizer_reference": {
283
+ "packed_dataset_config_path": null,
284
+ "tokenizer_dir": "/mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch",
285
+ "training_config_path": "/home/descanso/.openclaw/workspace/python_project/llm-nanochat/configs/testing/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay.yaml"
286
+ }
287
+ }
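The `num_parameters` and `num_tensors` values in this metadata can be reproduced from `model_config` and the tensor names. The sketch below makes two assumptions that the export does not state: the feed-forward width is `4 * dim = 3072`, and the tensors have the usual shapes for their names (fused QKV projection with bias, two-layer MLP with biases, LayerNorm weight and bias, an untied `head.weight` with no bias). The function names are hypothetical. Under those assumptions the total comes out to exactly the recorded 136,128,000.

```python
def count_parameters(vocab_size=32000, dim=768, n_layers=12,
                     max_seq_len=2500, ffn_dim=3072):
    """Sum parameter counts per tensor group; ffn_dim=3072 is an assumption."""
    token_emb = vocab_size * dim                     # token_emb.weight
    pos_emb = max_seq_len * dim                      # pos_emb.weight
    attn = (3 * dim * dim + 3 * dim) + (dim * dim + dim)  # in_proj + out_proj
    mlp = (dim * ffn_dim + ffn_dim) + (ffn_dim * dim + dim)  # linear1 + linear2
    norms = 2 * (2 * dim)                            # norm1 and norm2, weight + bias
    ln_f = 2 * dim                                   # final LayerNorm
    head = vocab_size * dim                          # head.weight, listed separately
    return token_emb + pos_emb + n_layers * (attn + mlp + norms) + ln_f + head


def count_tensors(n_layers=12, tensors_per_layer=12):
    """Two embeddings, 12 tensors per block, ln_f weight and bias, and the head."""
    return 2 + n_layers * tensors_per_layer + 2 + 1


if __name__ == "__main__":
    params = count_parameters()
    print("parameters:", params)
    print("tensors:", count_tensors())
    # Storage estimate if every parameter is saved as 4-byte float32.
    print("float32 bytes:", params * 4)
```

The float32 estimate is 544,512,000 bytes, about 19 KB less than the 544,531,144-byte `step_10000.safetensors` file. That gap is consistent with float32 weights plus a file header, though the export metadata does not state the stored dtype.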
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_meta.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "vocab_size_requested": 32000,
3
+ "vocab_size_actual": 32000,
4
+ "special_tokens": [
5
+ "<pad>",
6
+ "<bos>",
7
+ "<eos>",
8
+ "<unk>"
9
+ ]
10
+ }
training_config.yaml ADDED
@@ -0,0 +1,58 @@
1
+ # Experimental WSD variant: longer stable phase, shorter decay.
2
+ # Goal: reduce time spent in the suspected unstable LR band ~6e-5 -> 1e-4.
3
+ # Keep experiment-only variants under configs/testing/.
4
+
5
+ dataset_dir: /mnt/apps/llm-nanochat/datasets/202605011052_fresh_50_50_score100_2500_sourcebalanced
6
+ output_dir: /mnt/apps/llm-nanochat/artifacts/runs/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay
7
+ tokenizer_dir: /mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch
8
+ seed: 1337
9
+
10
+ model:
11
+ vocab_size: 32000
12
+ dim: 768
13
+ n_layers: 12
14
+ n_heads: 12
15
+
16
+ training:
17
+ sequence_length: 2500
18
+ max_steps: 10000
19
+ batch_size: 6
20
+ grad_accum_steps: 16
21
+
22
+ learning_rate: 0.0002
23
+ peak_lr: 0.0002
24
+ lr_schedule: wsd
25
+
26
+ warmup_steps: 500
27
+ stable_steps: 7000
28
+ decay_steps: 2500
29
+ final_lr: 1.0e-05
30
+
31
+ adamw_betas:
32
+ - 0.9
33
+ - 0.95
34
+ adamw_eps: 1.0e-08
35
+ weight_decay: 0.1
36
+ clip_grad_norm: 1.0
37
+
38
+ save_every_steps: 500
39
+ checkpoint_dir: /mnt/apps/llm-nanochat/checkpoints/20260517_stable-config-recipe-v5-gpt2small-lr2e4-batchmaxpossible-bs6-wsd-fastdecay
40
+ precision: bf16
41
+
42
+ evaluation:
43
+ validation_every_steps: 1000
44
+ validation_max_batches: 128
45
+ probe_every_steps: 1000
46
+ probe_tokenizer_dir: /mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch
47
+ probe_max_new_tokens: 32
48
+ probe_prompts:
49
+ en:
50
+ - prompt: "The capital of Italy is"
51
+ expected_next_text: " Rome"
52
+ - prompt: "A small language model should"
53
+ expected_next_text: " be"
54
+ it:
55
+ - prompt: "La capitale d'Italia è"
56
+ expected_next_text: " Roma"
57
+ - prompt: "Un piccolo modello linguistico dovrebbe"
58
+ expected_next_text: " essere"
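The scheduler values in this config (500 warmup steps, 7000 stable steps, 2500 decay steps, peak 2e-4, final 1e-5) describe a warmup-stable-decay (WSD) learning-rate schedule whose three phases add up to `max_steps`. The sketch below illustrates such a schedule with those numbers. The shape of each ramp is an assumption: the config does not say whether warmup and decay are linear, cosine, or something else, so linear ramps are used purely for illustration. The token count per optimizer step likewise assumes a single device, fully packed sequences, and that `max_steps` counts optimizer steps. `wsd_lr` and `tokens_per_optimizer_step` are hypothetical names.

```python
def wsd_lr(step, peak_lr=2e-4, final_lr=1e-5,
           warmup_steps=500, stable_steps=7000, decay_steps=2500):
    """Learning rate at a 0-indexed step, assuming linear warmup and linear decay."""
    if step < warmup_steps:
        # Assumed linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    # Assumed linear interpolation from peak_lr down to final_lr.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


def tokens_per_optimizer_step(batch_size=6, grad_accum_steps=16, sequence_length=2500):
    """Tokens consumed per optimizer step, assuming one device and packed sequences."""
    return batch_size * grad_accum_steps * sequence_length


if __name__ == "__main__":
    for step in (0, 499, 500, 7499, 7500, 8750, 10000):
        print(step, f"{wsd_lr(step):.6g}")
    per_step = tokens_per_optimizer_step()
    print("tokens per optimizer step:", per_step)
    print("tokens over 10000 steps:", per_step * 10000)
```

Under these assumptions each optimizer step covers 240,000 tokens, or 2.4 billion tokens over the full run. With a linear decay, the learning rate would cross the 1e-4 to 6e-5 band mentioned in the config comment between roughly steps 8800 and 9350.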