kth8 committed · Commit 83772e2 · verified · 1 Parent(s): c5a7299

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,143 @@
+ ---
+ license: gemma
+ language:
+ - en
+ base_model: unsloth/gemma-3-270m-it
+ datasets:
+ - kth8/title-generation-25000x
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - sft
+ - trl
+ - unsloth
+ - gemma
+ - gemma3
+ - gemma3_text
+ ---
+ ![logo](https://storage.googleapis.com/gweb-developer-goog-blog-assets/images/gemma-3_2.original.png)
+ A supervised fine-tune of [unsloth/gemma-3-270m-it](https://huggingface.co/unsloth/gemma-3-270m-it) on the [kth8/title-generation-25000x](https://huggingface.co/datasets/kth8/title-generation-25000x) dataset.
+ Trained on an English-only subset of ~17K examples, because of base model limitations, using the exact system prompt that OpenCode's [title agent uses](https://raw.githubusercontent.com/anomalyco/opencode/refs/heads/dev/packages/opencode/src/agent/prompt/title.txt).
+
+ ## Usage example
+
+ **System prompt**
+ ```
+ You are a title generator. You output ONLY a thread title. Nothing else.
+
+ <task>
+ Generate a brief title that would help the user find this conversation later.
+
+ Follow all rules in <rules>
+ Use the <examples> so you know what a good title looks like.
+ Your output must be:
+ - A single line
+ - ≤50 characters
+ - No explanations
+ </task>
+
+ <rules>
+ - you MUST use the same language as the user message you are summarizing
+ - Title must be grammatically correct and read naturally - no word salad
+ - Never include tool names in the title (e.g. "read tool", "bash tool", "edit tool")
+ - Focus on the main topic or question the user needs to retrieve
+ - Vary your phrasing - avoid repetitive patterns like always starting with "Analyzing"
+ - When a file is mentioned, focus on WHAT the user wants to do WITH the file, not just that they shared it
+ - Keep exact: technical terms, numbers, filenames, HTTP codes
+ - Remove: the, this, my, a, an
+ - Never assume tech stack
+ - Never use tools
+ - NEVER respond to questions, just generate a title for the conversation
+ - The title should NEVER include "summarizing" or "generating" when generating a title
+ - DO NOT SAY YOU CANNOT GENERATE A TITLE OR COMPLAIN ABOUT THE INPUT
+ - Always output something meaningful, even if the input is minimal.
+ - If the user message is short or conversational (e.g. "hello", "lol", "what's up", "hey"):
+ → create a title that reflects the user's tone or intent (such as Greeting, Quick check-in, Light chat, Intro message, etc.)
+ </rules>
+
+ <examples>
+ "debug 500 errors in production" → Debugging production 500 errors
+ "refactor user service" → Refactoring user service
+ "why is app.js failing" → app.js failure investigation
+ "implement rate limiting" → Rate limiting implementation
+ "how do I connect postgres to my API" → Postgres API connection
+ "best practices for React hooks" → React hooks best practices
+ "@src/auth.ts can you add refresh token support" → Auth refresh token support
+ "@utils/parser.ts this is broken" → Parser bug fix
+ "look at @config.json" → Config review
+ "@App.tsx add dark mode toggle" → Dark mode toggle in App
+ </examples>
+ ```
+ **User prompt**
+ ```
+ If there were 200 students who passed an English course three years ago, and each subsequent year until the current one that number increased by 50% of the previous year's number, how many students will pass the course this year?
+ ```
+ **Assistant response**
+ ```
+ Student course passing growth calculation
+ ```
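Reviewer note: the usage example above maps onto the usual system/user message list. The sketch below shows one way to assemble that list and to tidy a raw completion so it respects the "single line" and "≤50 characters" requirements from the system prompt. The helper names `build_messages` and `clean_title` are hypothetical and are not part of this repository; the word-boundary truncation is an illustrative choice, not something the model card prescribes.

```python
MAX_TITLE_CHARS = 50  # mirrors the "≤50 characters" requirement in the system prompt


def build_messages(system_prompt: str, user_text: str) -> list[dict]:
    """Pair the system prompt with the text that needs a title."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


def clean_title(raw: str, limit: int = MAX_TITLE_CHARS) -> str:
    """Keep the first non-empty line; shorten it if it exceeds the limit."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    title = lines[0] if lines else ""
    if len(title) <= limit:
        return title
    cut = title[:limit]
    if " " in cut:
        # Drop the last, possibly partial, word instead of cutting mid-word.
        cut = cut.rsplit(" ", 1)[0]
    return cut.rstrip()
```

The message list has the shape that a tokenizer's `apply_chat_template` method in `transformers` accepts, so it could be passed there before generation; `clean_title` would then be applied to the decoded completion.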
+ ## Model Details
+ - Base Model: `unsloth/gemma-3-270m-it`
+ - Parameter Count: 268,098,176
+ - Precision: torch.bfloat16
+
+ ## Training Settings
+
+ ### PEFT
+ - Rank: 32
+ - LoRA alpha: 64
+ - Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
+ - Gradient checkpointing: unsloth
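Reviewer note: to put the PEFT settings in perspective, the sketch below estimates how many adapter weights they add. It assumes plain LoRA, where each adapted linear layer with `fan_in` inputs and `fan_out` outputs gains two matrices of `rank * fan_in` and `fan_out * rank` weights, scaled by `alpha / rank`. The layer shapes are taken from `config.json` in this commit; the variable names are illustrative.

```python
RANK, ALPHA = 32, 64
HIDDEN, HEAD_DIM, Q_HEADS, KV_HEADS = 640, 256, 4, 1
INTERMEDIATE, LAYERS = 2048, 18

# (fan_in, fan_out) for each module listed under "Modules".
shapes = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter contributes rank * (fan_in + fan_out) weights.
per_layer = sum(RANK * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
lora_params = per_layer * LAYERS
scaling = ALPHA / RANK

print(per_layer, lora_params, scaling)  # 421888 7593984 2.0
```

Under these assumptions the adapters hold about 7.6M weights, roughly 2.8% of the 268,098,176 base parameters.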
+
+ ### SFT
+ - Epochs: 1
+ - Batch size: 8
+ - Gradient accumulation steps: 2
+ - Learning rate: 0.0002
+ - Optimizer: adamw_torch_fused
+ - Learning rate scheduler: cosine
+ - Warmup steps: 100
+ - Weight decay: 0.01
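Reviewer note: the learning-rate settings above (peak 0.0002, 100 warmup steps, cosine scheduler) can be sketched as a linear warmup followed by a cosine decay to zero over the 1086 training steps. The formula is the common warmup-plus-cosine form and is an assumption about what the trainer ran. It does reproduce the `learning_rate` values in `train/log.json` from this commit, provided the logged rate is read as lagging the step counter by one. That one-step offset is inferred from the numbers, not stated anywhere in the README.

```python
import math

PEAK_LR, WARMUP_STEPS, TOTAL_STEPS = 2e-4, 100, 1086


def lr_at(completed_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if completed_steps < WARMUP_STEPS:
        return PEAK_LR * completed_steps / WARMUP_STEPS
    progress = (completed_steps - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


def logged_lr(step: int) -> float:
    """Rate reported at a logged step (assumed to lag by one optimizer step)."""
    return lr_at(step - 1)


for step in (10, 100, 110, 600):
    print(step, logged_lr(step))
```

For example, `logged_lr(10)` gives 1.8e-05 and `logged_lr(100)` gives 1.98e-04, matching the first and tenth training entries of the log.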
+
+ ## Training stats
+ - Date: 2026-05-27T10:31:30.454060
+ - GPU: NVIDIA A100-SXM4-40GB
+ - Peak VRAM usage: 12.152 GB
+ - Global step: 1086
+ - Training runtime (seconds): 1140.8033
+ - Average training loss: 1.5016714283994108
+ - Best validation loss: 1.2990461587905884
+
+ | Step | Training Loss | Validation Loss |
+ |------|---------------|-----------------|
+ | 54 | 1.634000 | 1.755292 |
+ | 108 | 1.783900 | 1.585450 |
+ | 162 | 1.609100 | 1.580338 |
+ | 216 | 1.534900 | 1.548727 |
+ | 270 | 1.485100 | 1.522543 |
+ | 324 | 1.549500 | 1.483723 |
+ | 378 | 1.512200 | 1.459690 |
+ | 432 | 1.451300 | 1.432863 |
+ | 486 | 1.502300 | 1.439751 |
+ | 540 | 1.376700 | 1.425881 |
+ | 594 | 1.442000 | 1.390692 |
+ | 648 | 1.365000 | 1.359873 |
+ | 702 | 1.334400 | 1.337866 |
+ | 756 | 1.376700 | 1.324850 |
+ | 810 | 1.355800 | 1.325707 |
+ | 864 | 1.327700 | 1.317618 |
+ | 918 | 1.423100 | 1.310045 |
+ | 972 | 1.400300 | 1.303569 |
+ | 1026 | 1.257700 | 1.299046 |
+ | 1080 | 1.278000 | 1.299577 |
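Reviewer note: the training stats can be cross-checked against the SFT settings with a little arithmetic. The sketch assumes a single GPU (one A100 is listed), so the effective batch is batch size times gradient accumulation steps. The 54-step evaluation interval is read off the table above, not taken from a documented setting.

```python
BATCH_SIZE, GRAD_ACCUM, GLOBAL_STEPS = 8, 2, 1086
RUNTIME_S, EVAL_EVERY = 1140.8033, 54

effective_batch = BATCH_SIZE * GRAD_ACCUM       # examples per optimizer step
max_examples = effective_batch * GLOBAL_STEPS   # upper bound; the last batch may be partial
eval_steps = list(range(EVAL_EVERY, GLOBAL_STEPS + 1, EVAL_EVERY))
steps_per_second = GLOBAL_STEPS / RUNTIME_S     # wall-clock rate over the reported runtime

print(effective_batch, max_examples, len(eval_steps), eval_steps[-1])  # 16 17376 20 1080
```

An upper bound of 17,376 examples for one epoch is consistent with the "~17K" subset size mentioned at the top of the README, and the 20 evaluation points match the 20 rows of the table.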
+
+ ## Framework versions
+ - Unsloth: 2026.5.8
+ - TRL: 0.22.2
+ - Transformers: 4.56.2
+ - PyTorch: 2.11.0+cu128
+ - Datasets: 4.8.5
+ - Tokenizers: 0.22.2
+
+ ## License
+ This model is released under the Gemma license. See the [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and [Prohibited Use Policy](https://policies.google.com/terms/generative-ai/use-policy) regarding the use of Gemma-generated content.
added_tokens.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "<image_soft_token>": 262144
+ }
chat_template.jinja ADDED
@@ -0,0 +1,50 @@
+ {# Unsloth Chat template fixes #}
+ {{ bos_token }}
+ {%- if messages[0]['role'] == 'system' -%}
+ {%- if messages[0]['content'] is string -%}
+ {%- set first_user_prefix = messages[0]['content'] + '
+
+ ' -%}
+ {%- else -%}
+ {%- set first_user_prefix = messages[0]['content'][0]['text'] + '
+
+ ' -%}
+ {%- endif -%}
+ {%- set loop_messages = messages[1:] -%}
+ {%- else -%}
+ {%- set first_user_prefix = "" -%}
+ {%- set loop_messages = messages -%}
+ {%- endif -%}
+ {%- for message in loop_messages -%}
+ {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
+ {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
+ {%- endif -%}
+ {%- if (message['role'] == 'assistant') -%}
+ {%- set role = "model" -%}
+ {%- else -%}
+ {%- set role = message['role'] -%}
+ {%- endif -%}
+ {{ '<start_of_turn>' + role + '
+ ' + (first_user_prefix if loop.first else "") }}
+ {%- if message['content'] is string -%}
+ {{ message['content'] | trim }}
+ {%- elif message['content'] is iterable -%}
+ {%- for item in message['content'] -%}
+ {%- if item['type'] == 'image' -%}
+ {{ '<start_of_image>' }}
+ {%- elif item['type'] == 'text' -%}
+ {{ item['text'] | trim }}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- elif message['content'] is defined -%}
+ {{ raise_exception("Invalid content type") }}
+ {%- endif -%}
+ {{ '<end_of_turn>
+ ' }}
+ {%- endfor -%}
+ {%- if add_generation_prompt -%}
+ {{'<start_of_turn>model
+ '}}
+ {%- endif -%}
+
+ {# Copyright 2025-present Unsloth. Apache 2.0 License. #}
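Reviewer note: the template above has no separate system turn. It folds the system message into the start of the first user turn, renames `assistant` to `model`, trims each message, and enforces strict user/assistant alternation. The function below is a hand-written Python approximation of that logic for plain-string content only; list-style content and images are left out. The name `render_gemma_chat` is hypothetical, and the `<bos>` default is taken from `special_tokens_map.json` in this commit.

```python
def render_gemma_chat(messages, add_generation_prompt=True, bos_token="<bos>"):
    """Approximate the chat template for messages whose content is a string."""
    if messages and messages[0]["role"] == "system":
        # The system text becomes a prefix of the first turn, followed by a blank line.
        first_user_prefix = messages[0]["content"] + "\n\n"
        turns = messages[1:]
    else:
        first_user_prefix, turns = "", messages

    parts = [bos_token]
    for index, message in enumerate(turns):
        # Even positions must be user turns, odd positions assistant turns.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n")
        if index == 0:
            parts.append(first_user_prefix)
        parts.append(message["content"].strip())
        parts.append("<end_of_turn>\n")
    if add_generation_prompt:
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


print(render_gemma_chat([
    {"role": "system", "content": "You are a title generator."},
    {"role": "user", "content": "refactor user service"},
]))
```

For real use, the Jinja file itself remains the reference; this sketch is only meant to make the resulting prompt layout easier to see.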
config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "_sliding_window_pattern": 6,
+   "architectures": [
+     "Gemma3ForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "attn_logit_softcapping": null,
+   "bos_token_id": 2,
+   "dtype": "bfloat16",
+   "eos_token_id": 106,
+   "final_logit_softcapping": null,
+   "head_dim": 256,
+   "hidden_activation": "gelu_pytorch_tanh",
+   "hidden_size": 640,
+   "initializer_range": 0.02,
+   "intermediate_size": 2048,
+   "layer_types": [
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 32768,
+   "model_type": "gemma3_text",
+   "num_attention_heads": 4,
+   "num_hidden_layers": 18,
+   "num_key_value_heads": 1,
+   "pad_token_id": 0,
+   "query_pre_attn_scalar": 256,
+   "rms_norm_eps": 1e-06,
+   "rope_local_base_freq": 10000.0,
+   "rope_scaling": null,
+   "rope_theta": 1000000.0,
+   "sliding_window": 512,
+   "transformers_version": "4.56.2",
+   "unsloth_fixed": true,
+   "use_bidirectional_attention": false,
+   "use_cache": true,
+   "vocab_size": 262144
+ }
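Reviewer note: the parameter count quoted in the README (268,098,176) can be recovered from this config. The sketch makes three assumptions that the config does not spell out: the output head shares weights with the token embedding, the linear layers carry no bias terms (`attention_bias` is false), and each layer has four RMSNorm weights over `hidden_size` plus query and key norms over `head_dim`. It also shows that the `layer_types` list follows the pattern of one full-attention layer in every six.

```python
cfg = {
    "vocab_size": 262144, "hidden_size": 640, "intermediate_size": 2048,
    "num_hidden_layers": 18, "num_attention_heads": 4,
    "num_key_value_heads": 1, "head_dim": 256,
}


def count_parameters(c):
    hidden, head_dim = c["hidden_size"], c["head_dim"]
    q_out = c["num_attention_heads"] * head_dim
    kv_out = c["num_key_value_heads"] * head_dim
    attention = hidden * q_out + 2 * hidden * kv_out + q_out * hidden  # q, k, v, o
    mlp = 3 * hidden * c["intermediate_size"]                          # gate, up, down
    norms = 4 * hidden + 2 * head_dim       # assumed per-layer norm layout
    embedding = c["vocab_size"] * hidden    # assumed tied with the output head
    final_norm = hidden
    return embedding + c["num_hidden_layers"] * (attention + mlp + norms) + final_norm


# Every sixth layer uses full attention; the rest use the 512-token sliding window.
layer_types = [
    "full_attention" if (i + 1) % 6 == 0 else "sliding_attention"
    for i in range(cfg["num_hidden_layers"])
]

print(count_parameters(cfg))  # 268098176
```

Most of the total, 167,772,160 weights, sits in the embedding table because of the 262,144-entry vocabulary.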
generation_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "bos_token_id": 2,
+   "cache_implementation": "hybrid",
+   "do_sample": true,
+   "eos_token_id": [
+     1,
+     106
+   ],
+   "max_length": 32768,
+   "pad_token_id": 0,
+   "top_k": 64,
+   "top_p": 0.95,
+   "transformers_version": "4.56.2"
+ }
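Reviewer note: with `do_sample` set to true, generation draws each token at random from a filtered distribution, so repeated calls can return different titles for the same input. The sketch below illustrates the idea behind the `top_k` and `top_p` settings on a plain list of logits: keep the `top_k` highest-scoring tokens, then keep the smallest prefix of those whose probability mass reaches `top_p`, and renormalise. It is an illustration of the technique, not the code path `transformers` runs, and the function name `top_k_top_p` is made up for this note.

```python
import math


def top_k_top_p(logits, top_k=64, top_p=0.95):
    """Return {token_id: probability} after top-k then top-p filtering."""
    # Keep the ids of the top_k highest-scoring tokens, best first.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = logits[ranked[0]]
    weights = [math.exp(logits[i] - peak) for i in ranked]
    total = sum(weights)

    kept, cumulative = [], 0.0
    for token_id, weight in zip(ranked, weights):
        prob = weight / total
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break

    mass = sum(prob for _, prob in kept)
    return {token_id: prob / mass for token_id, prob in kept}


example = top_k_top_p([math.log(p) for p in (0.5, 0.3, 0.15, 0.05)], top_k=3, top_p=0.8)
print(example)
```

In the small example, the fourth token is removed by top-k and the third by top-p, leaving the first two tokens with their probabilities rescaled to sum to one.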
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:79a0631e78cdf084899625dd6104d857efe7a866e624bc028c23e1f2b1ec6fbe
+ size 536223056
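Reviewer note: the three lines above are a Git LFS pointer, not the weights themselves. Each line is a key followed by a space and a value. The sketch parses the pointer and compares the recorded size with what 268,098,176 parameters would occupy at two bytes each, as the bfloat16 precision in the README suggests. The function name `parse_lfs_pointer` is hypothetical, and reading the leftover bytes as file header and metadata is an interpretation, not something the commit states.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into its version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:79a0631e78cdf084899625dd6104d857efe7a866e624bc028c23e1f2b1ec6fbe\n"
    "size 536223056\n"
)

bf16_bytes = 268_098_176 * 2             # two bytes per bfloat16 parameter
overhead = pointer["size"] - bf16_bytes  # bytes not accounted for by raw weights

print(pointer["algo"], pointer["size"], overhead)  # sha256 536223056 26704
```

The small remainder of 26,704 bytes suggests the file holds the merged bfloat16 weights and little else.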
special_tokens_map.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "boi_token": "<start_of_image>",
+   "bos_token": {
+     "content": "<bos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eoi_token": "<end_of_image>",
+   "eos_token": {
+     "content": "<end_of_turn>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "image_token": "<image_soft_token>",
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4667f2089529e8e7657cfb6d1c19910ae71ff5f28aa7ab2ff2763330affad795
+ size 33384568
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c
+ size 4689074
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
train/log.json ADDED
@@ -0,0 +1,927 @@
1
+ [
2
+ {
3
+ "loss": 4.5928,
4
+ "grad_norm": 26.88880729675293,
5
+ "learning_rate": 1.8e-05,
6
+ "epoch": 0.009208103130755065,
7
+ "step": 10
8
+ },
9
+ {
10
+ "loss": 2.6528,
11
+ "grad_norm": 17.780437469482422,
12
+ "learning_rate": 3.8e-05,
13
+ "epoch": 0.01841620626151013,
14
+ "step": 20
15
+ },
16
+ {
17
+ "loss": 2.0058,
18
+ "grad_norm": 11.103266716003418,
19
+ "learning_rate": 5.8e-05,
20
+ "epoch": 0.027624309392265192,
21
+ "step": 30
22
+ },
23
+ {
24
+ "loss": 1.933,
25
+ "grad_norm": 9.816450119018555,
26
+ "learning_rate": 7.800000000000001e-05,
27
+ "epoch": 0.03683241252302026,
28
+ "step": 40
29
+ },
30
+ {
31
+ "loss": 1.634,
32
+ "grad_norm": 10.19833755493164,
33
+ "learning_rate": 9.8e-05,
34
+ "epoch": 0.04604051565377532,
35
+ "step": 50
36
+ },
37
+ {
38
+ "eval_loss": 1.755292296409607,
39
+ "eval_runtime": 29.156,
40
+ "eval_samples_per_second": 12.176,
41
+ "eval_steps_per_second": 3.053,
42
+ "epoch": 0.049723756906077346,
43
+ "step": 54
44
+ },
45
+ {
46
+ "loss": 1.8678,
47
+ "grad_norm": 10.386731147766113,
48
+ "learning_rate": 0.000118,
49
+ "epoch": 0.055248618784530384,
50
+ "step": 60
51
+ },
52
+ {
53
+ "loss": 1.8065,
54
+ "grad_norm": 9.946635246276855,
55
+ "learning_rate": 0.000138,
56
+ "epoch": 0.06445672191528545,
57
+ "step": 70
58
+ },
59
+ {
60
+ "loss": 1.662,
61
+ "grad_norm": 7.8996901512146,
62
+ "learning_rate": 0.00015800000000000002,
63
+ "epoch": 0.07366482504604052,
64
+ "step": 80
65
+ },
66
+ {
67
+ "loss": 1.7389,
68
+ "grad_norm": 9.132104873657227,
69
+ "learning_rate": 0.00017800000000000002,
70
+ "epoch": 0.08287292817679558,
71
+ "step": 90
72
+ },
73
+ {
74
+ "loss": 1.7839,
75
+ "grad_norm": 6.549439430236816,
76
+ "learning_rate": 0.00019800000000000002,
77
+ "epoch": 0.09208103130755065,
78
+ "step": 100
79
+ },
80
+ {
81
+ "eval_loss": 1.5854495763778687,
82
+ "eval_runtime": 9.738,
83
+ "eval_samples_per_second": 36.455,
84
+ "eval_steps_per_second": 9.139,
85
+ "epoch": 0.09944751381215469,
86
+ "step": 108
87
+ },
88
+ {
89
+ "loss": 1.6674,
90
+ "grad_norm": 6.268077373504639,
91
+ "learning_rate": 0.0001999588877563566,
92
+ "epoch": 0.10128913443830571,
93
+ "step": 110
94
+ },
95
+ {
96
+ "loss": 1.7546,
97
+ "grad_norm": 6.57451868057251,
98
+ "learning_rate": 0.0001998168147576911,
99
+ "epoch": 0.11049723756906077,
100
+ "step": 120
101
+ },
102
+ {
103
+ "loss": 1.7248,
104
+ "grad_norm": 7.216021537780762,
105
+ "learning_rate": 0.00019957341762950344,
106
+ "epoch": 0.11970534069981584,
107
+ "step": 130
108
+ },
109
+ {
110
+ "loss": 1.5602,
111
+ "grad_norm": 8.38607406616211,
112
+ "learning_rate": 0.00019922894344441571,
113
+ "epoch": 0.1289134438305709,
114
+ "step": 140
115
+ },
116
+ {
117
+ "loss": 1.4979,
118
+ "grad_norm": 5.186763286590576,
119
+ "learning_rate": 0.0001987837418784522,
120
+ "epoch": 0.13812154696132597,
121
+ "step": 150
122
+ },
123
+ {
124
+ "loss": 1.6091,
125
+ "grad_norm": 5.746674060821533,
126
+ "learning_rate": 0.0001982382648560832,
127
+ "epoch": 0.14732965009208104,
128
+ "step": 160
129
+ },
130
+ {
131
+ "eval_loss": 1.5803377628326416,
132
+ "eval_runtime": 9.8217,
133
+ "eval_samples_per_second": 36.145,
134
+ "eval_steps_per_second": 9.062,
135
+ "epoch": 0.14917127071823205,
136
+ "step": 162
137
+ },
138
+ {
139
+ "loss": 1.5429,
140
+ "grad_norm": 4.997202396392822,
141
+ "learning_rate": 0.00019759306609147596,
142
+ "epoch": 0.15653775322283608,
143
+ "step": 170
144
+ },
145
+ {
146
+ "loss": 1.748,
147
+ "grad_norm": 6.101498603820801,
148
+ "learning_rate": 0.00019684880052641917,
149
+ "epoch": 0.16574585635359115,
150
+ "step": 180
151
+ },
152
+ {
153
+ "loss": 1.4566,
154
+ "grad_norm": 6.071585655212402,
155
+ "learning_rate": 0.0001960062236654908,
156
+ "epoch": 0.17495395948434622,
157
+ "step": 190
158
+ },
159
+ {
160
+ "loss": 1.7097,
161
+ "grad_norm": 4.875613689422607,
162
+ "learning_rate": 0.000195066190809145,
163
+ "epoch": 0.1841620626151013,
164
+ "step": 200
165
+ },
166
+ {
167
+ "loss": 1.5349,
168
+ "grad_norm": 7.088517665863037,
169
+ "learning_rate": 0.00019402965618549575,
170
+ "epoch": 0.19337016574585636,
171
+ "step": 210
172
+ },
173
+ {
174
+ "eval_loss": 1.5487271547317505,
175
+ "eval_runtime": 9.6971,
176
+ "eval_samples_per_second": 36.609,
177
+ "eval_steps_per_second": 9.178,
178
+ "epoch": 0.19889502762430938,
179
+ "step": 216
180
+ },
181
+ {
182
+ "loss": 1.6348,
183
+ "grad_norm": 6.564614295959473,
184
+ "learning_rate": 0.00019289767198167916,
185
+ "epoch": 0.20257826887661143,
186
+ "step": 220
187
+ },
188
+ {
189
+ "loss": 1.4628,
190
+ "grad_norm": 6.782557010650635,
191
+ "learning_rate": 0.0001916713872757776,
192
+ "epoch": 0.21178637200736647,
193
+ "step": 230
194
+ },
195
+ {
196
+ "loss": 1.4765,
197
+ "grad_norm": 4.291659355163574,
198
+ "learning_rate": 0.00019035204687038943,
199
+ "epoch": 0.22099447513812154,
200
+ "step": 240
201
+ },
202
+ {
203
+ "loss": 1.614,
204
+ "grad_norm": 4.750514984130859,
205
+ "learning_rate": 0.00018894099002902896,
206
+ "epoch": 0.2302025782688766,
207
+ "step": 250
208
+ },
209
+ {
210
+ "loss": 1.6338,
211
+ "grad_norm": 4.737514019012451,
212
+ "learning_rate": 0.00018743964911663893,
213
+ "epoch": 0.23941068139963168,
214
+ "step": 260
215
+ },
216
+ {
217
+ "loss": 1.4851,
218
+ "grad_norm": 4.858434677124023,
219
+ "learning_rate": 0.00018584954814559578,
220
+ "epoch": 0.24861878453038674,
221
+ "step": 270
222
+ },
223
+ {
224
+ "eval_loss": 1.5225430727005005,
225
+ "eval_runtime": 9.7006,
226
+ "eval_samples_per_second": 36.596,
227
+ "eval_steps_per_second": 9.175,
228
+ "epoch": 0.24861878453038674,
229
+ "step": 270
230
+ },
231
+ {
232
+ "loss": 1.4764,
233
+ "grad_norm": 5.288131237030029,
234
+ "learning_rate": 0.00018417230122868335,
235
+ "epoch": 0.2578268876611418,
236
+ "step": 280
237
+ },
238
+ {
239
+ "loss": 1.4763,
240
+ "grad_norm": 6.140823841094971,
241
+ "learning_rate": 0.00018240961094060572,
242
+ "epoch": 0.26703499079189685,
243
+ "step": 290
244
+ },
245
+ {
246
+ "loss": 1.4253,
247
+ "grad_norm": 4.705212593078613,
248
+ "learning_rate": 0.00018056326658970226,
249
+ "epoch": 0.27624309392265195,
250
+ "step": 300
251
+ },
252
+ {
253
+ "loss": 1.4912,
254
+ "grad_norm": 4.866296291351318,
255
+ "learning_rate": 0.00017863514240161932,
256
+ "epoch": 0.285451197053407,
257
+ "step": 310
258
+ },
259
+ {
260
+ "loss": 1.5495,
261
+ "grad_norm": 5.423320293426514,
262
+ "learning_rate": 0.00017662719561678216,
263
+ "epoch": 0.2946593001841621,
264
+ "step": 320
265
+ },
266
+ {
267
+ "eval_loss": 1.4837230443954468,
268
+ "eval_runtime": 9.6775,
269
+ "eval_samples_per_second": 36.683,
270
+ "eval_steps_per_second": 9.197,
271
+ "epoch": 0.2983425414364641,
272
+ "step": 324
273
+ },
274
+ {
275
+ "loss": 1.5233,
276
+ "grad_norm": 5.150054454803467,
277
+ "learning_rate": 0.00017454146450359876,
278
+ "epoch": 0.30386740331491713,
279
+ "step": 330
280
+ },
281
+ {
282
+ "loss": 1.4765,
283
+ "grad_norm": 4.734647274017334,
284
+ "learning_rate": 0.00017238006628941173,
285
+ "epoch": 0.31307550644567217,
286
+ "step": 340
287
+ },
288
+ {
289
+ "loss": 1.527,
290
+ "grad_norm": 6.608598232269287,
291
+ "learning_rate": 0.00017014519501129923,
292
+ "epoch": 0.32228360957642727,
293
+ "step": 350
294
+ },
295
+ {
296
+ "loss": 1.5924,
297
+ "grad_norm": 6.124919891357422,
298
+ "learning_rate": 0.00016783911928890618,
299
+ "epoch": 0.3314917127071823,
300
+ "step": 360
301
+ },
302
+ {
303
+ "loss": 1.5122,
304
+ "grad_norm": 5.096858978271484,
305
+ "learning_rate": 0.0001654641800215665,
306
+ "epoch": 0.3406998158379374,
307
+ "step": 370
308
+ },
309
+ {
310
+ "eval_loss": 1.4596896171569824,
311
+ "eval_runtime": 9.6761,
312
+ "eval_samples_per_second": 36.688,
313
+ "eval_steps_per_second": 9.198,
314
+ "epoch": 0.34806629834254144,
315
+ "step": 378
316
+ },
317
+ {
318
+ "loss": 1.5265,
319
+ "grad_norm": 4.963854789733887,
320
+ "learning_rate": 0.00016302278801205443,
321
+ "epoch": 0.34990791896869244,
322
+ "step": 380
323
+ },
324
+ {
325
+ "loss": 1.4015,
326
+ "grad_norm": 6.355607986450195,
327
+ "learning_rate": 0.00016051742151937655,
328
+ "epoch": 0.35911602209944754,
329
+ "step": 390
330
+ },
331
+ {
332
+ "loss": 1.5026,
333
+ "grad_norm": 4.335422039031982,
334
+ "learning_rate": 0.00015795062374308918,
335
+ "epoch": 0.3683241252302026,
336
+ "step": 400
337
+ },
338
+ {
339
+ "loss": 1.5513,
340
+ "grad_norm": 4.647507190704346,
341
+ "learning_rate": 0.00015532500024169446,
342
+ "epoch": 0.3775322283609576,
343
+ "step": 410
344
+ },
345
+ {
346
+ "loss": 1.4938,
347
+ "grad_norm": 5.171050548553467,
348
+ "learning_rate": 0.0001526432162877356,
349
+ "epoch": 0.3867403314917127,
350
+ "step": 420
351
+ },
352
+ {
353
+ "loss": 1.4513,
354
+ "grad_norm": 4.9742536544799805,
355
+ "learning_rate": 0.00014990799416227682,
356
+ "epoch": 0.39594843462246776,
357
+ "step": 430
358
+ },
359
+ {
360
+ "eval_loss": 1.4328629970550537,
361
+ "eval_runtime": 9.8993,
362
+ "eval_samples_per_second": 35.861,
363
+ "eval_steps_per_second": 8.991,
364
+ "epoch": 0.39779005524861877,
365
+ "step": 432
366
+ },
367
+ {
368
+ "loss": 1.5631,
369
+ "grad_norm": 4.611151218414307,
370
+ "learning_rate": 0.0001471221103915134,
371
+ "epoch": 0.40515653775322286,
372
+ "step": 440
373
+ },
374
+ {
375
+ "loss": 1.4699,
376
+ "grad_norm": 5.096474647521973,
377
+ "learning_rate": 0.00014428839292831801,
378
+ "epoch": 0.4143646408839779,
379
+ "step": 450
380
+ },
381
+ {
382
+ "loss": 1.3699,
383
+ "grad_norm": 4.659368515014648,
384
+ "learning_rate": 0.00014140971828158306,
385
+ "epoch": 0.42357274401473294,
386
+ "step": 460
387
+ },
388
+ {
389
+ "loss": 1.5189,
390
+ "grad_norm": 5.0686235427856445,
391
+ "learning_rate": 0.00013848900859627448,
392
+ "epoch": 0.43278084714548803,
393
+ "step": 470
394
+ },
395
+ {
396
+ "loss": 1.5023,
397
+ "grad_norm": 5.085811138153076,
398
+ "learning_rate": 0.00013552922868715988,
399
+ "epoch": 0.4419889502762431,
400
+ "step": 480
401
+ },
402
+ {
403
+ "eval_loss": 1.4397507905960083,
404
+ "eval_runtime": 10.0959,
405
+ "eval_samples_per_second": 35.163,
406
+ "eval_steps_per_second": 8.815,
407
+ "epoch": 0.44751381215469616,
408
+ "step": 486
409
+ },
410
+ {
411
+ "loss": 1.4776,
412
+ "grad_norm": 4.884566307067871,
413
+ "learning_rate": 0.00013253338302922268,
414
+ "epoch": 0.45119705340699817,
415
+ "step": 490
416
+ },
417
+ {
418
+ "loss": 1.4041,
419
+ "grad_norm": 5.0404229164123535,
420
+ "learning_rate": 0.00012950451270781727,
421
+ "epoch": 0.4604051565377532,
422
+ "step": 500
423
+ },
424
+ {
425
+ "loss": 1.4588,
426
+ "grad_norm": 5.768477916717529,
427
+ "learning_rate": 0.00012644569233166055,
428
+ "epoch": 0.4696132596685083,
429
+ "step": 510
430
+ },
431
+ {
432
+ "loss": 1.4851,
433
+ "grad_norm": 4.67483377456665,
434
+ "learning_rate": 0.0001233600269117943,
435
+ "epoch": 0.47882136279926335,
436
+ "step": 520
437
+ },
438
+ {
439
+ "loss": 1.5198,
440
+ "grad_norm": 4.369293212890625,
441
+ "learning_rate": 0.00012025064870968594,
442
+ "epoch": 0.4880294659300184,
443
+ "step": 530
444
+ },
445
+ {
446
+ "loss": 1.3767,
447
+ "grad_norm": 4.557173252105713,
448
+ "learning_rate": 0.00011712071405766735,
449
+ "epoch": 0.4972375690607735,
450
+ "step": 540
451
+ },
452
+ {
453
+ "eval_loss": 1.4258806705474854,
454
+ "eval_runtime": 9.9078,
455
+ "eval_samples_per_second": 35.83,
456
+ "eval_steps_per_second": 8.983,
457
+ "epoch": 0.4972375690607735,
458
+ "step": 540
459
+ },
460
+ {
461
+ "loss": 1.3845,
462
+ "grad_norm": 4.495442867279053,
463
+ "learning_rate": 0.00011397340015493934,
464
+ "epoch": 0.5064456721915286,
465
+ "step": 550
466
+ },
467
+ {
468
+ "loss": 1.3657,
469
+ "grad_norm": 4.659553527832031,
470
+ "learning_rate": 0.00011081190184239419,
471
+ "epoch": 0.5156537753222836,
472
+ "step": 560
473
+ },
474
+ {
475
+ "loss": 1.5381,
476
+ "grad_norm": 5.458863258361816,
477
+ "learning_rate": 0.00010763942835953012,
478
+ "epoch": 0.5248618784530387,
479
+ "step": 570
480
+ },
481
+ {
482
+ "loss": 1.3163,
483
+ "grad_norm": 4.973690986633301,
484
+ "learning_rate": 0.00010445920008674955,
485
+ "epoch": 0.5340699815837937,
486
+ "step": 580
487
+ },
488
+ {
489
+ "loss": 1.442,
490
+ "grad_norm": 4.4588236808776855,
491
+ "learning_rate": 0.00010127444527634855,
492
+ "epoch": 0.5432780847145487,
493
+ "step": 590
494
+ },
495
+ {
496
+ "eval_loss": 1.3906919956207275,
497
+ "eval_runtime": 9.8795,
498
+ "eval_samples_per_second": 35.933,
499
+ "eval_steps_per_second": 9.009,
500
+ "epoch": 0.5469613259668509,
501
+ "step": 594
502
+ },
503
+ {
504
+ "loss": 1.3652,
505
+ "grad_norm": 4.593173980712891,
506
+ "learning_rate": 9.808839677551511e-05,
507
+ "epoch": 0.5524861878453039,
508
+ "step": 600
509
+ },
510
+ {
511
+ "loss": 1.3837,
512
+ "grad_norm": 5.622270584106445,
513
+ "learning_rate": 9.490428874466344e-05,
514
+ "epoch": 0.5616942909760589,
515
+ "step": 610
516
+ },
517
+ {
518
+ "loss": 1.4205,
519
+ "grad_norm": 3.932429552078247,
520
+ "learning_rate": 9.172535337443507e-05,
521
+ "epoch": 0.570902394106814,
522
+ "step": 620
523
+ },
524
+ {
525
+ "loss": 1.4007,
526
+ "grad_norm": 4.891452312469482,
527
+ "learning_rate": 8.855481760469961e-05,
528
+ "epoch": 0.580110497237569,
529
+ "step": 630
530
+ },
531
+ {
532
+ "loss": 1.365,
533
+ "grad_norm": 5.47842264175415,
534
+ "learning_rate": 8.539589984888534e-05,
535
+ "epoch": 0.5893186003683242,
536
+ "step": 640
537
+ },
538
+ {
539
+ "eval_loss": 1.3598726987838745,
540
+ "eval_runtime": 9.786,
541
+ "eval_samples_per_second": 36.276,
542
+ "eval_steps_per_second": 9.095,
543
+ "epoch": 0.5966850828729282,
544
+ "step": 648
545
+ },
546
+ {
547
+ "loss": 1.4359,
548
+ "grad_norm": 4.814668655395508,
549
+ "learning_rate": 8.225180672696527e-05,
550
+ "epoch": 0.5985267034990792,
551
+ "step": 650
552
+ },
553
+ {
554
+ "loss": 1.4049,
555
+ "grad_norm": 4.641814231872559,
556
+ "learning_rate": 7.912572981041448e-05,
557
+ "epoch": 0.6077348066298343,
558
+ "step": 660
559
+ },
560
+ {
561
+ "loss": 1.4101,
562
+ "grad_norm": 5.327591419219971,
563
+ "learning_rate": 7.602084238244338e-05,
564
+ "epoch": 0.6169429097605893,
565
+ "step": 670
566
+ },
567
+ {
568
+ "loss": 1.3731,
569
+ "grad_norm": 4.462182998657227,
570
+ "learning_rate": 7.294029621679532e-05,
571
+ "epoch": 0.6261510128913443,
572
+ "step": 680
573
+ },
574
+ {
575
+ "loss": 1.3291,
576
+ "grad_norm": 4.188094139099121,
577
+ "learning_rate": 6.98872183783787e-05,
578
+ "epoch": 0.6353591160220995,
579
+ "step": 690
580
+ },
581
+ {
582
+ "loss": 1.3344,
583
+ "grad_norm": 4.551364898681641,
584
+ "learning_rate": 6.68647080489805e-05,
585
+ "epoch": 0.6445672191528545,
586
+ "step": 700
587
+ },
588
+ {
589
+ "eval_loss": 1.3378660678863525,
590
+ "eval_runtime": 9.8503,
591
+ "eval_samples_per_second": 36.039,
592
+ "eval_steps_per_second": 9.035,
593
+ "epoch": 0.6464088397790055,
594
+ "step": 702
595
+ },
596
+ {
597
+ "loss": 1.2724,
598
+ "grad_norm": 4.210637092590332,
599
+ "learning_rate": 6.387583338128471e-05,
600
+ "epoch": 0.6537753222836096,
601
+ "step": 710
602
+ },
603
+ {
604
+ "loss": 1.3625,
605
+ "grad_norm": 4.314669609069824,
606
+ "learning_rate": 6.092362838438772e-05,
607
+ "epoch": 0.6629834254143646,
608
+ "step": 720
609
+ },
610
+ {
611
+ "loss": 1.3277,
612
+ "grad_norm": 5.252039432525635,
613
+ "learning_rate": 5.801108984397354e-05,
614
+ "epoch": 0.6721915285451197,
615
+ "step": 730
616
+ },
617
+ {
618
+ "loss": 1.2975,
619
+ "grad_norm": 3.6448535919189453,
620
+ "learning_rate": 5.514117428027394e-05,
621
+ "epoch": 0.6813996316758748,
622
+ "step": 740
623
+ },
624
+ {
625
+ "loss": 1.3767,
626
+ "grad_norm": 4.960095405578613,
627
+ "learning_rate": 5.2316794946902533e-05,
628
+ "epoch": 0.6906077348066298,
629
+ "step": 750
630
+ },
631
+ {
632
+ "eval_loss": 1.324850082397461,
633
+ "eval_runtime": 10.0053,
634
+ "eval_samples_per_second": 35.481,
635
+ "eval_steps_per_second": 8.895,
636
+ "epoch": 0.6961325966850829,
637
+ "step": 756
638
+ },
639
+ {
640
+ "loss": 1.4546,
641
+ "grad_norm": 4.837981224060059,
642
+ "learning_rate": 4.954081887360873e-05,
643
+ "epoch": 0.6998158379373849,
644
+ "step": 760
645
+ },
646
+ {
647
+ "loss": 1.4597,
648
+ "grad_norm": 5.846377372741699,
649
+ "learning_rate": 4.681606395595325e-05,
650
+ "epoch": 0.7090239410681399,
651
+ "step": 770
652
+ },
653
+ {
654
+ "loss": 1.2981,
655
+ "grad_norm": 4.527792930603027,
656
+ "learning_rate": 4.4145296094860366e-05,
657
+ "epoch": 0.7182320441988951,
658
+ "step": 780
659
+ },
660
+ {
661
+ "loss": 1.3727,
662
+ "grad_norm": 4.371113300323486,
663
+ "learning_rate": 4.153122638894952e-05,
664
+ "epoch": 0.7274401473296501,
665
+ "step": 790
666
+ },
667
+ {
668
+ "loss": 1.3613,
669
+ "grad_norm": 3.6618120670318604,
670
+ "learning_rate": 3.8976508382496745e-05,
671
+ "epoch": 0.7366482504604052,
672
+ "step": 800
673
+ },
674
+ {
675
+ "loss": 1.3558,
676
+ "grad_norm": 4.440310478210449,
677
+ "learning_rate": 3.648373537182001e-05,
678
+ "epoch": 0.7458563535911602,
679
+ "step": 810
680
+ },
681
+ {
682
+ "eval_loss": 1.325706958770752,
683
+ "eval_runtime": 9.8213,
684
+ "eval_samples_per_second": 36.146,
685
+ "eval_steps_per_second": 9.062,
686
+ "epoch": 0.7458563535911602,
687
+ "step": 810
688
+ },
689
+ {
690
+ "loss": 1.289,
691
+ "grad_norm": 4.077640533447266,
692
+ "learning_rate": 3.40554377728219e-05,
693
+ "epoch": 0.7550644567219152,
694
+ "step": 820
695
+ },
696
+ {
697
+ "loss": 1.3105,
698
+ "grad_norm": 4.2821044921875,
699
+ "learning_rate": 3.1694080552362224e-05,
700
+ "epoch": 0.7642725598526704,
701
+ "step": 830
702
+ },
703
+ {
704
+ "loss": 1.2668,
705
+ "grad_norm": 4.420470714569092,
706
+ "learning_rate": 2.9402060726068492e-05,
707
+ "epoch": 0.7734806629834254,
708
+ "step": 840
709
+ },
+ {
+ "loss": 1.3542,
+ "grad_norm": 4.674584865570068,
+ "learning_rate": 2.7181704925123075e-05,
+ "epoch": 0.7826887661141805,
+ "step": 850
+ },
+ {
+ "loss": 1.3277,
+ "grad_norm": 4.855812072753906,
+ "learning_rate": 2.5035267034498243e-05,
+ "epoch": 0.7918968692449355,
+ "step": 860
+ },
+ {
+ "eval_loss": 1.317618489265442,
+ "eval_runtime": 9.7915,
+ "eval_samples_per_second": 36.256,
+ "eval_steps_per_second": 9.089,
+ "epoch": 0.7955801104972375,
+ "step": 864
+ },
+ {
+ "loss": 1.4031,
+ "grad_norm": 4.8453803062438965,
+ "learning_rate": 2.296492590503564e-05,
+ "epoch": 0.8011049723756906,
+ "step": 870
+ },
+ {
+ "loss": 1.3857,
+ "grad_norm": 4.110125541687012,
+ "learning_rate": 2.0972783141692898e-05,
+ "epoch": 0.8103130755064457,
+ "step": 880
+ },
+ {
+ "loss": 1.3671,
+ "grad_norm": 4.354367256164551,
+ "learning_rate": 1.9060860970202955e-05,
+ "epoch": 0.8195211786372008,
+ "step": 890
+ },
+ {
+ "loss": 1.3786,
+ "grad_norm": 4.151617050170898,
+ "learning_rate": 1.7231100184310956e-05,
+ "epoch": 0.8287292817679558,
+ "step": 900
+ },
+ {
+ "loss": 1.4231,
+ "grad_norm": 4.310825347900391,
+ "learning_rate": 1.5485358175672927e-05,
+ "epoch": 0.8379373848987108,
+ "step": 910
+ },
767
+ {
768
+ "eval_loss": 1.3100452423095703,
769
+ "eval_runtime": 9.6199,
770
+ "eval_samples_per_second": 36.903,
771
+ "eval_steps_per_second": 9.252,
772
+ "epoch": 0.8453038674033149,
773
+ "step": 918
774
+ },
+ {
+ "loss": 1.3932,
+ "grad_norm": 4.235050678253174,
+ "learning_rate": 1.382540704841604e-05,
+ "epoch": 0.8471454880294659,
+ "step": 920
+ },
+ {
+ "loss": 1.3509,
+ "grad_norm": 4.722878456115723,
+ "learning_rate": 1.2252931820274327e-05,
+ "epoch": 0.856353591160221,
+ "step": 930
+ },
+ {
+ "loss": 1.351,
+ "grad_norm": 5.233123302459717,
+ "learning_rate": 1.0769528712125731e-05,
+ "epoch": 0.8655616942909761,
+ "step": 940
+ },
+ {
+ "loss": 1.2573,
+ "grad_norm": 4.566328048706055,
+ "learning_rate": 9.376703527667063e-06,
+ "epoch": 0.8747697974217311,
+ "step": 950
+ },
+ {
+ "loss": 1.3823,
+ "grad_norm": 5.684493064880371,
+ "learning_rate": 8.075870124871353e-06,
+ "epoch": 0.8839779005524862,
+ "step": 960
+ },
+ {
+ "loss": 1.4003,
+ "grad_norm": 4.066319942474365,
+ "learning_rate": 6.868348980779593e-06,
+ "epoch": 0.8931860036832413,
+ "step": 970
+ },
+ {
+ "eval_loss": 1.303568720817566,
+ "eval_runtime": 9.8347,
+ "eval_samples_per_second": 36.097,
+ "eval_steps_per_second": 9.05,
+ "epoch": 0.8950276243093923,
+ "step": 972
+ },
+ {
+ "loss": 1.3237,
+ "grad_norm": 3.711812734603882,
+ "learning_rate": 5.7553658510833945e-06,
+ "epoch": 0.9023941068139963,
+ "step": 980
+ },
+ {
+ "loss": 1.3512,
+ "grad_norm": 4.245476722717285,
+ "learning_rate": 4.738050525859317e-06,
+ "epoch": 0.9116022099447514,
+ "step": 990
+ },
+ {
+ "loss": 1.3341,
+ "grad_norm": 3.881774425506592,
+ "learning_rate": 3.817435682718096e-06,
+ "epoch": 0.9208103130755064,
+ "step": 1000
+ },
+ {
+ "loss": 1.3007,
+ "grad_norm": 5.654109477996826,
+ "learning_rate": 2.994455838532828e-06,
+ "epoch": 0.9300184162062615,
+ "step": 1010
+ },
+ {
+ "loss": 1.2577,
+ "grad_norm": 4.331023693084717,
+ "learning_rate": 2.269946400810041e-06,
+ "epoch": 0.9392265193370166,
+ "step": 1020
+ },
+ {
+ "eval_loss": 1.2990461587905884,
+ "eval_runtime": 9.5776,
+ "eval_samples_per_second": 37.066,
+ "eval_steps_per_second": 9.293,
+ "epoch": 0.9447513812154696,
+ "step": 1026
+ },
+ {
+ "loss": 1.3371,
+ "grad_norm": 4.437676906585693,
+ "learning_rate": 1.644642819666886e-06,
+ "epoch": 0.9484346224677717,
+ "step": 1030
+ },
+ {
+ "loss": 1.3207,
+ "grad_norm": 6.489171981811523,
+ "learning_rate": 1.119179841275131e-06,
+ "epoch": 0.9576427255985267,
+ "step": 1040
+ },
+ {
+ "loss": 1.3583,
+ "grad_norm": 4.911295413970947,
+ "learning_rate": 6.940908635298283e-07,
+ "epoch": 0.9668508287292817,
+ "step": 1050
+ },
+ {
+ "loss": 1.2529,
+ "grad_norm": 4.163908004760742,
+ "learning_rate": 3.6980739459665517e-07,
+ "epoch": 0.9760589318600368,
+ "step": 1060
+ },
+ {
+ "loss": 1.4483,
+ "grad_norm": 5.159338474273682,
+ "learning_rate": 1.4665861488761813e-07,
+ "epoch": 0.9852670349907919,
+ "step": 1070
+ },
+ {
+ "loss": 1.278,
+ "grad_norm": 4.065121173858643,
+ "learning_rate": 2.4871042909768715e-08,
+ "epoch": 0.994475138121547,
+ "step": 1080
+ },
+ {
+ "eval_loss": 1.2995774745941162,
+ "eval_runtime": 9.8586,
+ "eval_samples_per_second": 36.009,
+ "eval_steps_per_second": 9.028,
+ "epoch": 0.994475138121547,
+ "step": 1080
+ },
+ {
+ "train_runtime": 1140.8033,
+ "train_samples_per_second": 15.228,
+ "train_steps_per_second": 0.952,
+ "total_flos": 7629500232960000.0,
+ "train_loss": 1.5016714283994108,
+ "epoch": 1.0,
+ "step": 1086
+ }
+ ]
train/training_loss.png ADDED
train/validation_loss.png ADDED