joelleachkar committed on
Commit
f1a22ca
·
verified ·
1 Parent(s): e85fa38

Upload final SFT boxed-only general knowledge model

Files changed (2)
  1. README.md +155 -24
  2. model.safetensors +1 -1
README.md CHANGED
@@ -1,40 +1,171 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
 
5
- ## v3 GRPO general knowledge model
6
 
7
- Updated: 2026-06-04 11:22 UTC
8
 
9
- This repository stores the final v3 GRPO general knowledge model for the CS-552 2026 Databand project.
10
 
11
- Model source on the training cluster:
12
 
13
- /scratch/general_knowledge_sft_v3_lora_grpo/outputs/grpo_v3_maxredux_4000/final
14
 
15
- The model was trained from the v3 LoRA SFT model using GRPO on the MMLU-Pro / MMLU-Redux general-knowledge data split.
16
 
17
- The final model files were verified locally before upload, including:
18
 
19
- - config.json
20
- - generation_config.json
21
- - model.safetensors
22
- - tokenizer.json
23
- - tokenizer_config.json
24
- - chat_template.jinja
25
 
26
- Important generation/config fields:
27
 
28
- - bos_token_id = 151643
29
- - eos_token_id = 151645
30
- - pad_token_id = 151643
31
- - use_cache = True
32
- - generation eos_token_id = [151645, 151643]
33
- - temperature = 0.1
34
- - top_k = 20
35
- - top_p = 0.8
36
 
37
- Expected output format:
38
 
39
- \boxed{LETTER}
40
 
1
  ---
2
  license: apache-2.0
3
+ base_model: Qwen/Qwen3-1.7B
4
+ library_name: transformers
5
+ pipeline_tag: text-generation
6
+ tags:
7
+ - qwen3
8
+ - multiple-choice
9
+ - general-knowledge
10
+ - lora
11
+ - sft
12
+ - boxed-answer
13
  ---
14
 
15
+ # General Knowledge Model
16
 
17
+ This is the final General Knowledge individual model for the CS-552 Modern NLP Spring 2026 standardized project.
18
 
19
+ The submitted model is the SFT-only merged model. A later DPO experiment was run on ARC/CommonsenseQA mistakes, but it reduced benchmark accuracy, so it was not selected as the final model.
20
 
21
+ ## Model behavior
22
 
23
+ The model is specialized for multiple-choice general knowledge questions. It is prompted to output exactly one final boxed answer, for example:
24
 
25
+ \boxed{A}
26
 
27
+ The chat template enforces concise answer-only behavior and supports choices labeled from A through T.
28
 
29
+ ## Training setup
30
 
31
+ Starting point:
32
 
33
+ - Baseline working model folder with the project chat template and generation config
34
+ - LoRA SFT on top of the baseline model
35
+ - Final model produced by merging the LoRA adapter into the baseline model
36
 
37
+ Training method:
38
 
39
+ - LoRA supervised fine-tuning
40
+ - Loss masked so that only the final assistant boxed answer contributes to training
41
+ - Prompt, system message, question text, choices, chat markers, and template tokens are masked with -100
42
+ - Assistant target format: \boxed{LETTER}
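The masking rule above can be sketched as follows. This is a minimal illustration, not the code in `scripts/train_v3_lora_sft_masked.py`: the helper name `mask_labels` and the toy token ids are hypothetical, and the only facts taken from the README are that non-answer tokens are labeled -100 and that the boxed answer is the sole training target.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default


def mask_labels(prompt_ids, answer_ids):
    """Return (input_ids, labels) where only the answer tokens carry a loss.

    prompt_ids: token ids for system message, question, choices, chat markers.
    answer_ids: token ids for the assistant target, e.g. the boxed letter.
    """
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Toy ids, purely illustrative (not real tokenizer output).
prompt_ids = [11, 12, 13, 14]
answer_ids = [21, 22, 23]
input_ids, labels = mask_labels(prompt_ids, answer_ids)
print(input_ids)
print(labels)
```

Because every prompt position is set to -100, the gradient comes only from predicting the answer tokens, which is what keeps the fine-tuned model in answer-only mode.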
43
 
44
+ LoRA configuration:
45
+
46
+ - r = 16
47
+ - lora_alpha = 32
48
+ - lora_dropout = 0.05
49
+ - Target modules:
50
+   - q_proj
51
+   - k_proj
52
+   - v_proj
53
+   - o_proj
54
+   - gate_proj
55
+   - up_proj
56
+   - down_proj
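For reference, the values above can be collected into a plain dictionary. This sketch assumes the adapter was trained with the PEFT library, which the README does not state; if so, the dictionary could be passed as `peft.LoraConfig(**lora_kwargs)`, typically together with a `task_type` that is not listed here. The name `lora_kwargs` is hypothetical.

```python
# LoRA settings copied from the list above.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)
```

With r = 16 and lora_alpha = 32, the adapter update is scaled by a factor of 2 under the standard (non rank-stabilized) LoRA formulation.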
57
+
58
+ Main training hyperparameters:
59
+
60
+ - Learning rate: 8e-5
61
+ - Epochs: 1
62
+ - Batch size per device: 1
63
+ - Gradient accumulation steps: 8
64
+ - Max sequence length: 8192
65
+ - Precision: bf16
66
+ - Scheduler: cosine
67
+ - Warmup steps: 20
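As a rough worked example, these settings imply the following effective batch size and optimizer step count. The training-set size (26,120 rows) is the figure reported in this README's dataset section; the single-device setting is an assumption, since the README does not say how many GPUs were used.

```python
import math

train_rows = 26_120      # from the README's "Final SFT data sizes"
per_device_batch = 1
grad_accum_steps = 8
num_devices = 1          # assumption: not stated in the README
epochs = 1
warmup_steps = 20

# Sequences contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Optimizer updates over the whole run.
total_steps = math.ceil(train_rows / effective_batch) * epochs

# Share of the run spent in learning-rate warmup.
warmup_fraction = warmup_steps / total_steps

print(effective_batch, total_steps, round(warmup_fraction, 4))
```

Under the single-device assumption, one epoch is 3,265 optimizer steps with an effective batch of 8, so the 20 warmup steps cover well under 1% of training before the cosine decay takes over.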
68
+
69
+ ## SFT datasets
70
+
71
+ The SFT training data was built from:
72
+
73
+ 1. Kaggle LLM Science
74
+ 2. EduQG
75
+ 3. EduAdapt, MCQ-only questions
76
+ 4. NCERT_MCQs
77
+ 5. SciQ train
78
+ 6. OpenBookQA train
79
+
80
+ The final SFT dataset was capped below 30,000 training rows.
81
+
82
+ Final SFT data sizes:
83
+
84
+ - Train: 26,120
85
+ - Validation: 2,000
86
+
87
+ The answer labels were balanced uniformly across A through T separately for train and validation.
88
+
89
+ Train answer distribution:
90
+
91
+ - A through T: 1,306 examples each
92
+
93
+ Validation answer distribution:
94
+
95
+ - A through T: 100 examples each
96
+
97
+ ## Evaluation
98
+
99
+ The final selected model is the SFT-only merged model.
100
+
101
+ Evaluation sets:
102
+
103
+ - Validation set: 2,000
104
+ - MMLU Pro: 2,000, uniformly sampled across categories
105
+ - MMLU Redux: 2,000, uniformly sampled across subjects
106
+ - SuperGPQA: 2,000, uniformly sampled across disciplines
107
+
108
+ SFT-only results:
109
+
110
+ | Benchmark | Boxed extraction | Accuracy |
111
+ |---|---:|---:|
112
+ | MMLU Pro 2k | 100.00% | 37.85% |
113
+ | MMLU Redux 2k | 100.00% | 56.25% |
114
+ | SuperGPQA 2k | 99.95% | 27.55% |
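The two columns in the table above can be computed along the following lines. This is a sketch, not the contents of `scripts/evaluate_mcq_accuracy.py`: the function names are hypothetical, and two choices are assumptions, namely taking the last boxed letter in the output and counting outputs with no extractable answer as wrong.

```python
import re

# Matches \boxed{X} where X is a single label from A through T.
BOXED_RE = re.compile(r"\\boxed\{([A-T])\}")


def extract_boxed(text):
    """Return the last boxed letter in text, or None if there is none."""
    matches = BOXED_RE.findall(text)
    return matches[-1] if matches else None


def score(outputs, gold):
    """Return (extraction_rate, accuracy) over paired outputs and gold labels."""
    preds = [extract_boxed(o) for o in outputs]
    extracted = sum(p is not None for p in preds)
    correct = sum(p == g for p, g in zip(preds, gold))
    n = len(gold)
    return extracted / n, correct / n


# Toy example: one output has no boxed answer, one is boxed but wrong.
outputs = ["\\boxed{A}", "I think B", "\\boxed{C}", "\\boxed{T}"]
gold = ["A", "B", "D", "T"]
extraction_rate, accuracy = score(outputs, gold)
print(extraction_rate, accuracy)
```

On a 2,000-question set, a 99.95% extraction rate corresponds to a single output without a parseable boxed answer.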
115
+
116
+ Baseline comparison:
117
+
118
+ | Benchmark | Baseline accuracy | SFT-only accuracy |
119
+ |---|---:|---:|
120
+ | Validation 2k | 16.00% | not logged |
121
+ | MMLU Pro 2k | 18.05% | 37.85% |
122
+ | MMLU Redux 2k | 11.40% | 56.25% |
123
+ | SuperGPQA 2k | 15.85% | 27.55% |
124
+
125
+ DPO experiment, not selected:
126
+
127
+ | Benchmark | SFT + DPO accuracy |
128
+ |---|---:|
129
+ | Validation 2k | 79.75% |
130
+ | MMLU Pro 2k | 35.25% |
131
+ | MMLU Redux 2k | 50.90% |
132
+ | SuperGPQA 2k | 23.45% |
133
+
134
+ DPO improved the internal validation metric but reduced the external benchmark scores, so the SFT-only model was selected.
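The size of the regression can be read directly off the two tables. The short calculation below uses only the accuracies reported above; the variable names are illustrative.

```python
# Accuracies (%) copied from the SFT-only and SFT + DPO tables.
sft = {"MMLU Pro 2k": 37.85, "MMLU Redux 2k": 56.25, "SuperGPQA 2k": 27.55}
sft_dpo = {"MMLU Pro 2k": 35.25, "MMLU Redux 2k": 50.90, "SuperGPQA 2k": 23.45}

# Percentage points lost by adding DPO on top of SFT.
drop = {name: round(sft[name] - sft_dpo[name], 2) for name in sft}

for name, points in drop.items():
    print(f"{name}: -{points:.2f} points")
```

DPO costs 2.60, 5.35, and 4.10 percentage points on MMLU Pro, MMLU Redux, and SuperGPQA respectively, which is the basis for keeping the SFT-only model.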
135
+
136
+ ## Expected input format
137
+
138
+ The model expects a multiple-choice question formatted like:
139
+
140
+ Question text here?
141
+
142
+ Choices:
143
+ A. first option
144
+ B. second option
145
+ C. third option
146
+ D. fourth option
147
+
148
+ It should answer with only:
149
+
150
+ \boxed{A}
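A small helper can build a user message in this layout. This is a sketch: `format_mcq` is a hypothetical name, and the system message and chat markers are assumed to come from the repository's chat template rather than from this function.

```python
import string

LABELS = string.ascii_uppercase[:20]  # "A" through "T", the supported labels


def format_mcq(question, choices):
    """Format a question and its choices in the layout shown above."""
    if not 1 <= len(choices) <= len(LABELS):
        raise ValueError(f"expected 1 to {len(LABELS)} choices, got {len(choices)}")
    lines = [question, "", "Choices:"]
    lines += [f"{label}. {choice}" for label, choice in zip(LABELS, choices)]
    return "\n".join(lines)


prompt = format_mcq(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Mars", "Earth"],
)
print(prompt)
```

The resulting string would be passed as the user turn; the model is then expected to reply with a single boxed letter.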
151
+
152
+ ## Reproducibility notes
153
+
154
+ Important files from the training folder:
155
+
156
+ - SFT trainer: scripts/train_v3_lora_sft_masked.py
157
+ - SFT data builder: scripts/build_my_sft_data_balanced.py
158
+ - Merge script: scripts/merge_v3_lora_adapter.py
159
+ - Evaluation script: scripts/evaluate_mcq_accuracy.py
160
+
161
+ Final selected model folder before upload:
162
+
163
+ outputs/lora_sft_v3_boxed_only/merged_full_model
164
+
165
+ SFT LoRA adapter:
166
+
167
+ outputs/lora_sft_v3_boxed_only/final_adapter
168
+
169
+ DPO adapter, experimental and not selected:
170
+
171
+ outputs/lora_dpo_arc_csqa_on_sft/final_adapter
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:813eaad9f372af34c6dbe827ef83b2f6fa4242f384b4b4513f0aa5030b9e5e20
3
  size 3441185608
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bbac80fd49d664b0096daf5c93346809d1275bde1375705d0bda731204b5ab90
3
  size 3441185608