---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen3
- multiple-choice
- general-knowledge
- lora
- sft
- boxed-answer
---

# General Knowledge Model

This is the final General Knowledge individual model for the CS-552 Modern NLP Spring 2026 standardized project.

The submitted model is the **SFT-only merged model**. A later DPO experiment was run on ARC/CommonsenseQA mistakes, but it reduced external benchmark accuracy, so it was **not selected** as the final model.

## Model behavior

The model is specialized for multiple-choice general knowledge questions. It is prompted to output exactly one final boxed answer, for example:

    \boxed{A}

The chat template enforces concise answer-only behavior and supports choices labeled from A through T.

## Training setup

Starting point:

- Baseline working model folder with the project chat template and generation config
- LoRA SFT on top of the baseline model
- Final model produced by merging the LoRA adapter into the baseline model

Training method:

- LoRA supervised fine-tuning
- Loss masked so that only the final assistant boxed answer contributes to training
- Prompt, system message, question text, choices, chat markers, and template tokens are masked with -100
- Assistant target format: `\boxed{LETTER}`

LoRA configuration:

- r = 16
- lora_alpha = 32
- lora_dropout = 0.05
- Target modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

Main training hyperparameters:

- Learning rate: 8e-5
- Epochs: 1
- Batch size per device: 1
- Gradient accumulation steps: 8
- Max sequence length: 8192
- Precision: bf16
- Scheduler: cosine
- Warmup steps: 20

## SFT datasets

The SFT training data was built from:

1. Kaggle LLM Science
2. EduQG
3. EduAdapt, MCQ-only questions
4. NCERT_MCQs
5. SciQ train
6. OpenBookQA train

Final SFT data sizes:

- Train: 26,120
- Validation: 2,000

The answer labels were balanced uniformly across A through T, separately for train and validation.

Train answer distribution:

- A through T: 1,306 examples each

Validation answer distribution:

- A through T: 100 examples each

## Evaluation

The final selected model is the SFT-only merged model.

The "SFT validation" set in the table is the held-out validation set created from the same six dataset families used for LoRA SFT training: Kaggle LLM Science, EduQG, EduAdapt MCQ, NCERT_MCQs, SciQ, and OpenBookQA. It contains 2,000 examples and is answer-balanced across A through T.

External benchmark sets:

- MMLU Pro: 2,000 examples, uniformly sampled across categories
- MMLU Redux: 2,000 examples, uniformly sampled across subjects
- SuperGPQA: 2,000 examples, uniformly sampled across disciplines

In the table, "boxed" is the share of outputs from which a boxed answer could be extracted, and "accuracy" is the share of examples answered correctly.

| Evaluation set | Baseline boxed | Baseline accuracy | SFT-only boxed | SFT-only accuracy | SFT + DPO boxed | SFT + DPO accuracy |
|---|---:|---:|---:|---:|---:|---:|
| SFT validation 2k | 19.20% | 16.00% | 100.00% | 85.30% | 100.00% | 79.75% |
| MMLU Pro 2k | 60.25% | 18.05% | 100.00% | 37.85% | 100.00% | 35.25% |
| MMLU Redux 2k | 26.65% | 11.40% | 100.00% | 56.25% | 100.00% | 50.90% |
| SuperGPQA 2k | 66.95% | 15.85% | 99.95% | 27.55% | 100.00% | 23.45% |

DPO lowered accuracy on the SFT validation set and on all three external benchmarks relative to the SFT-only model. Therefore, the SFT-only merged model was selected as the final model.

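The two metrics reported in the table can be sketched as follows. This is a minimal illustration, not the project's `scripts/evaluate_mcq_accuracy.py`: the function names `extract_boxed` and `score` are hypothetical, and it assumes that the last `\boxed{LETTER}` in an output is taken as the answer and that outputs without a boxed answer count as incorrect.

```python
import re

# Assumption: a valid answer is \boxed{X} with X in A..T, as described in this card.
BOXED_RE = re.compile(r"\\boxed\{([A-T])\}")


def extract_boxed(text):
    """Return the letter of the last boxed answer in text, or None if there is none."""
    matches = BOXED_RE.findall(text)
    return matches[-1] if matches else None


def score(outputs, gold):
    """Return (boxed_rate, accuracy), both as fractions of all examples.

    Assumption: an output with no extractable boxed answer is scored as wrong.
    """
    if not outputs or len(outputs) != len(gold):
        raise ValueError("outputs and gold must be non-empty and the same length")
    preds = [extract_boxed(o) for o in outputs]
    boxed = sum(p is not None for p in preds)
    correct = sum(p == g for p, g in zip(preds, gold))
    return boxed / len(gold), correct / len(gold)
```

Under these assumptions, accuracy can never exceed the boxed rate, which is consistent with every row of the table.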
### SFT validation details

SFT-only evaluation on the held-out SFT validation set:

- Total: 2,000
- Extracted boxed answer: 2,000 / 2,000 = 100.00%
- Accuracy: 1,706 / 2,000 = 85.30%

Accuracy by validation source:

| Source | Accuracy | Boxed extraction |
|---|---:|---:|
| eduadapt | 82.35% (14/17) | 100.00% (17/17) |
| eduqg | 76.86% (93/121) | 100.00% (121/121) |
| kaggle_llm_science | 58.30% (130/223) | 100.00% (223/223) |
| ncert_mcqs | 93.33% (14/15) | 100.00% (15/15) |
| openbookqa | 80.83% (430/532) | 100.00% (532/532) |
| sciq | 93.86% (1025/1092) | 100.00% (1092/1092) |

### SuperGPQA boxed-answer edge case

The SFT-only model produced boxed answers for 1,999 out of 2,000 SuperGPQA examples.

The single unboxed example was a long, LaTeX-heavy numerical analysis question whose answer choices contained multi-line mathematical derivations. Instead of producing a boxed option, the model continued copying part of one answer choice, generating text beginning with:

    mathrm{d} x^{2} = 2.1730$ and $| R_{1} | ...

Increasing max_new_tokens from 20 to 64 did not change this outcome. The reported SuperGPQA result therefore keeps the strict extraction score of 99.95%.

## Expected input format

The model expects a multiple-choice question formatted like:

    Question text here?
    Choices:
    A. first option
    B. second option
    C. third option
    D. fourth option

It should answer with only:

    \boxed{A}

## Reproducibility notes

Important files from the training folder:

- SFT trainer: `scripts/train_v3_lora_sft_masked.py`
- SFT data builder: `scripts/build_my_sft_data_balanced.py`
- DPO trainer used for the unselected experiment: `scripts/train_v3_lora_dpo_boxed.py`
- Merge script: `scripts/merge_v3_lora_adapter.py`
- Evaluation script: `scripts/evaluate_mcq_accuracy.py`

Final selected model folder before upload: `outputs/lora_sft_v3_boxed_only/merged_full_model`

SFT LoRA adapter: `outputs/lora_sft_v3_boxed_only/final_adapter`

DPO adapter, experimental and not selected: `outputs/lora_dpo_arc_csqa_on_sft/final_adapter`
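
For readers who want to build prompts in the expected input format described earlier, a small helper along these lines may be useful. It is a sketch only: `format_mcq` is a hypothetical name, and the exact line breaks and spacing are an assumption, since the chat template shipped with the model is the authoritative definition of the prompt.

```python
import string

# The card states that choices may be labeled A through T (20 labels).
LETTERS = string.ascii_uppercase[:20]


def format_mcq(question, choices):
    """Build the user message: question, a 'Choices:' line, then one lettered option per line.

    Assumption: one option per line, written as 'LETTER. text'.
    """
    if not 2 <= len(choices) <= len(LETTERS):
        raise ValueError(f"expected between 2 and {len(LETTERS)} choices, got {len(choices)}")
    lines = [question.strip(), "Choices:"]
    lines += [f"{letter}. {choice.strip()}" for letter, choice in zip(LETTERS, choices)]
    return "\n".join(lines)
```

The resulting string would typically be sent as the user message through the tokenizer's chat template, so that the answer-only instruction from the template is applied.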