---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen3
- multiple-choice
- general-knowledge
- lora
- sft
- boxed-answer
---

# General Knowledge Model

This is the final General Knowledge individual model for the CS-552 Modern NLP Spring 2026 standardized project.

The submitted model is the **SFT-only merged model**. A later DPO experiment was run on ARC/CommonsenseQA mistakes, but it reduced external benchmark accuracy, so it was **not selected** as the final model.

## Model behavior

The model is specialized for multiple-choice general knowledge questions. It is prompted to output exactly one final boxed answer, for example:

`\boxed{A}`

The chat template enforces concise answer-only behavior and supports choices labeled from A through T.

## Training setup

Starting point:

- Baseline working model folder with the project chat template and generation config
- LoRA SFT on top of the baseline model
- Final model produced by merging the LoRA adapter into the baseline model
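The merge step can be sketched roughly as follows. This is an illustration, not the project's `scripts/merge_v3_lora_adapter.py`: the function name `merge_lora_adapter` and its three directory arguments are hypothetical, and the sketch assumes the `peft` and `transformers` packages are installed.

```python
def merge_lora_adapter(base_dir, adapter_dir, output_dir):
    """Fold LoRA weights into the base model and save a standalone checkpoint.

    All three arguments are hypothetical local paths chosen by the caller.
    """
    # Imported lazily so this file can be loaded without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_dir)
    model = PeftModel.from_pretrained(base, adapter_dir)

    # merge_and_unload() returns the base model with the LoRA deltas applied.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)

    # Save the tokenizer (and its chat template) next to the merged weights.
    AutoTokenizer.from_pretrained(base_dir).save_pretrained(output_dir)
```

After merging, the output folder can be loaded with `AutoModelForCausalLM.from_pretrained` alone, without `peft`.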

Training method:

- LoRA supervised fine-tuning
- Loss masked so that only the final assistant boxed answer contributes to training
- Prompt, system message, question text, choices, chat markers, and template tokens are masked with `-100`
- Assistant target format: `\boxed{LETTER}`
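The masking rule above can be illustrated with a minimal sketch. The helper `build_masked_labels` and the toy token ids are hypothetical; the actual trainer script may build its labels differently. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it do not contribute to the loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy loss


def build_masked_labels(prompt_ids, answer_ids):
    """Return (input_ids, labels) where only the answer tokens carry a label."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Toy token ids; real ids would come from the tokenizer and chat template.
prompt_ids = [101, 7, 8, 9, 102]  # system message, question, choices, chat markers
answer_ids = [55, 56, 57]         # tokens of the boxed answer
input_ids, labels = build_masked_labels(prompt_ids, answer_ids)
```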

LoRA configuration:

- r = 16
- lora_alpha = 32
- lora_dropout = 0.05
- Target modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
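As a rough sketch, the configuration above could be written as keyword arguments for `peft.LoraConfig`. The `task_type` entry is an assumption, since this card does not state it. The last lines compute the standard LoRA scaling factor `lora_alpha / r` that these values imply.

```python
# Keyword names follow peft.LoraConfig; values are copied from the list above.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in this card
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(lora_scaling)  # 2.0
```

With `peft` installed, `LoraConfig(**lora_kwargs)` would build the corresponding config object.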

Main training hyperparameters:

- Learning rate: 8e-5
- Epochs: 1
- Batch size per device: 1
- Gradient accumulation steps: 8
- Max sequence length: 8192
- Precision: bf16
- Scheduler: cosine
- Warmup steps: 20
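The hyperparameters above imply an effective batch size and an approximate number of optimizer steps. The sketch below works this out under the assumption of a single training device, which this card does not state. The training set size is taken from the SFT data sizes reported in this card.

```python
import math

per_device_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1          # assumption: device count is not stated in this card
train_examples = 26_120  # SFT training set size reported in this card
epochs = 1
warmup_steps = 20

# Examples consumed per optimizer update.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices

# Optimizer updates over the whole run.
optimizer_steps = math.ceil(train_examples / effective_batch) * epochs

# Share of the run spent in warmup.
warmup_fraction = warmup_steps / optimizer_steps

print(effective_batch, optimizer_steps)  # 8 3265
```

Under these assumptions, warmup covers well under 1% of the run, and the cosine schedule governs the rest.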

## SFT datasets

The SFT training data was built from:

1. Kaggle LLM Science
2. EduQG
3. EduAdapt, MCQ-only questions
4. NCERT_MCQs
5. SciQ train
6. OpenBookQA train

Final SFT data sizes:

- Train: 26,120
- Validation: 2,000

The answer labels were balanced uniformly across A through T separately for train and validation.

Train answer distribution:

- A through T: 1,306 examples each

Validation answer distribution:

- A through T: 100 examples each
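A balance claim like this is easy to check. The sketch below is a hypothetical helper, not part of the project's data builder; it confirms that every letter from A through T appears exactly the expected number of times.

```python
import string
from collections import Counter

LETTERS = string.ascii_uppercase[:20]  # "A" through "T"


def is_balanced(answers, per_letter):
    """True if each letter A..T occurs exactly `per_letter` times and no other label appears."""
    counts = Counter(answers)
    only_valid_labels = set(counts) <= set(LETTERS)
    return only_valid_labels and all(counts[letter] == per_letter for letter in LETTERS)


# Toy stand-in for the validation labels: 100 of each letter.
toy_answers = [letter for letter in LETTERS for _ in range(100)]
print(is_balanced(toy_answers, 100))  # True
```

The reported sizes are consistent with this scheme: 20 letters times 1,306 examples gives 26,120 training examples, and 20 times 100 gives 2,000 validation examples.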

## Evaluation

The final selected model is the SFT-only merged model.

The “SFT validation” set in the table is the held-out validation set created from the same six dataset families used for LoRA SFT training: Kaggle LLM Science, EduQG, EduAdapt MCQ, NCERT_MCQs, SciQ, and OpenBookQA. It contains 2,000 examples and is answer-balanced across A through T.

External benchmark sets:

- MMLU Pro: 2,000 examples, uniformly sampled across categories
- MMLU Redux: 2,000 examples, uniformly sampled across subjects
- SuperGPQA: 2,000 examples, uniformly sampled across disciplines
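This card does not say how the uniform sampling was implemented. One possible reading is an equal quota per category, with any remainder spread over the first few categories. The sketch below illustrates that reading only; the function `sample_uniform_by_group` and the row layout are hypothetical.

```python
import random
from collections import defaultdict


def sample_uniform_by_group(rows, key, total, seed=0):
    """Draw `total` rows with near-equal counts per group.

    Assumes every group holds at least as many rows as its quota.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    names = sorted(groups)
    base, extra = divmod(total, len(names))

    sample = []
    for i, name in enumerate(names):
        # The first `extra` groups receive one additional row.
        quota = base + (1 if i < extra else 0)
        sample.extend(rng.sample(groups[name], quota))
    return sample
```

For example, calling it with `key="category"` and `total=2000` on a benchmark with 14 categories would draw 143 rows from each of the first 12 categories and 142 from the remaining two.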

| Evaluation set | Baseline boxed rate | Baseline accuracy | SFT-only boxed rate | SFT-only accuracy | SFT + DPO boxed rate | SFT + DPO accuracy |
|---|---:|---:|---:|---:|---:|---:|
| SFT validation 2k | 19.20% | 16.00% | 100.00% | 85.30% | 100.00% | 79.75% |
| MMLU Pro 2k | 60.25% | 18.05% | 100.00% | 37.85% | 100.00% | 35.25% |
| MMLU Redux 2k | 26.65% | 11.40% | 100.00% | 56.25% | 100.00% | 50.90% |
| SuperGPQA 2k | 66.95% | 15.85% | 99.95% | 27.55% | 100.00% | 23.45% |

DPO improved neither the SFT validation accuracy (79.75% versus 85.30% for SFT-only) nor the accuracy on any external benchmark. Therefore, the SFT-only merged model was selected as the final model.

### SFT validation details

SFT-only evaluation on the held-out SFT validation set:

- Total: 2,000
- Extracted boxed answer: 2,000 / 2,000 = 100.00%
- Accuracy: 1,706 / 2,000 = 85.30%
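The two metrics can be computed along the following lines. This is a minimal sketch, not the project's `scripts/evaluate_mcq_accuracy.py`: the regular expression, the choice to take the last boxed letter, and the function names are assumptions. Outputs without a boxed letter count as incorrect, which matches the use of the full example count as the accuracy denominator.

```python
import re

# Matches \boxed{X} where X is a single letter from A through T.
BOXED_RE = re.compile(r"\\boxed\{([A-T])\}")


def extract_boxed(text):
    """Return the last boxed letter in `text`, or None if there is none."""
    matches = BOXED_RE.findall(text)
    return matches[-1] if matches else None


def score(outputs, gold):
    """Return (boxed_rate, accuracy) over all examples."""
    preds = [extract_boxed(output) for output in outputs]
    boxed = sum(pred is not None for pred in preds)
    correct = sum(pred == answer for pred, answer in zip(preds, gold))
    total = len(gold)
    return boxed / total, correct / total


# Toy outputs: one correct, one wrong, one with no boxed answer.
outputs = [r"\boxed{A}", r"\boxed{C}", "some unboxed continuation"]
gold = ["A", "B", "C"]
boxed_rate, accuracy = score(outputs, gold)
```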

Accuracy by validation source:

| Source | Accuracy | Boxed extraction |
|---|---:|---:|
| eduadapt | 82.35% (14/17) | 100.00% (17/17) |
| eduqg | 76.86% (93/121) | 100.00% (121/121) |
| kaggle_llm_science | 58.30% (130/223) | 100.00% (223/223) |
| ncert_mcqs | 93.33% (14/15) | 100.00% (15/15) |
| openbookqa | 80.83% (430/532) | 100.00% (532/532) |
| sciq | 93.86% (1025/1092) | 100.00% (1092/1092) |

### SuperGPQA boxed-answer edge case

The SFT-only model produced boxed answers for 1,999 out of 2,000 SuperGPQA examples. The single unboxed example was a long, LaTeX-heavy numerical analysis question whose answer choices contained multi-line mathematical derivations. Instead of producing a boxed option, the model copied and continued part of one answer choice, generating text beginning with:

`mathrm{d} x^{2} = 2.1730$ and $| R_{1} | ...`

Increasing `max_new_tokens` from 20 to 64 did not change this outcome. The reported SuperGPQA boxed rate therefore stays at the strict extraction score of 99.95%.

## Expected input format

The model expects a multiple-choice question formatted like:

    Question text here?

    Choices:
    A. first option
    B. second option
    C. third option
    D. fourth option

It should answer with only:

`\boxed{A}`
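A prompt in the layout shown above could be assembled with a small helper such as the one below. The function `format_mcq` and the sample question are hypothetical; the sketch assumes at most 20 choices, matching the A through T label range.

```python
import string


def format_mcq(question, choices):
    """Render a question and its options in the expected input layout."""
    if not 1 <= len(choices) <= 20:
        raise ValueError("expected between 1 and 20 choices (labels A through T)")
    lines = [question, "", "Choices:"]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)


prompt = format_mcq(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Mars", "Earth"],
)
print(prompt)
```

The resulting string would then be passed as the user message through the tokenizer's chat template, for example with `tokenizer.apply_chat_template`.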

## Reproducibility notes

Important files from the training folder:

- SFT trainer: scripts/train_v3_lora_sft_masked.py
- SFT data builder: scripts/build_my_sft_data_balanced.py
- DPO trainer used for the unselected experiment: scripts/train_v3_lora_dpo_boxed.py
- Merge script: scripts/merge_v3_lora_adapter.py
- Evaluation script: scripts/evaluate_mcq_accuracy.py

Final selected model folder before upload:

outputs/lora_sft_v3_boxed_only/merged_full_model

SFT LoRA adapter:

outputs/lora_sft_v3_boxed_only/final_adapter

DPO adapter, experimental and not selected:

outputs/lora_dpo_arc_csqa_on_sft/final_adapter