General Knowledge Model

This is the final General Knowledge individual model for the CS-552 Modern NLP Spring 2026 standardized project.

The submitted model is the SFT-only merged model. A later DPO experiment was run on ARC/CommonsenseQA mistakes, but it reduced external benchmark accuracy, so it was not selected as the final model.

Model behavior

The model is specialized for multiple-choice general knowledge questions. It is prompted to output exactly one final boxed answer, for example:

\boxed{A}

The chat template enforces concise answer-only behavior and supports choices labeled from A through T.
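A minimal sketch of how a boxed answer could be parsed from the model output. The regular expression and the helper name `extract_boxed_letter` are illustrative assumptions, not the project's evaluation code; the sketch takes the last boxed letter in the text and accepts only labels A through T.

```python
import re
from typing import Optional

# Assumption: a valid answer is a single letter A..T inside \boxed{...}.
BOXED_RE = re.compile(r"\\boxed\{\s*([A-T])\s*\}")

def extract_boxed_letter(text: str) -> Optional[str]:
    """Return the letter of the last boxed answer in text, or None if there is none."""
    matches = BOXED_RE.findall(text)
    return matches[-1] if matches else None

print(extract_boxed_letter(r"\boxed{A}"))            # A
print(extract_boxed_letter("no boxed answer here"))  # None
```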

Training setup

Starting point:

  • Baseline working model folder with the project chat template and generation config
  • LoRA SFT on top of the baseline model
  • Final model produced by merging the LoRA adapter into the baseline model

Training method:

  • LoRA supervised fine-tuning
  • Loss masked so that only the final assistant boxed answer contributes to training
  • Prompt, system message, question text, choices, chat markers, and template tokens are masked with -100
  • Assistant target format: \boxed{LETTER}
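The masking described above can be sketched as follows. This is an illustration on toy token ids, not the contents of the project's trainer script; the helper name `build_masked_labels` and the ids are hypothetical. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labeled -100 do not contribute to the loss. Template tokens that follow the answer would be masked the same way.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default

def build_masked_labels(prompt_ids: List[int], answer_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and answer tokens; supervise only the answer tokens."""
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels

# Toy token ids (hypothetical; a real run would use the model tokenizer).
prompt_ids = [101, 7, 8, 9, 102]  # system message, question, choices, chat markers
answer_ids = [55, 56, 57]         # tokens of the boxed answer
input_ids, labels = build_masked_labels(prompt_ids, answer_ids)
print(labels)  # [-100, -100, -100, -100, -100, 55, 56, 57]
```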

LoRA configuration:

  • r = 16
  • lora_alpha = 32
  • lora_dropout = 0.05
  • Target modules:
    • q_proj
    • k_proj
    • v_proj
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
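For reference, the configuration above can be written as a plain dictionary. The key names match keyword arguments of `LoraConfig` in the PEFT library, but this card does not state which library was used, so treat the layout as an assumption. The last lines compute the scaling factor that standard LoRA applies to the low-rank update.

```python
# Values come from the list above; the dict layout is an assumption.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 2.0
```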

Main training hyperparameters:

  • Learning rate: 8e-5
  • Epochs: 1
  • Batch size per device: 1
  • Gradient accumulation steps: 8
  • Max sequence length: 8192
  • Precision: bf16
  • Scheduler: cosine
  • Warmup steps: 20
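As a rough sketch of what these settings imply, the snippet below computes the effective batch size and the number of optimizer steps in one epoch, then evaluates a warmup-plus-cosine learning rate. It assumes a single device, linear warmup, and decay to zero, none of which is stated in this card; the train size of 26,120 is the figure reported under the SFT data sizes.

```python
import math

TRAIN_EXAMPLES = 26_120  # train size reported in this card
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
NUM_DEVICES = 1          # assumption: the device count is not stated
PEAK_LR = 8e-5
WARMUP_STEPS = 20

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
total_steps = math.ceil(TRAIN_EXAMPLES / effective_batch)  # one epoch

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, total_steps)  # 8 3265
```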

SFT datasets

The SFT training data was built from:

  1. Kaggle LLM Science
  2. EduQG
  3. EduAdapt, MCQ-only questions
  4. NCERT_MCQs
  5. SciQ train
  6. OpenBookQA train

Final SFT data sizes:

  • Train: 26,120
  • Validation: 2,000

The answer labels were balanced uniformly across A through T separately for train and validation.

Train answer distribution:

  • A through T: 1,306 examples each

Validation answer distribution:

  • A through T: 100 examples each
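A small check of the balance described above. The helper `is_uniform` is hypothetical and only verifies a label distribution; it does not reproduce how the data builder assigned answer positions, which this card does not describe.

```python
from collections import Counter
from string import ascii_uppercase
from typing import Iterable

LABELS = ascii_uppercase[:20]  # "ABCDEFGHIJKLMNOPQRST"

def is_uniform(answers: Iterable[str], per_label: int) -> bool:
    """True if every label A..T occurs exactly per_label times and no other label occurs."""
    counts = Counter(answers)
    return set(counts) == set(LABELS) and all(counts[label] == per_label for label in LABELS)

# Reported sizes: 26,120 train and 2,000 validation examples over 20 labels.
print(26_120 // len(LABELS), 2_000 // len(LABELS))  # 1306 100
```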

Evaluation

The final selected model is the SFT-only merged model.

The “SFT validation” set in the table is the held-out validation set created from the same six dataset families used for LoRA SFT training: Kaggle LLM Science, EduQG, EduAdapt MCQ, NCERT_MCQs, SciQ, and OpenBookQA. It contains 2,000 examples and is answer-balanced across A through T.

External benchmark sets:

  • MMLU Pro: 2,000 examples, uniformly sampled across categories
  • MMLU Redux: 2,000 examples, uniformly sampled across subjects
  • SuperGPQA: 2,000 examples, uniformly sampled across disciplines
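
| Evaluation set    | Baseline boxed | Baseline accuracy | SFT-only boxed | SFT-only accuracy | SFT + DPO boxed | SFT + DPO accuracy |
|-------------------|----------------|-------------------|----------------|-------------------|-----------------|--------------------|
| SFT validation 2k | 19.20%         | 16.00%            | 100.00%        | 85.30%            | 100.00%         | 79.75%             |
| MMLU Pro 2k       | 60.25%         | 18.05%            | 100.00%        | 37.85%            | 100.00%         | 35.25%             |
| MMLU Redux 2k     | 26.65%         | 11.40%            | 100.00%        | 56.25%            | 100.00%         | 50.90%             |
| SuperGPQA 2k      | 66.95%         | 15.85%            | 99.95%         | 27.55%            | 100.00%         | 23.45%             |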

The DPO experiment lowered accuracy on the SFT validation set (85.30% to 79.75%) and on all three external benchmarks. Therefore, the SFT-only merged model was selected as the final model.

SFT validation details

SFT-only evaluation on the held-out SFT validation set:

  • Total: 2,000
  • Extracted boxed answer: 2,000 / 2,000 = 100.00%
  • Accuracy: 1,706 / 2,000 = 85.30%
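The two figures above can be computed along these lines. This is a sketch, not the project's evaluation script: the function name `score` and the toy outputs are hypothetical, and it assumes that an output without a boxed answer counts as incorrect, with both rates taken over all examples.

```python
import re
from typing import List, Tuple

BOXED = re.compile(r"\\boxed\{\s*([A-T])\s*\}")

def score(outputs: List[str], gold: List[str]) -> Tuple[float, float]:
    """Return (boxed extraction rate, accuracy), both as fractions of all examples."""
    boxed = correct = 0
    for text, answer in zip(outputs, gold):
        found = BOXED.findall(text)
        if found:
            boxed += 1
            correct += found[-1] == answer  # unboxed outputs never count as correct
    return boxed / len(gold), correct / len(gold)

outputs = [r"\boxed{A}", r"\boxed{B}", "text without a boxed answer", r"\boxed{D}"]
gold = ["A", "C", "C", "D"]
print(score(outputs, gold))  # (0.75, 0.5)
```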

Accuracy by validation source:

| Source             | Accuracy           | Boxed extraction    |
|--------------------|--------------------|---------------------|
| eduadapt           | 82.35% (14/17)     | 100.00% (17/17)     |
| eduqg              | 76.86% (93/121)    | 100.00% (121/121)   |
| kaggle_llm_science | 58.30% (130/223)   | 100.00% (223/223)   |
| ncert_mcqs         | 93.33% (14/15)     | 100.00% (15/15)     |
| openbookqa         | 80.83% (430/532)   | 100.00% (532/532)   |
| sciq               | 93.86% (1025/1092) | 100.00% (1092/1092) |
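As a consistency check, the per-source counts can be summed to recover the overall figures. The counts below are copied from the rows above; the dictionary layout is only for illustration.

```python
# (correct, total) per validation source, copied from the reported results.
per_source = {
    "eduadapt": (14, 17),
    "eduqg": (93, 121),
    "kaggle_llm_science": (130, 223),
    "ncert_mcqs": (14, 15),
    "openbookqa": (430, 532),
    "sciq": (1025, 1092),
}

for name, (correct, total) in per_source.items():
    print(f"{name}: {100 * correct / total:.2f}%")

total_correct = sum(c for c, _ in per_source.values())
total_examples = sum(t for _, t in per_source.values())
print(total_correct, total_examples)  # 1706 2000
```

The totals match the overall 1,706 / 2,000 = 85.30% reported for the SFT-only model.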

SuperGPQA boxed-answer edge case

The SFT-only model produced boxed answers for 1,999 out of 2,000 SuperGPQA examples. The single unboxed example was a long, LaTeX-heavy numerical analysis question whose answer choices contained multi-line mathematical derivations. Instead of producing a boxed option, the model copied and continued part of one answer choice, generating text beginning with:

mathrm{d} x^{2} = 2.1730$ and $| R_{1} | ...

Increasing max_new_tokens from 20 to 64 did not change this outcome. The reported SuperGPQA result therefore keeps the strict extraction score of 99.95%.

Expected input format

The model expects a multiple-choice question formatted like:

Question text here?

Choices: A. first option B. second option C. third option D. fourth option

It should answer with only:

\boxed{A}
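A minimal sketch of a prompt builder for this layout. The function `format_mcq` is hypothetical, and the exact whitespace (all choices on one line after "Choices:") mirrors the example as rendered here, which may differ from the format used in training.

```python
from string import ascii_uppercase
from typing import Sequence

MAX_CHOICES = 20  # labels A through T

def format_mcq(question: str, choices: Sequence[str]) -> str:
    """Render a question and its labeled choices as a single prompt string."""
    if not 1 <= len(choices) <= MAX_CHOICES:
        raise ValueError("expected between 1 and 20 choices")
    labeled = " ".join(f"{ascii_uppercase[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n\nChoices: {labeled}"

print(format_mcq("Which planet is closest to the Sun?", ["Venus", "Mercury", "Earth", "Mars"]))
```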

Reproducibility notes

Important files from the training folder:

  • SFT trainer: scripts/train_v3_lora_sft_masked.py
  • SFT data builder: scripts/build_my_sft_data_balanced.py
  • DPO trainer used for the unselected experiment: scripts/train_v3_lora_dpo_boxed.py
  • Merge script: scripts/merge_v3_lora_adapter.py
  • Evaluation script: scripts/evaluate_mcq_accuracy.py

Final selected model folder before upload:

outputs/lora_sft_v3_boxed_only/merged_full_model

SFT LoRA adapter:

outputs/lora_sft_v3_boxed_only/final_adapter

DPO adapter, experimental and not selected:

outputs/lora_dpo_arc_csqa_on_sft/final_adapter

Model size: 2B params (Safetensors, tensor type BF16)

Model tree for cs-552-2026-databand/general_knowledge_model

Fine-tuned from Qwen/Qwen3-1.7B