Qwen2.5-1.5B-Instruct — QLoRA Fine-Tuned on OpenThoughts-114k

A QLoRA adapter for Qwen/Qwen2.5-1.5B-Instruct, fine-tuned on curated reasoning traces from OpenThoughts-114k to produce clean, structured, step-by-step solutions.

Key Details


Base Model	Qwen/Qwen2.5-1.5B-Instruct
Method	QLoRA (4-bit NF4 + LoRA)
Dataset	30K samples from OpenThoughts-114k
Hardware	Single NVIDIA T4 (16GB VRAM, free Colab)
Adapter Size	~50MB
Trainable Params	~1.5% of total model parameters

What This Adapter Does

The base Qwen2.5-1.5B-Instruct model produces reasonable answers but tends to be verbose and sometimes loses structure in multi-step reasoning. This adapter improves:

Response conciseness: outputs are about 12% shorter on average (314 vs 275 tokens on the evaluation prompts) without dropping solution steps
Step-by-step structure: cleaner formatting with numbered steps and proper LaTeX math notation
Reasoning accuracy: answers the "5 machines / 5 widgets" trick question correctly, where the base model fails

Training Details

Quantization

BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
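As a rough back-of-the-envelope sketch of what this configuration saves: NF4 stores 4 bits per weight plus per-block quantization constants, and double quantization compresses those constants. The block sizes (64 and 256) follow the defaults described in the QLoRA paper, and the parameter count is an assumed round number, not a measurement of this model. Layers that stay unquantized (for example embeddings) are ignored here.

```python
def bits_per_param(double_quant: bool, block: int = 64, block2: int = 256) -> float:
    """Approximate storage cost of one quantized weight, in bits."""
    if double_quant:
        # 8-bit constant per block, plus a 32-bit constant per group of block2 blocks
        overhead = 8 / block + 32 / (block * block2)
    else:
        # one 32-bit constant per block of weights
        overhead = 32 / block
    return 4 + overhead


def weight_gib(n_params: float, double_quant: bool) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_param(double_quant) / 8 / 2**30


N_PARAMS = 1.5e9  # assumption: round figure for a 1.5B-parameter model

for dq in (False, True):
    print(f"double_quant={dq}: {bits_per_param(dq):.3f} bits/param, "
          f"{weight_gib(N_PARAMS, dq):.2f} GiB")
```

The saving from double quantization is small per weight, but it comes at no extra training cost, which matters on a 16GB card.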

LoRA Configuration

LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
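To make the rank and alpha values concrete, here is a minimal sketch of how many trainable parameters LoRA adds to a single linear layer and how the update is scaled. The layer shape is a hypothetical example chosen for illustration, and the scaling assumes the standard alpha / r rule rather than a rank-stabilized variant.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


R, ALPHA = 32, 64        # values from the LoraConfig above
scaling = ALPHA / R      # the low-rank update B @ A is multiplied by alpha / r

# Hypothetical layer shape, for illustration only
D_IN, D_OUT = 1536, 1536
full = D_IN * D_OUT
added = lora_params(D_IN, D_OUT, R)

print(f"scaling = {scaling}")
print(f"added = {added} params ({added / full:.2%} of the frozen layer)")
```

Keeping alpha at twice the rank is a common choice; it keeps the effective update scale fixed at 2 if the rank is changed while preserving the ratio.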

Training Hyperparameters

Parameter	Value
Epochs	1
Batch size	1 (× 4 gradient accumulation)
Learning rate	2e-4
Scheduler	Cosine with 50-step warmup
Optimizer	Paged AdamW 8-bit
Max sequence length	2048
NEFTune noise alpha	5
Precision	fp16
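The scheduler row can be read as the following sketch: linear warmup over 50 steps to the peak learning rate, then cosine decay. It assumes decay all the way to zero (the usual behavior of a cosine-with-warmup schedule), and the total step count is a hypothetical value, since it depends on how many samples are used.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 2e-4, warmup_steps: int = 50) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


EFFECTIVE_BATCH = 1 * 4   # per-device batch size x gradient accumulation steps
TOTAL_STEPS = 1000        # hypothetical; roughly num_samples / EFFECTIVE_BATCH for 1 epoch

for s in (0, 25, 50, 525, 1000):
    print(s, f"{lr_at(s, TOTAL_STEPS):.2e}")
```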

Data Preprocessing — The Critical Step

The OpenThoughts-114k dataset contains DeepSeek-R1 reasoning traces with two sections:

<|begin_of_thought|>: thousands of tokens of raw internal reasoning
<|begin_of_solution|>: the clean, structured final answer

We train only on the extracted solution block. In our runs, training on the full traces made the model produce rambling, unfocused output, while extracting just the solution with a simple regex gave markedly better results with the same model and the same hyperparameters.

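import re

from transformers import AutoTokenizer

# Assumed setup: the base model's tokenizer, whose chat template is applied below
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")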

def formatting_func(example):
    role_map = {"human": "user", "gpt": "assistant"}
    messages = []

    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})

    for turn in example["conversations"]:
        role = role_map.get(turn["from"], turn["from"])
        content = turn["value"]

        # Extract only the final solution
        if role == "assistant":
            match = re.search(
                r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>",
                content, re.DOTALL,
            )
            if match:
                content = match.group(1).strip()

        messages.append({"role": role, "content": content})

    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
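The extraction step can be tried in isolation, without a tokenizer. The snippet below is a small sketch that applies the same regex to a made-up trace; the toy text and the `extract_solution` helper name are illustrative and not part of the training script.

```python
import re

SOLUTION_RE = re.compile(
    r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>", re.DOTALL
)


def extract_solution(text: str) -> str:
    """Return the solution block if present, otherwise the text unchanged."""
    match = SOLUTION_RE.search(text)
    return match.group(1).strip() if match else text


# Made-up trace in the two-section layout described above
toy = (
    "<|begin_of_thought|>\nLet me think... maybe 4? Wait, check again.\n<|end_of_thought|>\n"
    "<|begin_of_solution|>\n1. Add the numbers: 2 + 2 = 4.\n2. The answer is 4.\n<|end_of_solution|>"
)

print(extract_solution(toy))
```

Note the fallback: when the tags are missing, the full text is kept, so it may be worth counting or filtering such examples before training.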

Response Masking

Labels are set to -100 on all non-assistant tokens, and batches are built with DataCollatorForSeq2Seq, which also pads labels with -100. The cross-entropy loss is therefore computed only on the tokens the model needs to generate at inference time, which improves sample efficiency: every gradient update is focused on useful generation.
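The masking idea can be sketched with plain lists. This is an illustration under stated assumptions, not the training code: the token ids are made up, and `mask_labels` and `pad_batch` are hypothetical helpers that mimic what the tokenization step and the collator do.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default


def mask_labels(input_ids, assistant_mask):
    """Keep labels only where the token belongs to an assistant turn."""
    return [tok if is_asst else IGNORE_INDEX
            for tok, is_asst in zip(input_ids, assistant_mask)]


def pad_batch(label_rows, pad_value=IGNORE_INDEX):
    """Right-pad label rows to equal length so padding never contributes to the loss."""
    width = max(len(row) for row in label_rows)
    return [row + [pad_value] * (width - len(row)) for row in label_rows]


# Made-up ids: three prompt tokens followed by two assistant tokens
ids = [11, 12, 13, 21, 22]
mask = [False, False, False, True, True]

labels = mask_labels(ids, mask)
batch = pad_batch([labels, [IGNORE_INDEX, 31]])
print(labels)
print(batch)
```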

Usage

Load with PEFT

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "rahmasaber/qwen2.5-iq-Finetuning-qlora")
tokenizer = AutoTokenizer.from_pretrained("rahmasaber/qwen2.5-iq-Finetuning-qlora")

model.eval()

Generate

messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step-by-step."},
    {"role": "user", "content": "If 5 machines produce 5 widgets in 5 minutes, how many minutes for 100 machines to produce 100 widgets?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

Compare Base vs Fine-Tuned

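def generate(prompt, max_new_tokens=512):
    # Helper: same sampling settings as in the Generate section
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                temperature=0.7, top_p=0.9, do_sample=True)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Disable adapter → base model behavior
model.disable_adapter_layers()
base_response = generate(prompt)

# Enable adapter → fine-tuned behavior
model.enable_adapter_layers()
ft_response = generate(prompt)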

Evaluation

Tested on 10 handcrafted reasoning prompts across 5 categories:

Category	# Prompts	What it tests
Logic Puzzles	2	Trick questions, careful reading
Math	3	Word problems, sequential operations
Reasoning	2	Formal logic, deductive puzzles
Code	1	Algorithm complexity analysis
Science	2	Physics principles (e.g., Archimedes' principle)

Results vs Base Model

Metric	Base	Fine-Tuned
Avg response length (tokens)	314	275 (-12%)
Correct on "all but 9 sheep"	✅	✅
Correct on average speed (harmonic mean)	✅	✅
Correct on discount stacking (32%)	✅	✅
Correct on 5 machines/5 widgets	❌	✅
Structured step-by-step format	Sometimes	Consistently
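Two numbers in this table can be checked by hand. The sketch below recomputes the length reduction from the reported token counts and works out the reference answer for the machines-and-widgets prompt; the helper names are illustrative only.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


def minutes_needed(machines: int, widgets: int, rate_per_machine: float) -> float:
    """Time for `machines` working in parallel to produce `widgets`."""
    return widgets / (machines * rate_per_machine)


# Average response length: 314 tokens (base) vs 275 tokens (fine-tuned)
print(f"length change: {pct_change(314, 275):.1f}%")

# 5 machines make 5 widgets in 5 minutes, so one machine makes 1/5 widget per minute
rate = 5 / (5 * 5)
print(f"100 machines, 100 widgets: {minutes_needed(100, 100, rate):.0f} minutes")
```

The trick in the widget prompt is that production scales with the number of machines, so the time stays the same as in the 5-machine case.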

Held-Out Test Set

200 examples were held out from the training sample to detect overfitting. The train/test loss gap stayed below 0.5, which suggests the model generalizes rather than memorizes.
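A check of this kind could look like the sketch below. The split procedure, the seed, the placeholder data, and the loss values are all assumptions made for illustration; only the hold-out size (200) and the 0.5 threshold come from the text above.

```python
import random


def split_holdout(examples, n_holdout=200, seed=0):
    """Randomly set aside n_holdout examples; return (train, test)."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    held = set(indices[:n_holdout])
    train = [ex for i, ex in enumerate(examples) if i not in held]
    test = [ex for i, ex in enumerate(examples) if i in held]
    return train, test


def loss_gap_ok(train_loss: float, test_loss: float, max_gap: float = 0.5) -> bool:
    """True if the held-out loss is not much worse than the training loss."""
    return (test_loss - train_loss) < max_gap


data = list(range(1000))            # placeholder for formatted training examples
train, test = split_holdout(data)
print(len(train), len(test))
print(loss_gap_ok(0.9, 1.1))        # made-up loss values
```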

Limitations

Small base model — 1.5B parameters limits complex multi-hop reasoning
1 epoch on a subset of OpenThoughts-114k — more data and epochs would likely improve accuracy
Self-evaluation bias — LLM-as-judge uses the same model family; use a stronger external model (GPT-4, Claude) for rigorous evaluation
Science questions — the fine-tuned model occasionally gets physics wrong (e.g., feather vs bowling ball on Moon)
No benchmark scores — not evaluated on GSM8K, MATH, or HumanEval yet

Files

.
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # LoRA weights (~50MB)
├── tokenizer_config.json      # Tokenizer settings
├── tokenizer.json             # Tokenizer vocabulary
├── special_tokens_map.json    # Special token mappings
└── README.md                  # This file

Citation

@misc{saber2026qwen25qlora,
  title={QLoRA Fine-Tuning Qwen2.5-1.5B-Instruct on OpenThoughts-114k},
  author={Rahma Saber},
  year={2026},
  url={https://huggingface.co/rahmasaber/qwen2.5-iq-Finetuning-qlora}
}

Acknowledgments

Qwen Team for the base model
OpenThoughts for the reasoning dataset
Hugging Face for PEFT, TRL, and the Hub
Google Colab for free GPU access
