Qwen2.5-1.5B-Instruct: QLoRA Fine-Tuned on OpenThoughts-114k

A QLoRA adapter for Qwen/Qwen2.5-1.5B-Instruct, fine-tuned on curated reasoning traces from OpenThoughts-114k to produce clean, structured, step-by-step solutions.

Key Details

Base Model: Qwen/Qwen2.5-1.5B-Instruct
Method: QLoRA (4-bit NF4 + LoRA)
Dataset: 30K samples from OpenThoughts-114k
Hardware: Single NVIDIA T4 (16GB VRAM, free Colab)
Adapter Size: ~50MB
Trainable Params: ~1.5% of total model parameters

What This Adapter Does

The base Qwen2.5-1.5B-Instruct model produces reasonable answers but tends to be verbose and sometimes loses structure in multi-step reasoning. This adapter improves:

  • Response conciseness: ~12% shorter outputs on average (314 → 275 tokens on the evaluation prompts), cutting filler while retaining substance
  • Step-by-step structure: cleaner formatting with numbered steps and proper LaTeX math notation
  • Reasoning accuracy: answers the "5 machines / 5 widgets" trick question correctly where the base model fails, and matches the base model on the other checked prompts

Training Details

Quantization

BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

LoRA Configuration

LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
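
As a rough illustration of why a rank-32 adapter stays small, the sketch below counts the parameters LoRA adds to a single linear projection and computes the alpha / r scaling factor from the config above. The 1536 x 1536 layer shape is an assumption chosen for illustration, not a value stated in this card, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


r, lora_alpha = 32, 64        # values from the LoraConfig above
d_in, d_out = 1536, 1536      # assumed layer shape, for illustration only

added = lora_param_count(d_in, d_out, r)
full = d_in * d_out           # parameters in the frozen weight matrix
scaling = lora_alpha / r      # the low-rank update is scaled by alpha / r

print(f"LoRA params: {added:,} vs full matrix: {full:,} ({added / full:.1%})")
print(f"scaling factor: {scaling}")
```

Only the two low-rank matrices receive gradients; the quantized base weight stays frozen.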

Training Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Batch size | 1 (× 4 gradient accumulation) |
| Learning rate | 2e-4 |
| Scheduler | Cosine with 50-step warmup |
| Optimizer | Paged AdamW 8-bit |
| Max sequence length | 2048 |
| NEFTune noise alpha | 5 |
| Precision | fp16 |
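
To make the scheduler row concrete, here is a minimal sketch of a linear-warmup, cosine-decay learning-rate curve using the peak rate (2e-4) and warmup length (50 steps) listed above. The function name `cosine_lr` and the total step count are hypothetical; the real number of optimizer steps depends on the dataset size and on the effective batch size of 1 × 4 = 4.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-4, warmup_steps: int = 50) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 1000  # hypothetical; not the step count of the actual run
for step in (0, 25, 50, 525, 1000):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The rate climbs linearly for the first 50 steps, peaks at 2e-4, and then follows half a cosine wave down to zero at the final step.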

Data Preprocessing: The Critical Step

The OpenThoughts-114k dataset contains DeepSeek-R1 reasoning traces with two sections:

  • <|begin_of_thought|>: thousands of tokens of raw internal reasoning
  • <|begin_of_solution|>: the clean, structured final answer

We train only on the extracted solution block. Training on the full traces made the model produce rambling, unfocused output. Extracting just the solution with a simple regex gave much cleaner, better-structured output with the same model and the same hyperparameters.

import re

def formatting_func(example):
    role_map = {"human": "user", "gpt": "assistant"}
    messages = []

    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})

    for turn in example["conversations"]:
        role = role_map.get(turn["from"], turn["from"])
        content = turn["value"]

        # Extract only the final solution
        if role == "assistant":
            match = re.search(
                r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>",
                content, re.DOTALL,
            )
            if match:
                content = match.group(1).strip()

        messages.append({"role": role, "content": content})

    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
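
The extraction step can be tried in isolation. The sketch below applies the same regex to a toy assistant turn; the example text is invented for illustration and is not taken from the dataset.

```python
import re

SOLUTION_RE = re.compile(
    r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>", re.DOTALL
)

# Toy assistant turn, invented for illustration.
raw = (
    "<|begin_of_thought|>\nLet me think... maybe 4? No, check again.\n<|end_of_thought|>\n"
    "<|begin_of_solution|>\n1. Add the numbers: 2 + 3 = 5.\n2. The answer is 5.\n<|end_of_solution|>"
)

match = SOLUTION_RE.search(raw)
# Fall back to the raw text when no solution block is found.
solution = match.group(1).strip() if match else raw
print(solution)
```

Note that the fallback keeps the full trace when the tags are missing, so it may be worth counting or filtering unmatched examples before training.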

Response Masking

Labels are set to -100 on all non-assistant tokens, and DataCollatorForSeq2Seq pads the labels with -100 as well, so the cross-entropy loss is computed only on the tokens the model needs to generate at inference time. This improves sample efficiency: every gradient update is focused on useful generation.
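
A minimal sketch of the masking idea, assuming the prompt length in tokens is already known. The helper `mask_prompt_labels` and the toy token ids are hypothetical and only illustrate how -100 labels exclude prompt tokens from the loss; this is not the exact preprocessing code used for this adapter.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids to labels, replacing the first prompt_len tokens with -100."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Toy token ids: the first 4 stand in for system + user tokens,
# the last 3 for the assistant reply.
input_ids = [101, 102, 103, 104, 7, 8, 9]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 7, 8, 9]
```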

Usage

Load with PEFT

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "rahmasaber/qwen2.5-iq-Finetuning-qlora")
tokenizer = AutoTokenizer.from_pretrained("rahmasaber/qwen2.5-iq-Finetuning-qlora")

model.eval()

Generate

messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step-by-step."},
    {"role": "user", "content": "If 5 machines produce 5 widgets in 5 minutes, how many minutes for 100 machines to produce 100 widgets?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

Compare Base vs Fine-Tuned

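# Helper so both runs use the same decoding settings as the Generate section
def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, top_p=0.9, do_sample=True)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Disable adapter → base model behavior
model.disable_adapter_layers()
base_response = generate(prompt)

# Enable adapter → fine-tuned behavior
model.enable_adapter_layers()
ft_response = generate(prompt)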

Evaluation

Tested on 10 handcrafted reasoning prompts across 5 categories:

| Category | # Prompts | What it tests |
|---|---|---|
| Logic Puzzles | 2 | Trick questions, careful reading |
| Math | 3 | Word problems, sequential operations |
| Reasoning | 2 | Formal logic, deductive puzzles |
| Code | 1 | Algorithm complexity analysis |
| Science | 2 | Physics principles, Archimedes |

Results vs Base Model

| Metric | Base | Fine-Tuned |
|---|---|---|
| Avg response length (tokens) | 314 | 275 (-12%) |
| Correct on "all but 9 sheep" | ✅ | ✅ |
| Correct on average speed (harmonic mean) | ✅ | ✅ |
| Correct on discount stacking (32%) | ✅ | ✅ |
| Correct on 5 machines/5 widgets | ❌ | ✅ |
| Structured step-by-step format | Sometimes | Consistently |

Held-Out Test Set

200 examples were held out from the training sample to detect overfitting. The gap between train and test loss stayed below 0.5, which suggests the model generalizes rather than memorizes.
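
One way such a check could be set up is sketched below. The helper names, the seed, and the example loss values are hypothetical placeholders, not measurements from this run; only the 200-example hold-out size and the 0.5 gap threshold come from the text above.

```python
import random


def split_holdout(examples: list, n_holdout: int = 200, seed: int = 42) -> tuple[list, list]:
    """Shuffle a copy of the examples and set aside n_holdout of them for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def loss_gap_ok(train_loss: float, eval_loss: float, max_gap: float = 0.5) -> bool:
    """True when held-out loss exceeds training loss by less than max_gap."""
    return (eval_loss - train_loss) < max_gap


train_set, test_set = split_holdout(list(range(1000)))  # stand-in data
print(len(train_set), len(test_set))  # 800 200
print(loss_gap_ok(train_loss=1.0, eval_loss=1.3))  # placeholder losses, not measured values
```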

Limitations

  • Small base model: 1.5B parameters limits complex multi-hop reasoning
  • 1 epoch on 1.2K-3K samples: more data and epochs would improve accuracy
  • Self-evaluation bias: LLM-as-judge uses the same model family; use a stronger external model (GPT-4, Claude) for rigorous evaluation
  • Science questions: the fine-tuned model occasionally gets physics wrong (e.g., feather vs bowling ball on Moon)
  • No benchmark scores: not evaluated on GSM8K, MATH, or HumanEval yet

Files

.
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # LoRA weights (~50MB)
├── tokenizer_config.json      # Tokenizer settings
├── tokenizer.json             # Tokenizer vocabulary
├── special_tokens_map.json    # Special token mappings
└── README.md                  # This file

Citation

@misc{saber2026qwen25qlora,
  title={QLoRA Fine-Tuning Qwen2.5-1.5B-Instruct on OpenThoughts-114k},
  author={Rahma Saber},
  year={2026},
  url={https://huggingface.co/rahmasaber/qwen2.5-iq-Finetuning-qlora}
}

Acknowledgments
