---
library_name: peft
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
  - qlora
  - lora
  - fine-tuning
  - reasoning
  - qwen2.5
  - openthoughts
  - 4-bit
  - nf4
datasets:
  - open-thoughts/OpenThoughts-114k
language:
  - en
pipeline_tag: text-generation
model-index:
  - name: qwen2.5-iq-Finetuning-qlora
    results: []
---

# Qwen2.5-1.5B-Instruct — QLoRA Fine-Tuned on OpenThoughts-114k

A QLoRA adapter for [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), fine-tuned on curated reasoning traces from [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) to produce clean, structured, step-by-step solutions.

## Key Details

| | |
|---|---|
| **Base Model** | Qwen/Qwen2.5-1.5B-Instruct |
| **Method** | QLoRA (4-bit NF4 + LoRA) |
| **Dataset** | 30K samples from OpenThoughts-114k |
| **Hardware** | Single NVIDIA T4 (16GB VRAM, free Colab) |
| **Adapter Size** | ~50MB |
| **Trainable Params** | ~1.5% of total model parameters |

## What This Adapter Does

The base Qwen2.5-1.5B-Instruct model produces reasonable answers but tends to be verbose and sometimes loses structure in multi-step reasoning. This adapter improves:

- **Response conciseness** — ~12% shorter outputs on average, cutting fluff while retaining substance
- **Step-by-step structure** — cleaner formatting with numbered steps and proper LaTeX math notation
- **Reasoning accuracy** — correct answers on trick questions and logic puzzles where the base model fumbles

## Training Details

### Quantization

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
```

### LoRA Configuration

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```
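
As a rough sanity check on adapter size, the number of LoRA weights can be estimated from the configuration above: each adapted linear layer of shape `d_in × d_out` gains `r * (d_in + d_out)` weights. The sketch below assumes layer dimensions from the published Qwen2.5-1.5B configuration (hidden size 1536, MLP size 8960, 28 layers, 2 key/value heads of dimension 128). These values are not stated in this card, so check them against the base model's `config.json` and treat the output as an estimate.

```python
def lora_param_count(r, shapes, num_layers):
    """Count LoRA weights: each adapted (d_in, d_out) layer adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed dimensions (verify against the base model's config.json).
HIDDEN, MLP, KV_DIM, LAYERS = 1536, 8960, 2 * 128, 28

shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

trainable = lora_param_count(r=32, shapes=shapes, num_layers=LAYERS)
print(f"Estimated LoRA parameters: {trainable:,}")
```

On a loaded `PeftModel`, `model.print_trainable_parameters()` reports the exact count and share.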

### Training Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Batch size | 1 (× 4 gradient accumulation) |
| Learning rate | 2e-4 |
| Scheduler | Cosine with 50-step warmup |
| Optimizer | Paged AdamW 8-bit |
| Max sequence length | 2048 |
| NEFTune noise alpha | 5 |
| Precision | fp16 |
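
To make the table concrete, the sketch below works out the effective batch size, the approximate number of optimizer steps, and the learning rate at a few points of a warmup-plus-cosine schedule. The function names are hypothetical, the step count is approximate (the trainer's own rounding may differ slightly), and the sample count is the one listed under Key Details.

```python
import math

PER_DEVICE_BATCH = 1
GRAD_ACCUM = 4
PEAK_LR = 2e-4
WARMUP_STEPS = 50

def optimizer_steps(num_samples, epochs=1):
    """Approximate optimizer updates for a run, given the effective batch size."""
    effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
    return math.ceil(num_samples / effective_batch) * epochs

def lr_at(step, total_steps):
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

total = optimizer_steps(num_samples=30_000)  # sample count from Key Details
for step in (0, 25, 50, total // 2, total):
    print(step, f"{lr_at(step, total):.2e}")
```

With a batch size of 1 and 4 accumulation steps, each optimizer update sees 4 sequences, so the 50 warmup steps cover the first 200 training examples.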

### Data Preprocessing — The Critical Step

The OpenThoughts-114k dataset contains DeepSeek-R1 reasoning traces with two sections:
- `<|begin_of_thought|>`: thousands of tokens of raw internal reasoning
- `<|begin_of_solution|>`: the clean, structured final answer

**We train only on the extracted solution block.** Training on the full traces caused the model to produce rambling, unfocused output. With the same model and the same hyperparameters, extracting only the solution with a simple regex produced markedly cleaner, more focused output.

```python
import re

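def formatting_func(example):
    """Convert one OpenThoughts example into a chat-templated training string.

    Assumes `tokenizer` (the base model's tokenizer) is already loaded in scope.
    """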
    role_map = {"human": "user", "gpt": "assistant"}
    messages = []

    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})

    for turn in example["conversations"]:
        role = role_map.get(turn["from"], turn["from"])
        content = turn["value"]

        # Extract only the final solution
        if role == "assistant":
            match = re.search(
                r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>",
                content, re.DOTALL,
            )
            if match:
                content = match.group(1).strip()

        messages.append({"role": role, "content": content})

    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
```
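
The extraction step can be tried in isolation. The snippet below is a minimal sketch that applies the same regex to a made-up trace; the trace text and the `<|end_of_thought|>` closing tag are illustrative assumptions, not copied from the dataset, and `extract_solution` is a hypothetical helper name.

```python
import re

SOLUTION_RE = re.compile(
    r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>", re.DOTALL
)

def extract_solution(text):
    """Return the solution block if present, otherwise the text unchanged."""
    match = SOLUTION_RE.search(text)
    return match.group(1).strip() if match else text

# Made-up trace for illustration only.
trace = (
    "<|begin_of_thought|>\nLet me think... 5 machines, 5 widgets...\n<|end_of_thought|>\n"
    "<|begin_of_solution|>\nEach machine makes 1 widget in 5 minutes, so the answer is 5.\n"
    "<|end_of_solution|>"
)
print(extract_solution(trace))
```

Note that, as in `formatting_func`, a turn without a solution block passes through unchanged, so it may be worth counting how many examples fall into that case.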

### Response Masking

Labels for all non-assistant tokens are set to `-100`, and `DataCollatorForSeq2Seq` pads the labels with `-100` as well, so the cross-entropy loss is computed only on the tokens the model must generate at inference time. This improves sample efficiency, because every gradient update is driven by the assistant response.
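
The masking idea can be sketched with plain lists. This is an illustration only: `mask_prompt_labels` is a hypothetical helper, the token ids are toy values, and how the assistant spans are located in a real tokenized conversation is not shown in this card.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def mask_prompt_labels(input_ids, assistant_spans):
    """Copy input_ids into labels, keeping only tokens inside assistant spans."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:  # half-open [start, end) token ranges
        labels[start:end] = input_ids[start:end]
    return labels

# Toy token ids: positions 0-3 are the system/user prompt, 4-6 the assistant reply.
input_ids = [11, 12, 13, 14, 21, 22, 23]
labels = mask_prompt_labels(input_ids, assistant_spans=[(4, 7)])
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```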

## Usage

### Load with PEFT

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "rahmasaber/qwen2.5-iq-Finetuning-qlora")
tokenizer = AutoTokenizer.from_pretrained("rahmasaber/qwen2.5-iq-Finetuning-qlora")

model.eval()
```

### Generate

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step-by-step."},
    {"role": "user", "content": "If 5 machines produce 5 widgets in 5 minutes, how many minutes for 100 machines to produce 100 widgets?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

### Compare Base vs Fine-Tuned

```python
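def generate(prompt, max_new_tokens=512):
    """Helper assumed by this snippet: greedy-decode a reply for a templated prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Disable adapter → base model behavior
model.disable_adapter_layers()
base_response = generate(prompt)

# Enable adapter → fine-tuned behavior
model.enable_adapter_layers()
ft_response = generate(prompt)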
```

## Evaluation

Tested on 10 handcrafted reasoning prompts across 5 categories:

| Category | # Prompts | What it tests |
|---|---|---|
| Logic Puzzles | 2 | Trick questions, careful reading |
| Math | 3 | Word problems, sequential operations |
| Reasoning | 2 | Formal logic, deductive puzzles |
| Code | 1 | Algorithm complexity analysis |
| Science | 2 | Physics principles, Archimedes |

### Results vs Base Model

| Metric | Base | Fine-Tuned |
|---|---|---|
| Avg response length (tokens) | 314 | 275 (-12%) |
| Correct on "all but 9 sheep" | ✅ | ✅ |
| Correct on average speed (harmonic mean) | ✅ | ✅ |
| Correct on discount stacking (32%) | ✅ | ✅ |
| Correct on 5 machines/5 widgets | ❌ | ✅ |
| Structured step-by-step format | Sometimes | Consistently |

### Held-Out Test Set

200 examples were held out from the training sample to detect overfitting. The gap between train and test loss stayed below 0.5, which suggests the model generalizes rather than memorizes.
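
A minimal sketch of this check, using only the standard library, is shown below. The helper names, the seed, and the loss values are placeholders for illustration; they are not the values used or measured in this run.

```python
import random

def split_holdout(examples, holdout_size=200, seed=42):
    """Shuffle a copy of the examples and set aside a fixed-size test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]

def loss_gap_ok(train_loss, test_loss, max_gap=0.5):
    """True when test loss exceeds train loss by less than max_gap."""
    return (test_loss - train_loss) < max_gap

train, test = split_holdout(range(30_000))
print(len(train), len(test))
# Placeholder losses, not measured values.
print(loss_gap_ok(train_loss=1.10, test_loss=1.25))
```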

## Limitations

- **Small base model** — 1.5B parameters limits complex multi-hop reasoning
- **1 epoch on 1.2K-3K samples** — more data and epochs would improve accuracy
- **Self-evaluation bias** — LLM-as-judge uses the same model family; use a stronger external model (GPT-4, Claude) for rigorous evaluation
- **Science questions** — the fine-tuned model occasionally gets physics wrong (e.g., feather vs bowling ball on Moon)
- **No benchmark scores** — not evaluated on GSM8K, MATH, or HumanEval yet

## Files

```
.
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # LoRA weights (~50MB)
├── tokenizer_config.json      # Tokenizer settings
├── tokenizer.json             # Tokenizer vocabulary
├── special_tokens_map.json    # Special token mappings
└── README.md                  # This file
```

## Citation

```bibtex
@misc{saber2026qwen25qlora,
  title={QLoRA Fine-Tuning Qwen2.5-1.5B-Instruct on OpenThoughts-114k},
  author={Rahma Saber},
  year={2026},
  url={https://huggingface.co/rahmasaber/qwen2.5-iq-Finetuning-qlora}
}
```

## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) for the base model
- [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) for the reasoning dataset
- [Hugging Face](https://huggingface.co/) for PEFT, TRL, and the Hub
- [Google Colab](https://colab.research.google.com/) for free GPU access