---
library_name: peft
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
- qlora
- lora
- fine-tuning
- reasoning
- qwen2.5
- openthoughts
- 4-bit
- nf4
datasets:
- open-thoughts/OpenThoughts-114k
language:
- en
pipeline_tag: text-generation
model-index:
- name: qwen2.5-iq-Finetuning-qlora
  results: []
---
# Qwen2.5-1.5B-Instruct: QLoRA Fine-Tuned on OpenThoughts-114k
A QLoRA adapter for [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), fine-tuned on curated reasoning traces from [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) to produce clean, structured, step-by-step solutions.
## Key Details
| | |
|---|---|
| **Base Model** | Qwen/Qwen2.5-1.5B-Instruct |
| **Method** | QLoRA (4-bit NF4 + LoRA) |
| **Dataset** | 30K samples from OpenThoughts-114k |
| **Hardware** | Single NVIDIA T4 (16GB VRAM, free Colab) |
| **Adapter Size** | ~50MB |
| **Trainable Params** | ~1.5% of total model parameters |
## What This Adapter Does
The base Qwen2.5-1.5B-Instruct model produces reasonable answers but tends to be verbose and sometimes loses structure in multi-step reasoning. This adapter improves:
- **Response conciseness:** ~12% shorter outputs on average, cutting fluff while retaining substance
- **Step-by-step structure:** cleaner formatting with numbered steps and proper LaTeX math notation
- **Reasoning accuracy:** correct answers on trick questions and logic puzzles where the base model fumbles
## Training Details
### Quantization
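```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)
```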
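As a rough guide to what this configuration costs in memory, the sketch below estimates storage per quantized weight. It assumes the block sizes described in the QLoRA paper (64 weights per block, 256 first-level constants per second-level block). `n_quantized_params` is a hypothetical placeholder, not a measured count for this model, and layers that stay unquantized (such as embeddings) are not included.

```python
def bits_per_param(double_quant: bool, block_size: int = 64, dq_block_size: int = 256) -> float:
    """Approximate storage per 4-bit weight, including quantization constants."""
    if double_quant:
        # 8-bit constant per block, plus one fp32 constant per dq_block_size blocks
        overhead = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # One fp32 absmax constant per block
        overhead = 32 / block_size
    return 4 + overhead


def quantized_size_gb(n_params: int, double_quant: bool = True) -> float:
    """Size in GB (10^9 bytes) of n_params quantized weights."""
    return n_params * bits_per_param(double_quant) / 8 / 1e9


n_quantized_params = 1_000_000_000  # hypothetical count, for illustration only
print(f"with double quant:    {quantized_size_gb(n_quantized_params, True):.3f} GB")
print(f"without double quant: {quantized_size_gb(n_quantized_params, False):.3f} GB")
```

Under these assumptions, double quantization reduces the per-weight overhead from 0.5 bits to about 0.127 bits.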
### LoRA Configuration
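```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                # adapter rank
    lora_alpha=64,       # scaling numerator (alpha / r = 2)
    lora_dropout=0.05,
    # Adapt every attention and MLP projection
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```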
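To see how the rank translates into trainable parameters, here is a minimal sketch of the standard LoRA count for a single wrapped linear layer. The layer shape is hypothetical and chosen only for illustration; substitute the dimensions from the base model's `config.json` to estimate the real total.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) to each wrapped layer."""
    return r * (in_features + out_features)


# Hypothetical layer shape, for illustration only
in_features, out_features = 1536, 1536
r, lora_alpha = 32, 64

added = lora_param_count(in_features, out_features, r)
full = in_features * out_features
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(f"LoRA params: {added:,} vs. full weight: {full:,} ({added / full:.1%})")
print(f"update scaling: {scaling}")
```

Summing this count over every module listed in `target_modules`, across all layers, gives the adapter's trainable parameter total.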
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Batch size | 1 (× 4 gradient accumulation) |
| Learning rate | 2e-4 |
| Scheduler | Cosine with 50-step warmup |
| Optimizer | Paged AdamW 8-bit |
| Max sequence length | 2048 |
| NEFTune noise alpha | 5 |
| Precision | fp16 |
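
The schedule in the table can be sketched in plain Python as a linear warmup followed by cosine decay. This is an illustrative reimplementation, not the training script itself, and `total_steps` is a hypothetical value since the exact number of optimizer steps is not stated here.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-4, warmup_steps: int = 50) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


per_device_batch_size = 1
gradient_accumulation_steps = 4
# Sequences contributing to each optimizer step
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

total_steps = 1000  # hypothetical, for illustration only
for step in (0, 25, 50, 500, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

With batch size 1 and 4 accumulation steps, each optimizer update averages gradients over 4 sequences.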
### Data Preprocessing: The Critical Step
The OpenThoughts-114k dataset contains DeepSeek-R1 reasoning traces with two sections:
- `<|begin_of_thought|>`: thousands of tokens of raw internal reasoning
- `<|begin_of_solution|>`: the clean, structured final answer
**We train only on the extracted solution block.** Training on the full traces causes the model to produce rambling, unfocused output. Extracting only the solution with a simple regex produced markedly better results: same model, same hyperparameters, very different output quality.
```python
import re

# `tokenizer` is the base model's tokenizer, loaded before this function is called
def formatting_func(example):
    role_map = {"human": "user", "gpt": "assistant"}
    messages = []
    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})
    for turn in example["conversations"]:
        role = role_map.get(turn["from"], turn["from"])
        content = turn["value"]
        # Extract only the final solution
        if role == "assistant":
            match = re.search(
                r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>",
                content, re.DOTALL,
            )
            if match:
                content = match.group(1).strip()
        messages.append({"role": role, "content": content})
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
```
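To check the extraction in isolation, the same regex can be run on a toy trace. The sample string below is invented for illustration and is not taken from the dataset. The `extract_solution` helper is a hypothetical name that mirrors the logic inside `formatting_func`.

```python
import re

SOLUTION_RE = re.compile(r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>", re.DOTALL)


def extract_solution(text: str) -> str:
    """Return the solution block if present, otherwise the text unchanged."""
    match = SOLUTION_RE.search(text)
    return match.group(1).strip() if match else text


# Invented example trace, for illustration only
sample = (
    "<|begin_of_thought|>\nTry x = 1... no. Try x = 2, that works.\n<|end_of_thought|>\n"
    "<|begin_of_solution|>\n**Answer:** x = 2\n<|end_of_solution|>"
)
print(extract_solution(sample))
```

Falling back to the unchanged text keeps turns that have no solution markers instead of dropping them.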
### Response Masking
Labels are set to `-100` on all non-assistant tokens, and `DataCollatorForSeq2Seq` pads the label sequences with `-100` as well, so the cross-entropy loss is computed only on the tokens the model needs to generate at inference time. This improves sample efficiency: every gradient update is focused on useful generation.
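The masking described above can be sketched with plain lists. This is a simplified single-turn illustration, not the training code. The token ids and `prompt_len` are hypothetical, and `mask_prompt_labels` is a made-up helper name.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, hiding the first prompt_len tokens from the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Hypothetical ids: 4 prompt tokens (system + user) followed by 3 assistant tokens
input_ids = [101, 7, 8, 9, 42, 43, 44]
prompt_len = 4

labels = mask_prompt_labels(input_ids, prompt_len)
print(labels)
```

Only the positions that keep their real token ids contribute to the loss. Positions set to `-100` are skipped.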
## Usage
### Load with PEFT
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "rahmasaber/qwen2.5-iq-Finetuning-qlora")
tokenizer = AutoTokenizer.from_pretrained("rahmasaber/qwen2.5-iq-Finetuning-qlora")
model.eval()
```
### Generate
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step-by-step."},
    {"role": "user", "content": "If 5 machines produce 5 widgets in 5 minutes, how many minutes for 100 machines to produce 100 widgets?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
### Compare Base vs Fine-Tuned
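```python
# Helper assumed by this snippet; reuses `model`, `tokenizer`, and `prompt` from above
def generate(prompt, max_new_tokens=512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Greedy decoding keeps the two runs directly comparable
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Disable adapter → base model behavior
model.disable_adapter_layers()
base_response = generate(prompt)
# Enable adapter → fine-tuned behavior
model.enable_adapter_layers()
ft_response = generate(prompt)
```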
## Evaluation
Tested on 10 handcrafted reasoning prompts across 5 categories:
| Category | # Prompts | What it tests |
|---|---|---|
| Logic Puzzles | 2 | Trick questions, careful reading |
| Math | 3 | Word problems, sequential operations |
| Reasoning | 2 | Formal logic, deductive puzzles |
| Code | 1 | Algorithm complexity analysis |
| Science | 2 | Physics principles, Archimedes |
### Results vs Base Model
| Metric | Base | Fine-Tuned |
|---|---|---|
| Avg response length (tokens) | 314 | 275 (-12%) |
| Correct on "all but 9 sheep" | ✅ | ✅ |
| Correct on average speed (harmonic mean) | ✅ | ✅ |
| Correct on discount stacking (32%) | ✅ | ✅ |
| Correct on 5 machines/5 widgets | ❌ | ✅ |
| Structured step-by-step format | Sometimes | Consistently |
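
The length reduction in the first row follows directly from the two averages. A one-line check, using only the numbers in the table:

```python
def relative_change_pct(base: float, new: float) -> float:
    """Percentage change from base to new (negative means shorter)."""
    return (new - base) / base * 100


print(f"{relative_change_pct(314, 275):.1f}%")  # -12.4%
```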
### Held-Out Test Set
200 examples were held out from the training sample to detect overfitting. The train/test loss gap stayed below 0.5, which suggests the model generalizes rather than memorizes.
## Limitations
- **Small base model:** 1.5B parameters limit complex multi-hop reasoning
- **1 epoch on 1.2K-3K samples:** more data and epochs would improve accuracy
- **Self-evaluation bias:** LLM-as-judge uses the same model family; use a stronger external model (GPT-4, Claude) for rigorous evaluation
- **Science questions:** the fine-tuned model occasionally gets physics wrong (e.g., feather vs bowling ball on the Moon)
- **No benchmark scores:** not evaluated on GSM8K, MATH, or HumanEval yet
## Files
```
.
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # LoRA weights (~50MB)
├── tokenizer_config.json      # Tokenizer settings
├── tokenizer.json             # Tokenizer vocabulary
├── special_tokens_map.json    # Special token mappings
└── README.md # This file
```
## Citation
```bibtex
@misc{saber2026qwen25qlora,
title={QLoRA Fine-Tuning Qwen2.5-1.5B-Instruct on OpenThoughts-114k},
author={Rahma Saber},
year={2026},
url={https://huggingface.co/rahmasaber/qwen2.5-iq-Finetuning-qlora}
}
```
## Acknowledgments
- [Qwen Team](https://huggingface.co/Qwen) for the base model
- [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) for the reasoning dataset
- [Hugging Face](https://huggingface.co/) for PEFT, TRL, and the Hub
- [Google Colab](https://colab.research.google.com/) for free GPU access