Qwen2.5-1.5B-Instruct: QLoRA Fine-Tuned on OpenThoughts-114k
A QLoRA adapter for Qwen/Qwen2.5-1.5B-Instruct, fine-tuned on curated reasoning traces from OpenThoughts-114k to produce clean, structured, step-by-step solutions.
Key Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-1.5B-Instruct |
| Method | QLoRA (4-bit NF4 + LoRA) |
| Dataset | 30K samples from OpenThoughts-114k |
| Hardware | Single NVIDIA T4 (16GB VRAM, free Colab) |
| Adapter Size | ~50MB |
| Trainable Params | ~1.5% of total model parameters |
What This Adapter Does
The base Qwen2.5-1.5B-Instruct model produces reasonable answers but tends to be verbose and sometimes loses structure in multi-step reasoning. This adapter improves:
- Response conciseness: ~12% shorter outputs on average, cutting filler while retaining substance
- Step-by-step structure: cleaner formatting with numbered steps and proper LaTeX math notation
- Reasoning accuracy: correct answers on trick questions and logic puzzles where the base model fumbles
Training Details
Quantization
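import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 (T4 has no bf16 support)
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)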
LoRA Configuration
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    # Adapt every attention and MLP projection
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
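To make the roles of `r` and `lora_alpha` concrete, here is a minimal sketch of how the added parameter count and the update scaling follow from the config above. The helper names `lora_param_count` and `lora_scaling` are hypothetical, and the 1024 × 1024 layer shape is purely illustrative; it is not a real Qwen2.5 dimension.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Standard LoRA scales the low-rank update B @ A by lora_alpha / r."""
    return lora_alpha / r


# Values from the LoraConfig above.
r, lora_alpha = 32, 64

# Illustrative layer shape (an assumption, not a real Qwen2.5 dimension).
extra = lora_param_count(d_in=1024, d_out=1024, r=r)
print(extra)                        # 65536
print(lora_scaling(lora_alpha, r))  # 2.0
```

Summing `lora_param_count` over every targeted projection in every layer gives the adapter's total trainable parameter count.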
Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Batch size | 1 (× 4 gradient accumulation steps) |
| Learning rate | 2e-4 |
| Scheduler | Cosine with 50-step warmup |
| Optimizer | Paged AdamW 8-bit |
| Max sequence length | 2048 |
| NEFTune noise alpha | 5 |
| Precision | fp16 |
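The learning-rate schedule in the table can be sketched in plain Python. This follows the usual linear-warmup-then-cosine-decay shape; `lr_at_step` is a hypothetical helper rather than part of the training script, and `total_steps` is an assumed value because the actual step count depends on the dataset size and accumulation settings.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-4,
               warmup_steps: int = 50) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 1000  # assumed for illustration only
for step in (0, 25, 50, 525, 1000):
    print(step, lr_at_step(step, total_steps))
```

With a per-device batch size of 1 and 4 accumulation steps, each of these optimizer steps sees 4 sequences.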
Data Preprocessing: The Critical Step
The OpenThoughts-114k dataset contains DeepSeek-R1 reasoning traces with two sections:
- <|begin_of_thought|>: thousands of tokens of raw internal reasoning
- <|begin_of_solution|>: the clean, structured final answer
We train only on the extracted solution block. Training on the full traces causes the model to produce rambling, unfocused output. Extracting only the solution with a simple regex produced dramatically better results: same model, same hyperparameters, completely different output quality.
import re

# Assumes `tokenizer` has already been loaded for the base model.
def formatting_func(example):
    role_map = {"human": "user", "gpt": "assistant"}
    messages = []
    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})
    for turn in example["conversations"]:
        role = role_map.get(turn["from"], turn["from"])
        content = turn["value"]
        # Extract only the final solution
        if role == "assistant":
            match = re.search(
                r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>",
                content, re.DOTALL,
            )
            if match:
                content = match.group(1).strip()
        messages.append({"role": role, "content": content})
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
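The extraction step can be tried on its own, without a tokenizer. The sketch below applies the same regex to a made-up trace; the sample text and the `<|end_of_thought|>` marker are assumptions for illustration, and `extract_solution` is a hypothetical helper name.

```python
import re

SOLUTION_RE = re.compile(
    r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>", re.DOTALL
)


def extract_solution(content: str) -> str:
    """Return the solution block if present, otherwise the content unchanged."""
    match = SOLUTION_RE.search(content)
    return match.group(1).strip() if match else content


# Made-up trace in the dataset's marker format (not a real OpenThoughts sample).
trace = (
    "<|begin_of_thought|>\nHmm, 2 + 2... let me double check...\n<|end_of_thought|>\n"
    "<|begin_of_solution|>\nThe answer is 4.\n<|end_of_solution|>"
)
print(extract_solution(trace))          # The answer is 4.
print(extract_solution("No markers."))  # No markers.
```

Falling back to the unchanged content keeps samples without solution markers in the dataset rather than silently emptying them.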
Response Masking
Labels are set to -100 on all non-assistant tokens, and DataCollatorForSeq2Seq pads the labels with -100 as well, so the cross-entropy loss is computed only on the tokens the model needs to generate at inference time. This improves sample efficiency: every gradient update is focused on useful generation.
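As a rough illustration of the masking idea, the sketch below builds a label list from toy token ids. How the assistant positions are identified in the actual training script is not shown here, so `assistant_mask` is simply given as input; `mask_non_assistant` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_non_assistant(input_ids, assistant_mask):
    """Copy input_ids into labels, replacing non-assistant positions with -100."""
    return [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(input_ids, assistant_mask)
    ]


# Toy token ids: three prompt tokens followed by two assistant tokens.
input_ids = [101, 102, 103, 201, 202]
assistant_mask = [False, False, False, True, True]

labels = mask_non_assistant(input_ids, assistant_mask)
print(labels)  # [-100, -100, -100, 201, 202]
```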
Usage
Load with PEFT
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "rahmasaber/qwen2.5-iq-Finetuning-qlora")
tokenizer = AutoTokenizer.from_pretrained("rahmasaber/qwen2.5-iq-Finetuning-qlora")
model.eval()
Generate
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step-by-step."},
    {"role": "user", "content": "If 5 machines produce 5 widgets in 5 minutes, how many minutes for 100 machines to produce 100 widgets?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
Compare Base vs Fine-Tuned
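# Helper wrapping the generation code above; greedy decoding is assumed here so both runs are comparable
def generate(prompt, max_new_tokens=512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# Disable adapter: base model behavior
model.disable_adapter_layers()
base_response = generate(prompt)
# Enable adapter: fine-tuned behavior
model.enable_adapter_layers()
ft_response = generate(prompt)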
Evaluation
Tested on 10 handcrafted reasoning prompts across 5 categories:
| Category | # Prompts | What it tests |
|---|---|---|
| Logic Puzzles | 2 | Trick questions, careful reading |
| Math | 3 | Word problems, sequential operations |
| Reasoning | 2 | Formal logic, deductive puzzles |
| Code | 1 | Algorithm complexity analysis |
| Science | 2 | Physics principles, Archimedes |
Results vs Base Model
| Metric | Base | Fine-Tuned |
|---|---|---|
| Avg response length (tokens) | 314 | 275 (-12%) |
| Correct on "all but 9 sheep" | β | β |
| Correct on average speed (harmonic mean) | β | β |
| Correct on discount stacking (32%) | β | β |
| Correct on 5 machines/5 widgets | β | β |
| Structured step-by-step format | Sometimes | Consistently |
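A few figures in this table can be sanity-checked with simple arithmetic. The sketch below recomputes the length reduction from the reported token counts and shows reference calculations for two of the trick-question types. The exact evaluation prompts are not reproduced in this card, so the 20% and 15% discounts and the 60/40 speeds are assumed example inputs, and the helper names are hypothetical.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


def stacked_discount(*discounts: float) -> float:
    """Total discount when each discount applies to the already-reduced price."""
    remaining = 1.0
    for d in discounts:
        remaining *= 1.0 - d
    return 1.0 - remaining


def average_speed(out_speed: float, back_speed: float) -> float:
    """Average speed over two equal distances is the harmonic mean of the speeds."""
    return 2 * out_speed * back_speed / (out_speed + back_speed)


print(round(pct_change(314, 275), 1))          # -12.4
print(round(stacked_discount(0.20, 0.15), 2))  # 0.32 (assumed inputs)
print(average_speed(60, 40))                   # 48.0 (assumed inputs)
```

The stacked-discount case is a trick question because the naive sum (35%) differs from the compounded result (32%); the average-speed case is one because the arithmetic mean (50) differs from the harmonic mean (48).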
Held-Out Test Set
200 examples were held out from the training sample to detect overfitting. The train/test loss gap stayed below 0.5, which suggests the model generalizes rather than memorizes.
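One way such a hold-out check could be set up is sketched below. The seed, the helper names, and the loss values in the example call are all assumptions for illustration; only the 200-example hold-out size, the 30K sample size, and the 0.5 gap threshold come from this card.

```python
import random


def split_holdout(examples, n_holdout=200, seed=42):
    """Shuffle a copy of the examples and set aside n_holdout of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def gap_is_healthy(train_loss, test_loss, max_gap=0.5):
    """True when test loss exceeds train loss by less than max_gap."""
    return (test_loss - train_loss) < max_gap


train, test = split_holdout(range(30_000))
print(len(train), len(test))                          # 29800 200
print(gap_is_healthy(train_loss=0.9, test_loss=1.1))  # True (loss values are made up)
```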
Limitations
- Small base model: 1.5B parameters limits complex multi-hop reasoning
- 1 epoch on 1.2K-3K samples: more data and epochs would improve accuracy
- Self-evaluation bias: LLM-as-judge uses the same model family; use a stronger external model (GPT-4, Claude) for rigorous evaluation
- Science questions: the fine-tuned model occasionally gets physics wrong (e.g., feather vs bowling ball on Moon)
- No benchmark scores: not evaluated on GSM8K, MATH, or HumanEval yet
Files
.
├── adapter_config.json          # LoRA configuration
├── adapter_model.safetensors    # LoRA weights (~50MB)
├── tokenizer_config.json        # Tokenizer settings
├── tokenizer.json               # Tokenizer vocabulary
├── special_tokens_map.json      # Special token mappings
└── README.md                    # This file
Citation
@misc{saber2026qwen25qlora,
  title={QLoRA Fine-Tuning Qwen2.5-1.5B-Instruct on OpenThoughts-114k},
  author={Rahma Saber},
  year={2026},
  url={https://huggingface.co/rahmasaber/qwen2.5-iq-Finetuning-qlora}
}
Acknowledgments
- Qwen Team for the base model
- OpenThoughts for the reasoning dataset
- Hugging Face for PEFT, TRL, and the Hub
- Google Colab for free GPU access