	---
	library_name: peft
	license: apache-2.0
	base_model: Qwen/Qwen2.5-1.5B-Instruct
	tags:
	- qlora
	- lora
	- fine-tuning
	- reasoning
	- qwen2.5
	- openthoughts
	- 4-bit
	- nf4
	datasets:
	- open-thoughts/OpenThoughts-114k
	language:
	- en
	pipeline_tag: text-generation
	model-index:
	- name: qwen2.5-iq-Finetuning-qlora
	  results: []
	---

	# Qwen2.5-1.5B-Instruct — QLoRA Fine-Tuned on OpenThoughts-114k

	A QLoRA adapter for [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), fine-tuned on curated reasoning traces from [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) to produce clean, structured, step-by-step solutions.

	## Key Details

	| Property | Value |
	|---|---|
	| Base Model | Qwen/Qwen2.5-1.5B-Instruct |
	| Method | QLoRA (4-bit NF4 + LoRA) |
	| Dataset | 30K samples from OpenThoughts-114k |
	| Hardware | Single NVIDIA T4 (16GB VRAM, free Colab) |
	| Adapter Size | ~50MB |
	| Trainable Params | ~1.5% of total model parameters |

	## What This Adapter Does

	The base Qwen2.5-1.5B-Instruct model produces reasonable answers but tends to be verbose and sometimes loses structure in multi-step reasoning. This adapter improves:

	- Response conciseness: about 12% shorter outputs on average (275 vs. 314 tokens on the evaluation prompts), cutting filler while retaining substance
	- Step-by-step structure: cleaner formatting with numbered steps and proper LaTeX math notation
	- Reasoning accuracy: a correct answer on the "5 machines, 5 widgets" trick question, which the base model gets wrong

	## Training Details

	### Quantization

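	```python
	import torch
	from transformers import BitsAndBytesConfig

	bnb_config = BitsAndBytesConfig(
	    load_in_4bit=True,                     # store base weights in 4-bit
	    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
	    bnb_4bit_compute_dtype=torch.float16,  # dequantize and compute in fp16
	    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
	)
	```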

	### LoRA Configuration

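	```python
	from peft import LoraConfig

	lora_config = LoraConfig(
	    r=32,                 # adapter rank
	    lora_alpha=64,        # scaling factor (alpha / r = 2)
	    lora_dropout=0.05,
	    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
	                    "gate_proj", "up_proj", "down_proj"],     # MLP projections
	    bias="none",
	    task_type="CAUSAL_LM",
	)
	```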

	### Training Hyperparameters

	| Parameter | Value |
	|---|---|
	| Epochs | 1 |
	| Batch size | 1 (× 4 gradient accumulation) |
	| Learning rate | 2e-4 |
	| Scheduler | Cosine with 50-step warmup |
	| Optimizer | Paged AdamW 8-bit |
	| Max sequence length | 2048 |
	| NEFTune noise alpha | 5 |
	| Precision | fp16 |
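
As a rough illustration of the scheduler row above, the sketch below computes the learning rate for a linear 50-step warmup followed by cosine decay to zero. The peak rate and warmup length come from the table; `total_steps=1000` is a placeholder assumption, not the actual length of this training run, and `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(step, total_steps, peak_lr=2e-4, warmup_steps=50):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# total_steps=1000 is a hypothetical value chosen only for illustration.
for s in (0, 25, 50, 525, 1000):
    print(s, f"{cosine_lr(s, total_steps=1000):.2e}")
```

The rate climbs linearly to 2e-4 over the first 50 steps, passes through half the peak midway through the decay phase, and reaches zero on the final step.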

	### Data Preprocessing — The Critical Step

	The OpenThoughts-114k dataset contains DeepSeek-R1 reasoning traces with two sections:
	- `<|begin_of_thought|>`: thousands of tokens of raw internal reasoning
	- `<|begin_of_solution|>`: the clean, structured final answer

	We train only on the extracted solution block. Training on the full traces causes the model to produce rambling, unfocused output, whereas extracting only the solution with a simple regex gave much better output quality with the same model and the same hyperparameters.

	```python
	import re

	# Assumes `tokenizer` (the Qwen2.5 tokenizer) is already loaded in the enclosing scope.
	def formatting_func(example):
	    role_map = {"human": "user", "gpt": "assistant"}
	    messages = []

	    if example.get("system"):
	        messages.append({"role": "system", "content": example["system"]})

	    for turn in example["conversations"]:
	        role = role_map.get(turn["from"], turn["from"])
	        content = turn["value"]

	        # Extract only the final solution; keep the full text if no markers are found
	        if role == "assistant":
	            match = re.search(
	                r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>",
	                content, re.DOTALL,
	            )
	            if match:
	                content = match.group(1).strip()

	        messages.append({"role": role, "content": content})

	    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
	```
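
To see the extraction step in isolation, here is a minimal sketch that needs no tokenizer. It assumes the solution is delimited by `<|begin_of_solution|>` and `<|end_of_solution|>`; the helper name `extract_solution` and the sample trace are made up for illustration and are not taken from the dataset.

```python
import re

SOLUTION_RE = re.compile(
    r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>", re.DOTALL
)

def extract_solution(text):
    """Return the solution block if present, otherwise the text unchanged."""
    match = SOLUTION_RE.search(text)
    return match.group(1).strip() if match else text

# Made-up trace for illustration; not a real dataset row.
sample = (
    "<|begin_of_thought|>\nLet me try 2 + 2... maybe 5? No, 4.\n<|end_of_thought|>\n"
    "<|begin_of_solution|>\n1. Add the numbers: 2 + 2 = 4.\n<|end_of_solution|>"
)
print(extract_solution(sample))
print(extract_solution("no markers here"))
```

The pipes are escaped because an unescaped `|` is the alternation operator in a regular expression. `re.DOTALL` lets `.` span the newlines inside a multi-line solution.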

	### Response Masking

	Non-assistant tokens get the label `-100`, and `DataCollatorForSeq2Seq` pads each batch's labels with the same value, so the cross-entropy loss is computed only on the tokens the model needs to generate at inference time. This improves sample efficiency, because every gradient update is focused on useful generation.
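
The masking idea can be sketched in plain Python. This is an illustration under simplifying assumptions (a single-turn example whose prompt length is known, made-up token ids), not the preprocessing code used for this adapter; `mask_prompt_labels` and `pad_batch` are hypothetical helper names.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids to labels, ignoring the first `prompt_len` tokens."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])

def pad_batch(batch_labels, pad_value=IGNORE_INDEX):
    """Right-pad every label list to the longest one in the batch."""
    width = max(len(labels) for labels in batch_labels)
    return [labels + [pad_value] * (width - len(labels)) for labels in batch_labels]

# Hypothetical token ids: 3 prompt tokens followed by 2 assistant tokens.
labels = mask_prompt_labels([11, 12, 13, 21, 22], prompt_len=3)
batch = pad_batch([labels, [IGNORE_INDEX, 31]])
print(batch)
```

Only positions holding a real token id (here 21, 22, and 31) contribute to the loss. Prompt positions and padding are both skipped.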

	## Usage

	### Load with PEFT

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
	from peft import PeftModel
	import torch

	# Load base model in 4-bit
	bnb_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_quant_type="nf4",
	    bnb_4bit_compute_dtype=torch.float16,
	    bnb_4bit_use_double_quant=True,
	)

	base_model = AutoModelForCausalLM.from_pretrained(
	    "Qwen/Qwen2.5-1.5B-Instruct",
	    quantization_config=bnb_config,
	    device_map="auto",
	    trust_remote_code=True,
	)

	# Load adapter
	model = PeftModel.from_pretrained(base_model, "rahmasaber/qwen2.5-iq-Finetuning-qlora")
	tokenizer = AutoTokenizer.from_pretrained("rahmasaber/qwen2.5-iq-Finetuning-qlora")

	model.eval()
	```

	### Generate

	```python
	messages = [
	    {"role": "system", "content": "You are a helpful assistant that thinks step-by-step."},
	    {"role": "user", "content": "If 5 machines produce 5 widgets in 5 minutes, how many minutes for 100 machines to produce 100 widgets?"},
	]

	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

	with torch.no_grad():
	    output = model.generate(
	        **inputs,
	        max_new_tokens=512,
	        temperature=0.7,
	        top_p=0.9,
	        do_sample=True,
	    )

	# Decode only the newly generated tokens, skipping the prompt
	response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
	print(response)
	```

	### Compare Base vs Fine-Tuned

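	```python
	# Helper reusing `model`, `tokenizer`, and `prompt` from the snippets above.
	# Greedy decoding is used here so that the two runs are directly comparable.
	def generate(prompt, max_new_tokens=512):
	    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	    with torch.no_grad():
	        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
	    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

	# Disable adapter → base model behavior
	model.disable_adapter_layers()
	base_response = generate(prompt)

	# Enable adapter → fine-tuned behavior
	model.enable_adapter_layers()
	ft_response = generate(prompt)
	```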

	## Evaluation

	Tested on 10 handcrafted reasoning prompts across 5 categories:

	| Category | # Prompts | What it tests |
	|---|---|---|
	| Logic Puzzles | 2 | Trick questions, careful reading |
	| Math | 3 | Word problems, sequential operations |
	| Reasoning | 2 | Formal logic, deductive puzzles |
	| Code | 1 | Algorithm complexity analysis |
	| Science | 2 | Physics principles, Archimedes |

	### Results vs Base Model

	| Metric | Base | Fine-Tuned |
	|---|---|---|
	| Avg response length (tokens) | 314 | 275 (-12%) |
	| Correct on "all but 9 sheep" | ✅ | ✅ |
	| Correct on average speed (harmonic mean) | ✅ | ✅ |
	| Correct on discount stacking (32%) | ✅ | ✅ |
	| Correct on 5 machines/5 widgets | ❌ | ✅ |
	| Structured step-by-step format | Sometimes | Consistently |

	### Held-Out Test Set

	200 examples were held out from the training sample to detect overfitting. The gap between train and test loss stayed below 0.5, which suggests the model generalizes rather than memorizes.
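
A held-out split of this kind could be produced as in the sketch below. The helper name `split_held_out`, the seed, and the stand-in data are assumptions for illustration; the actual split procedure used for this adapter is not documented here.

```python
import random

def split_held_out(examples, n_test=200, seed=42):
    """Shuffle deterministically and hold out `n_test` examples for evaluation."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    test_idx = set(indices[:n_test])
    train = [ex for i, ex in enumerate(examples) if i not in test_idx]
    test = [examples[i] for i in indices[:n_test]]
    return train, test

# Stand-in data: integers in place of real training examples.
train, test = split_held_out(list(range(1000)))
print(len(train), len(test))
```

Fixing the seed keeps the held-out set identical across runs, so the train and test losses stay comparable between experiments.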

	## Limitations

	- Small base model — 1.5B parameters limits complex multi-hop reasoning
	- 1 epoch on 1.2K-3K samples — more data and epochs would improve accuracy
	- Self-evaluation bias — LLM-as-judge uses the same model family; use a stronger external model (GPT-4, Claude) for rigorous evaluation
	- Science questions — the fine-tuned model occasionally gets physics wrong (e.g., feather vs bowling ball on Moon)
	- No benchmark scores — not evaluated on GSM8K, MATH, or HumanEval yet

	## Files

	```
	.
	├── adapter_config.json # LoRA configuration
	├── adapter_model.safetensors # LoRA weights (~50MB)
	├── tokenizer_config.json # Tokenizer settings
	├── tokenizer.json # Tokenizer vocabulary
	├── special_tokens_map.json # Special token mappings
	└── README.md # This file
	```

	## Citation

	```bibtex
	@misc{saber2026qwen25qlora,
	  title={QLoRA Fine-Tuning Qwen2.5-1.5B-Instruct on OpenThoughts-114k},
	  author={Rahma Saber},
	  year={2026},
	  url={https://huggingface.co/rahmasaber/qwen2.5-iq-Finetuning-qlora}
	}
	```

	## Acknowledgments

	- [Qwen Team](https://huggingface.co/Qwen) for the base model
	- [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) for the reasoning dataset
	- [Hugging Face](https://huggingface.co/) for PEFT, TRL, and the Hub
	- [Google Colab](https://colab.research.google.com/) for free GPU access