---
base_model:
- Rakushaking/Qwen4b-SFT-d9
datasets:
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dpo
- unsloth
- qwen
- alignment
- structured-data
- chain-of-thought
---

# Qwen3-4B SFT + General DPO (Round 1)

This model is a fine-tuned version of **Rakushaking/Qwen4b-SFT-d9** using **Direct Preference Optimization (DPO)** via the **Unsloth** library.

This repository contains the **fully merged 16-bit weights**. No adapter loading is required.

## Training Objective
This model has been optimized using DPO with a general-purpose preference dataset (u-10bei/dpo-dataset-qwen-cot) to improve overall Chain-of-Thought reasoning quality and structured output consistency. This round focuses on **broad output quality improvement** rather than format-specific corrections.

## Training Pipeline
1. **Base**: Qwen/Qwen3-4B-Instruct-2507
2. **SFT**: Structured data generation/conversion with Chain-of-Thought (V4+V5, ~9.3k samples)
3. **DPO Round 1 (this model)**: Generic preference optimization (u-10bei/dpo-dataset-qwen-cot, ~4,040 pairs)

## DPO Data
- **Dataset**: u-10bei/dpo-dataset-qwen-cot
- **Pairs**: ~4,040 preference pairs
- **Content**: General Chain-of-Thought quality preferences, where chosen responses show better reasoning structure and output accuracy than rejected responses

## Training Configuration
- **Base model**: Rakushaking/Qwen4b-SFT-d9-merged
- **Method**: DPO (Direct Preference Optimization)
- **Epochs**: 1
- **Learning rate**: 5e-07
- **Beta**: 0.1
- **Max sequence length**: 1024
- **Max prompt length**: 512
- **LoRA Config**: r=8, alpha=16 (merged into base)

## DPO Training Log

| Step | Loss | Accuracy | Margin |
|------|------|----------|--------|
| 50   | 0.685 | 54.9% | 0.022 |
| 100  | 0.643 | 78.9% | 0.108 |
| 150  | 0.591 | 92.8% | 0.223 |
| 200  | 0.543 | 98.6% | 0.336 |

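Accuracy here is the share of preference pairs in which the model ranks the chosen response above the rejected one. It rose from 54.9% at step 50 to 98.6% at step 200, while the loss fell from 0.685 to 0.543, so the model learned to distinguish preferred from rejected outputs.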
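
The Loss and Margin columns can be related with a short calculation. The sketch below assumes the standard sigmoid DPO loss and assumes that Margin is the mean implicit reward margin (chosen minus rejected, already scaled by beta); the card states neither. Under those assumptions the loss of a single pair with margin `m` is `-log(sigmoid(m))`, and because that function is convex, the loss evaluated at the mean margin is a lower bound on the mean loss. The name `dpo_loss_from_margin` is illustrative.

```python
import math


def dpo_loss_from_margin(margin: float) -> float:
    """Sigmoid DPO loss for one pair: -log(sigmoid(margin)) = log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))


# (step, reported mean loss, reported mean margin), copied from the training log table.
training_log = [
    (50, 0.685, 0.022),
    (100, 0.643, 0.108),
    (150, 0.591, 0.223),
    (200, 0.543, 0.336),
]

for step, reported_loss, margin in training_log:
    bound = dpo_loss_from_margin(margin)
    print(
        f"step {step}: loss at mean margin = {bound:.3f}, "
        f"reported mean loss = {reported_loss:.3f}"
    )
```

A margin of zero gives log 2 (about 0.693), the loss of a model that cannot separate the two responses, which fits the step 50 value sitting just below that level.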

## Performance (Parse Success Rate on public_150 benchmark)

| Format | Baseline (pre-train) | After SFT | After DPO Round 1 (this model) |
|--------|---------------------|-----------|-------------------------------|
| JSON   | 100%                | 100%      | **100%**                      |
| CSV    | 100%                | 100%      | **100%**                      |
| YAML   | 97%                 | 94%       | **94%**                       |
| XML    | 80%                 | 95%       | **90%**                       |
| TOML   | 72%                 | 48%       | **52%**                       |
| **Total** | **92.0%**        | **89.3%** | **89.3%**                     |

## Key Observations

### Parse rate vs Evaluation score
Although the overall parse success rate stayed at 89.3%, the same as after SFT (XML fell from 95% to 90% while TOML rose from 48% to 52%), **this model achieved the highest evaluation score at the time of training**. This suggests that general DPO improved aspects that parse rate alone does not capture:

**1. CoT reasoning quality improvement**
General preference optimization refined the model's Chain-of-Thought reasoning, producing more logical and well-structured analysis before generating output.

**2. Output structural accuracy**
Even when outputs parse successfully, their internal structure (correct field mapping, appropriate nesting, data type handling) improved through preference learning.

**3. TOML: 48% → 52% (+4pt)**
Modest improvement in TOML parse rate. Inline table usage remained high (20/25 tasks), indicating that general DPO alone cannot fully resolve format-specific issues.

### Limitation identified
General DPO data does not contain structured-data-specific preferences (e.g. TOML inline table vs section style). This limitation is addressed in DPO Round 2 with format-specific preference pairs.

## Usage
Since this is a merged model, you can use it directly with `transformers`.
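```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the repository ships merged 16-bit weights
    device_map="auto",
)

prompt = 'Convert the following JSON to YAML:\n{"name": "test", "value": 42}'
messages = [
    {"role": "system", "content": "You are an expert in YAML format."},
    {"role": "user", "content": prompt},
]

# return_dict=True yields input_ids and attention_mask, so **inputs works below.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)  # follow device_map instead of hard-coding "cuda"

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```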