---
base_model: Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2
datasets:
- Rakushaking/structured-data-v45-d2
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dpo
- unsloth
- qwen
- alignment
- structured-data
- toml
- json
- yaml
- xml
- csv
---

# Qwen3-4B SFT + General DPO + Specialized DPO

This model is a fine-tuned version of **Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2** using **Direct Preference Optimization (DPO)** via the **Unsloth** library.

This repository contains the **full-merged 16-bit weights**. No adapter loading is required.

## Training Objective
This model has been optimized using DPO to align its responses with preferred outputs, focusing on:
- **TOML**: Prefer `[section]` / `[[array]]` style over inline table `{ key = val }` style (500 pairs)
- **YAML**: Prefer correct indentation over broken indent / tab / extra text (200 pairs)
- **XML**: Prefer well-formed XML over unclosed tags / unescaped `&` / codeblock wrapping (200 pairs)
- **JSON**: Prefer raw JSON output over codeblock-wrapped output (150 pairs)
- **CSV**: Prefer correct headers over mismatched / missing headers (100 pairs)

## Training Pipeline
1. **Base**: Qwen/Qwen3-4B-Instruct-2507
2. **SFT**: Structured data generation/conversion with Chain-of-Thought (V4+V5)
3. **DPO Round 1**: Generic preference optimization (u-10bei/dpo-dataset-qwen-cot)
4. **DPO Round 2 (this model)**: Format-specific preference optimization with programmatically generated chosen/rejected pairs (~1,150 pairs)

## DPO Data Construction
Rejected outputs were **programmatically generated** (not LLM-generated) to ensure consistent quality:
- TOML: Correct TOML parsed via `toml.loads()` then re-serialized as inline tables
- YAML: Indentation randomly corrupted / tabs inserted / explanation text prepended
- XML: Closing tags removed / `&` left unescaped / wrapped in markdown codeblocks
- JSON: Wrapped in markdown codeblocks / explanation text prepended
- CSV: Header columns added or removed / explanation text prepended
- All formats: pairs were filtered to an estimated 900 tokens or fewer to fit within the `max_length=1024` constraint
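
The corruption steps above can be sketched with the standard library alone. This is a minimal illustration, not the script used to build the dataset: every function name, the explanation preamble, and the 4-characters-per-token estimate are assumptions, and the TOML inline-table re-serialization is omitted.

```python
import random

FENCE = "`" * 3  # markdown code fence marker


def reject_json_codeblock(chosen: str) -> str:
    """Wrap a valid JSON string in a markdown code block."""
    return f"{FENCE}json\n{chosen}\n{FENCE}"


def reject_with_explanation(chosen: str, preamble: str = "Here's the converted output:") -> str:
    """Prepend explanation text (the preamble wording is hypothetical)."""
    return f"{preamble}\n\n{chosen}"


def reject_yaml_tabs(chosen: str, rng: random.Random) -> str:
    """Replace the leading spaces of one randomly chosen indented line with a tab."""
    lines = chosen.split("\n")
    indented = [i for i, line in enumerate(lines) if line.startswith(" ")]
    if not indented:
        return "\t" + chosen  # nothing indented: tab-indent the first line instead
    i = rng.choice(indented)
    lines[i] = "\t" + lines[i].lstrip(" ")
    return "\n".join(lines)


def reject_xml_unescaped_amp(chosen: str) -> str:
    """Turn escaped ampersands back into raw '&', which breaks well-formedness."""
    return chosen.replace("&amp;", "&")


def reject_csv_drop_header_column(chosen: str) -> str:
    """Remove the last header column so the header no longer matches the rows."""
    header, sep, rest = chosen.partition("\n")
    cols = header.split(",")
    if len(cols) > 1:
        cols = cols[:-1]
    return ",".join(cols) + sep + rest


def estimate_tokens(text: str) -> int:
    """Crude estimate of about 4 characters per token (an assumption)."""
    return len(text) // 4 + 1


def build_pair(prompt: str, chosen: str, rejected: str, max_tokens: int = 900):
    """Return a DPO record, or None if either side exceeds the token estimate."""
    longest = max(estimate_tokens(prompt + chosen), estimate_tokens(prompt + rejected))
    if longest > max_tokens:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Passing a seeded `random.Random` keeps the YAML corruption reproducible across runs. Whether the original filter counted the prompt together with the completion is not stated, so the combined estimate here is only one plausible reading.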

## Training Configuration
- **Base model**: Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2
- **Method**: DPO (Direct Preference Optimization)
- **Epochs**: 1
- **Learning rate**: 1e-05
- **Beta**: 0.1
- **Max sequence length**: 1024
- **Max prompt length**: 384
- **LoRA Config**: r=8, alpha=16 (merged into base)

## Performance (Parse Success Rate on public_150 benchmark)

| Format | Baseline (before fine-tuning) | After SFT | After DPO Round 2 (this model) |
|--------|-------------------------------|-----------|-------------------------------|
| JSON   | 100%                | 100%      | **100%**                      |
| CSV    | 100%                | 100%      | **100%**                      |
| YAML   | 97%                 | 94%       | **94%**                       |
| XML    | 80%                 | 95%       | **90%**                       |
| TOML   | 72%                 | 48%       | **64%**                       |
| **Total** | **92.0%**        | **89.3%** | **91.3%**                     |

## Improvements over Baseline

While the overall parse success rate is comparable (baseline 92.0% vs. this model 91.3%), this model achieves **higher evaluation scores** than the baseline due to the following qualitative improvements:

### What improved over Baseline

**1. XML: 80% → 90% (+10pt)**
SFT provided extensive training on correct XML structures (tag nesting, repeated elements for arrays). DPO Round 2 additionally trained against outputs with unclosed tags, unescaped `&` characters, and codeblock wrapping. Note that the XML parse rate after DPO Round 2 (90%) is still above the baseline but below the post-SFT figure (95%).

**2. Output cleanliness**

| Issue | Baseline | This model |
|-------|----------|------------|
| Markdown codeblock wrapping (` ```json ... ``` `) | 8 tasks | 4 tasks |
| Explanation text mixed in ("Here's the converted...") | 3 tasks | 0 tasks |
| Language mixing (Chinese characters in English output) | 1 task | 0 tasks |
| Factual errors (e.g. Rosetta Stone wrong era/discoverer) | Multiple | 0 (switched to fictional data) |
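
Since a few outputs are still wrapped in a markdown codeblock, downstream code may want a defensive cleanup step before parsing. The helper below is a hypothetical post-processing sketch, not part of the model or its evaluation; it only handles the case where the whole output is a single fenced block.

```python
import re

# Matches an output that consists of exactly one fenced block, with an
# optional language tag after the opening fence.
_FENCE_RE = re.compile(r"^\s*`{3}[\w+-]*[ \t]*\n(.*?)\n?`{3}\s*$", re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return the fenced payload if the text is one code block, else the stripped text."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text.strip()
```

Outputs that are not fenced pass through with only surrounding whitespace removed, so the helper is safe to apply to every response.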

**3. Chain-of-Thought reasoning quality**
The baseline produced raw output without reasoning. This model generates structured CoT before each output:
- Structure analysis ("It represents a dictionary with N root fields...")
- Format rule identification ("TOML requires strict syntax...")
- Step-by-step conversion plan

This improves the structural accuracy of the generated data, which is reflected in evaluation scores even when parse rates are similar.

**4. Factual error avoidance**
The baseline attempted to generate real-world data (e.g. Rosetta Stone) and produced numerous factual errors (wrong era, wrong discoverer, wrong dimensions). This model generates fictional/synthetic data instead, avoiding factual inaccuracies entirely.

### What decreased vs Baseline

| Format | Baseline | This model | Reason |
|--------|----------|------------|--------|
| YAML | 97% | 94% (-3pt) | Minor indentation issues in complex nested structures |
| TOML | 72% | 64% (-8pt) | The baseline used inline tables (shallow structures happened to parse); this model uses `[section]` style, which is more correct but fails on complex nesting |

Note: The TOML decrease in parse rate does not reflect a decrease in output quality. The baseline's 72% relied on inline tables `{ key = val }` which happened to parse for shallow structures but would fail for deeply nested data. This model intentionally uses `[section]` / `[[array]]` style which is the proper TOML convention.

## Usage
Since this is a merged model, you can use it directly with `transformers`.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-toml-xml-yaml-dpo"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

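prompt = "Your question here"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # return input_ids and attention_mask so **inputs works
    return_tensors="pt",
).to(model.device)  # follow the device chosen by device_map="auto"
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))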
```

## Sources & License (IMPORTANT)

* **SFT Data**: u-10bei/structured_data_with_cot_dataset_512_v4, v5 
* **DPO Round 1 Data**: u-10bei/dpo-dataset-qwen-cot
* **DPO Round 2 Data**: Self-generated preference pairs from SFT training data