---
base_model:
- Rakushaking/Qwen4b-SFT-d9
datasets:
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dpo
- unsloth
- qwen
- alignment
- structured-data
- chain-of-thought
---

# Qwen3-4B SFT + General DPO (Round 1)

This model is a fine-tuned version of **Rakushaking/Qwen4b-SFT-d9**, trained with **Direct Preference Optimization (DPO)** via the **Unsloth** library.

This repository contains the **full-merged 16-bit weights**. No adapter loading is required.

## Training Objective

This model was optimized with DPO on a general-purpose preference dataset (u-10bei/dpo-dataset-qwen-cot) to improve overall Chain-of-Thought reasoning quality and the consistency of structured output.

This round focuses on **broad output quality improvement** rather than format-specific corrections.

## Training Pipeline

1. **Base**: Qwen/Qwen3-4B-Instruct-2507
2. **SFT**: Structured data generation/conversion with Chain-of-Thought (V4+V5, ~9.3k samples)
3. **DPO Round 1 (this model)**: General preference optimization (u-10bei/dpo-dataset-qwen-cot, ~4,040 pairs)

## DPO Data

- **Dataset**: u-10bei/dpo-dataset-qwen-cot
- **Pairs**: ~4,040 preference pairs
- **Content**: General Chain-of-Thought quality preferences: chosen responses show better reasoning structure and output accuracy than rejected responses

## Training Configuration

- **Base model**: Rakushaking/Qwen4b-SFT-d9-merged
- **Method**: DPO (Direct Preference Optimization)
- **Epochs**: 1
- **Learning rate**: 5e-07
- **Beta**: 0.1
- **Max sequence length**: 1024
- **Max prompt length**: 512
- **LoRA Config**: r=8, alpha=16 (merged into base)

## DPO Training Log

| Step | Loss | Accuracy | Margin |
|------|------|----------|--------|
| 50 | 0.685 | 54.9% | 0.022 |
| 100 | 0.643 | 78.9% | 0.108 |
| 150 | 0.591 | 92.8% | 0.223 |
| 200 | 0.543 | 98.6% | 0.336 |

The model learned to distinguish preferred from rejected outputs: accuracy rose from 54.9% at step 50 to 98.6% at step 200, while the margin grew from 0.022 to 0.336.

## Performance (Parse Success Rate on public_150 benchmark)

| Format | Baseline (before fine-tuning) | After SFT | After DPO Round 1 (this model) |
|--------|-------------------------------|-----------|--------------------------------|
| JSON | 100% | 100% | **100%** |
| CSV | 100% | 100% | **100%** |
| YAML | 97% | 94% | **94%** |
| XML | 80% | 95% | **90%** |
| TOML | 72% | 48% | **52%** |
| **Total** | **92.0%** | **89.3%** | **89.3%** |

## Key Observations

### Parse rate vs Evaluation score

Although the parse success rate stayed at 89.3% (the same as after SFT), **this model achieved the highest evaluation score at the time of training**. This suggests that general DPO improved aspects that parse rate alone does not capture:

**1. CoT reasoning quality improvement**
General preference optimization refined the model's Chain-of-Thought reasoning, producing more logical and better-structured analysis before the output is generated.

**2. Output structural accuracy**
Even among outputs that parse successfully, the internal structure (correct field mapping, appropriate nesting, data type handling) improved through preference learning.

**3. TOML: 48% → 52% (+4pt)**
A modest improvement in TOML parse rate. Inline table usage remained high (20/25 tasks), indicating that general DPO alone cannot fully resolve format-specific issues.

### Limitation identified

The general DPO data does not contain structured-data-specific preferences (e.g. TOML inline table vs section style). This limitation is addressed in DPO Round 2 with format-specific preference pairs.

## Usage

Since this is a merged model, you can use it directly with `transformers`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2"

# Load the merged 16-bit weights; no adapter is needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

prompt = "Convert the following JSON to YAML:\n{\"name\": \"test\", \"value\": 42}"

# return_dict=True yields input_ids and attention_mask, so the result
# can be unpacked into generate() with **inputs.
inputs = tokenizer.apply_chat_template(
    [{"role": "system", "content": "You are an expert in YAML format."},
     {"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)  # follow the device chosen by device_map="auto"

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```
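
## Reading the DPO Training Log

The loss and margin columns in the DPO Training Log can be cross-checked against each other. The sketch below rests on two assumptions that this card does not state: that the standard sigmoid DPO loss was used, and that the Margin column is the mean reward margin already scaled by beta (the way TRL-style DPO trainers usually log it). Under those assumptions the loss at the mean margin is `-log(sigmoid(margin))`. The helper name `dpo_loss_from_margin` is hypothetical and chosen only for illustration.

```python
import math


def dpo_loss_from_margin(margin: float) -> float:
    """Sigmoid DPO loss for one reward margin: -log(sigmoid(margin)).

    Assumes the margin is already scaled by beta (an assumption, see above).
    """
    # log1p(exp(-m)) is algebraically equal to -log(sigmoid(m)).
    return math.log1p(math.exp(-margin))


# (step, logged loss, logged margin) copied from the DPO Training Log table.
training_log = [
    (50, 0.685, 0.022),
    (100, 0.643, 0.108),
    (150, 0.591, 0.223),
    (200, 0.543, 0.336),
]

for step, logged_loss, margin in training_log:
    estimate = dpo_loss_from_margin(margin)
    print(f"step {step}: logged loss {logged_loss:.3f}, "
          f"-log(sigmoid(margin)) = {estimate:.3f}")
```

With the table's values, each estimate lands within about 0.005 of the logged loss and slightly below it. That is the direction one would expect if the logged loss is a mean over per-example losses, because `-log(sigmoid(m))` is convex in `m`. Treat this as a plausibility check rather than a description of how the numbers were actually produced.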