---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- daichira/structured-3k-mix-sft
language:
- en
license: apache-2.0
library_name: transformers
tags:
- grpo
- reinforcement-learning
- trl
- structured-output
- sft-to-grpo
---

# Qwen3 4B Structured Output GRPO Model

This repository contains a **full model** (merged weights) trained using **GRPO (Group Relative Policy Optimization)**.

The model is the result of a two-stage training process:

1. **SFT (Supervised Fine-Tuning)**: Initial instruction tuning (likely using QLoRA) on `Qwen/Qwen3-4B-Instruct-2507`.
2. **GRPO**: Reinforcement learning on the SFT model to optimize for structured outputs (JSON/XML/etc.) and content quality.

## Model Lineage

- **Original Base Model**: `Qwen/Qwen3-4B-Instruct-2507`
- **SFT Adapter Source**: `./outputs/lora_structeval_t_qwen3_4b_additional_prompt_5e-6` (merged into the base model before GRPO)
- **Training Method**: GRPO (on top of the merged SFT model)

## Training Objective

This model is optimized for **structured reasoning and output**. The GRPO stage rewards:

- Correct parsing of structured formats (e.g., JSON).
- Quality of the content within the structure.

## Training Configuration (GRPO Stage)

- **Dataset**: `daichira/structured-3k-mix-sft`
- **Epochs**: 3
- **Learning Rate**: 5e-07
- **LoRA Configuration**: r=32, alpha=64 (applied during GRPO and merged into the final model)

## Usage

Since this is a full merged model, you can load it directly with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "zerg2187/GRPO_structeval_t_qwen3_v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # or bfloat16
    device_map="auto"
)

prompt = "Generate a JSON object describing a book."
messages = [
    {"role": "user", "content": prompt}
]

# Wrap the message in the model's chat template and append the assistant turn marker
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the templated prompt and move the tensors to the model's device
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# The decoded text contains the prompt followed by the model's reply
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## License

Compliance: This model must be used in compliance with the license of the original `Qwen/Qwen3-4B-Instruct-2507` model and the license of the `daichira/structured-3k-mix-sft` dataset.
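
## Example: Checking That Output Parses

The training objective mentions rewarding output that parses correctly as a structured format such as JSON. The snippet below is a minimal sketch of what such a check could look like. It is not the reward function used to train this model: `strip_code_fence`, `is_valid_json`, and `parse_reward` are hypothetical helper names, and both the binary 1.0/0.0 scoring and the handling of Markdown code fences are assumptions made for illustration.

```python
import json
import re

# Three backticks, built programmatically so this snippet nests cleanly in Markdown
FENCE = "`" * 3
# Matches an optional Markdown code fence (with or without a "json" tag) around the payload
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def strip_code_fence(text):
    """Return the contents of the first fenced block, or the text itself if unfenced."""
    match = FENCE_RE.search(text)
    return (match.group(1) if match else text).strip()


def is_valid_json(text):
    """Return True if the (optionally fenced) text parses as JSON."""
    try:
        json.loads(strip_code_fence(text))
        return True
    except json.JSONDecodeError:
        return False


def parse_reward(completion):
    """Hypothetical binary reward: 1.0 if the completion parses as JSON, else 0.0."""
    return 1.0 if is_valid_json(completion) else 0.0


if __name__ == "__main__":
    print(parse_reward('{"title": "Dune", "pages": 412}'))
    print(parse_reward("Here is a book: title=Dune"))
```

To apply this check to the Usage example, pass only the model's reply to `is_valid_json`, since the decoded text there also contains the prompt. A check like this covers well-formedness only; it says nothing about the quality of the content inside the structure.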