---
library_name: unsloth
base_model: unsloth/gpt-oss-20b
tags:
- text-generation
- unsloth
- gpt-oss
- lora
- finetune
- reasoning
widget:
- text: 'Solve x^5 + 3x^4 - 10 = 3.'
  example_title: Math Problem (Medium Reasoning)
---

# Fine-tuning GPT-OSS 20B with Unsloth for Enhanced Reasoning

This project demonstrates how to fine-tune the OpenAI GPT-OSS 20B model using Unsloth, a library for fast and memory-efficient LLM fine-tuning. The fine-tuning targets the model's reasoning across multiple languages using the `HuggingFaceH4/Multilingual-Thinking` dataset, which contains chain-of-thought examples translated into different languages. The goal is stronger analytical and problem-solving skills, particularly in diverse linguistic contexts.

## Model Details

- **Base Model**: [unsloth/gpt-oss-20b](https://huggingface.co/unsloth/gpt-oss-20b) - A 20 billion parameter model developed by OpenAI, offering advanced language understanding and generation capabilities.
- **Fine-tuned with**: Unsloth LoRA adapters - Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that trains small low-rank matrices instead of the full weights, which sharply reduces the number of trainable parameters and the resources needed for fine-tuning.
- **Dataset**: [HuggingFaceH4/Multilingual-Thinking](https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking) - This dataset comprises reasoning chain-of-thought examples derived from user questions, translated from English into four other languages. It is well suited to teaching models to reason in multiple linguistic settings.

## Key Features

- **Unsloth Optimization**: With Unsloth's optimizations, this fine-tuning process achieves up to 2x faster training and requires 30% less VRAM compared to standard methods. This efficiency is crucial for working with large models like GPT-OSS 20B on consumer-grade hardware or Colab's free tier.
- **Reasoning Effort Control**: GPT-OSS models expose a `reasoning_effort` parameter that lets users explicitly trade answer quality against response speed. Set it to `low` for fast, direct answers, `medium` for a balanced approach, or `high` for complex tasks requiring extensive multi-step reasoning.
- **Multilingual Reasoning**: By training on the `Multilingual-Thinking` dataset, the model has learned to apply reasoning processes in different languages. This makes it more useful for applications that must understand and respond in several languages.
- **PEFT (LoRA)**: Parameter-Efficient Fine-Tuning (PEFT) with LoRA means that only a small fraction of the model's parameters (around 1%) is updated during training. This reduces computational cost, memory footprint, and storage requirements for the fine-tuned model while still improving performance.

## Installation

To reproduce this environment and run the notebook, install Unsloth and the other dependencies. A free Tesla T4 GPU instance on Google Colab is the recommended setup. The leading `!` runs each command from a notebook cell; drop it when working in a regular shell.

```bash
!pip install --upgrade -qqq uv
!uv pip install -qqq "torch>=2.8.0" "triton>=3.4.0" numpy pillow torchvision bitsandbytes "transformers==4.56.2" "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" "unsloth[base] @ git+https://github.com/unslothai/unsloth" git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels
!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo
```

## Quick Start: Inference

Here's how to load the fine-tuned model and run inference with the `reasoning_effort` parameter. This example solves a mathematical problem with 'high' reasoning effort, reasoning in French.

```python
from unsloth import FastLanguageModel
from transformers import TextStreamer

# Load the fine-tuned model from Hugging Face Hub.
# 'max_seq_length' sets the maximum number of tokens for input plus output.
# 'load_in_4bit=True' uses 4-bit quantization for reduced memory usage.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "adityatiwari12/gpt_oss_lora",  # Your fine-tuned model repository ID
    max_seq_length = 1024,
    dtype = None,  # Auto detects the optimal data type based on your GPU
    load_in_4bit = True,
)

# Define the conversation messages. The 'system' message sets the reasoning language
# on its own line, followed by a blank line and the model's persona.
# The 'user' message is the query for the model to solve.
messages = [
    {"role": "system", "content": "reasoning language: French\n\nYou are a helpful assistant that can solve mathematical problems."},
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]

# Apply the chat template and set generation parameters.
# 'add_generation_prompt=True' appends the assistant turn header so the model starts its reply.
# 'reasoning_effort' is set to 'high' for a more detailed problem-solving approach.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high",  # Choose 'low', 'medium', or 'high' for different reasoning depths
).to("cuda")

# Generate the model's response. 'max_new_tokens' limits the output length;
# 64 tokens shows only the start of the reasoning, so raise it for a complete answer.
# 'streamer' prints tokens as they are generated.
_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))
```

## Training

The model was trained on the `HuggingFaceH4/Multilingual-Thinking` dataset, using LoRA adapters for efficiency together with Unsloth's `train_on_responses_only` helper. The helper masks the prompt tokens so that the loss is computed only on the assistant's responses. The model is therefore optimized to produce correct answers rather than to reproduce the entire conversation.

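
The response-only masking described above can be illustrated with a toy sketch. This is not Unsloth's implementation: the helper names `find_subsequence` and `mask_non_response_tokens` and the small integer token IDs are hypothetical, chosen only to show the idea that every label outside an assistant response is set to `-100`, the value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def find_subsequence(haystack, needle, start=0):
    """Return the index of the first occurrence of needle at or after start, or -1."""
    n = len(needle)
    for i in range(start, len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def mask_non_response_tokens(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens following each response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_subsequence(input_ids, response_ids, pos)
        if start == -1:
            break
        body_start = start + len(response_ids)
        # A response runs until the next instruction marker or the end of the sequence.
        end = find_subsequence(input_ids, instruction_ids, body_start)
        if end == -1:
            end = len(input_ids)
        labels[body_start:end] = input_ids[body_start:end]
        pos = end
    return labels


# Toy token IDs (hypothetical): 1 = user marker, 2 = assistant marker, others = content.
input_ids = [1, 10, 11, 2, 20, 21, 22, 1, 12, 2, 23]
labels = mask_non_response_tokens(input_ids, instruction_ids=[1], response_ids=[2])
print(labels)
```

In this toy run only the tokens `20, 21, 22` and `23` keep their labels, so only they would contribute to the loss. In the training code, the same roles are played by the `instruction_part` and `response_part` marker strings passed to `train_on_responses_only`.
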
```python
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt, train_on_responses_only

# (Assuming model and tokenizer are already loaded and configured with LoRA adapters)

# Load and preprocess the dataset from Hugging Face. The 'train' split is used for training.
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split = "train")

# Define a formatting function to structure the dataset examples into a chat template.
def formatting_prompts_func(examples):
    convos = examples["messages"]
    # Apply the tokenizer's chat template to format conversations, without adding a generation prompt.
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False)
        for convo in convos
    ]
    return { "text" : texts, }

# Standardize the dataset format using Unsloth's utility and then map the formatting function.
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

# Configure and initialize the SFTTrainer for supervised fine-tuning.
trainer = SFTTrainer(
    model = model,            # The pre-trained model with LoRA adapters
    tokenizer = tokenizer,    # The tokenizer corresponding to the model
    train_dataset = dataset,  # The prepared training dataset
    args = SFTConfig(
        per_device_train_batch_size = 1,  # Number of samples per training device
        gradient_accumulation_steps = 4,  # Accumulate gradients over 4 steps (effective batch size 1 x 4 = 4)
        warmup_steps = 5,                 # Number of steps for learning rate warmup
        max_steps = 30,                   # Maximum number of training steps (set `num_train_epochs` for full run)
        learning_rate = 2e-4,             # Peak learning rate reached after warmup
        logging_steps = 1,                # Log training metrics every step
        optim = "adamw_8bit",             # 8-bit AdamW optimizer for memory efficiency
        weight_decay = 0.001,             # Weight decay applied by AdamW to limit overfitting
        lr_scheduler_type = "linear",     # Linear learning rate decay after warmup
        seed = 3407,                      # Random seed for reproducibility
        output_dir = "outputs",           # Directory to save checkpoints and logs
        report_to = "none",               # Disable reporting to external services like Weights & Biases
    ),
)

# Apply Unsloth's `train_on_responses_only` to mask out instruction tokens.
# This ensures the model only learns from the assistant's desired responses.
gpt_oss_kwargs = dict(
    instruction_part = "<|start|>user<|message|>",
    response_part = "<|start|>assistant<|channel|>final<|message|>",
)
trainer = train_on_responses_only(trainer, **gpt_oss_kwargs)

# Start the training process.
trainer_stats = trainer.train()
```

## Saving and Loading

The fine-tuned LoRA adapters can be saved locally for quick deployment or pushed to the Hugging Face Hub for sharing and version control.

```python
# Save LoRA adapters locally to a specified directory.
# This creates a 'gpt_oss_lora' directory containing the adapter weights.
model.save_pretrained("gpt_oss_lora")

# Push the fine-tuned LoRA adapters to your Hugging Face Hub repository.
# Replace 'YOUR_HF_TOKEN' with your actual Hugging Face write token.
# model.push_to_hub("adityatiwari12/gpt_oss_lora", token = "YOUR_HF_TOKEN")

# To load the model for inference in a new Colab instance or environment:
# Ensure Unsloth is installed and then load the model and tokenizer directly.
# The 'model_name' should be your Hugging Face repository ID or the local directory name.
# from unsloth import FastLanguageModel
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name = "adityatiwari12/gpt_oss_lora",  # Your Hugging Face model repository or local path
#     max_seq_length = 1024,
#     dtype = None,
#     load_in_4bit = True,
# )
```

## Acknowledgements

- [Unsloth AI](https://unsloth.ai/) for their incredible work on optimizing LLM fine-tuning, making advanced models accessible.
- [OpenAI GPT-OSS](https://github.com/openai/harmony) for the base model and its innovative reasoning capabilities and chat template.
- [Hugging Face](https://huggingface.co/) for providing the datasets, libraries, and the platform for sharing models and datasets.