Fine-tuning GPT-OSS 20B with Unsloth for Enhanced Reasoning

This project demonstrates how to fine-tune the OpenAI GPT-OSS 20B model using Unsloth, an optimized library for fast and memory-efficient LLM fine-tuning. The fine-tuning focuses on improving the model's reasoning capabilities across multiple languages using the HuggingFaceH4/Multilingual-Thinking dataset, which contains chain-of-thought examples translated into different languages. This approach aims to equip the model with better analytical and problem-solving skills, particularly in diverse linguistic contexts.

Model Details

  • Base Model: unsloth/gpt-oss-20b - Unsloth's upload of GPT-OSS 20B, the open-weight model of roughly 20 billion parameters developed by OpenAI, offering advanced language understanding and generation capabilities.
  • Fine-tuned with: Unsloth LoRA adapters - Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that significantly reduces the number of trainable parameters, making fine-tuning more accessible and resource-friendly.
  • Dataset: HuggingFaceH4/Multilingual-Thinking - This dataset comprises reasoning chain-of-thought examples derived from user questions, translated from English into four other languages. It's an excellent resource for teaching models to reason in multiple linguistic settings.

Key Features

  • Unsloth Optimization: Leveraging Unsloth's cutting-edge optimizations, this fine-tuning process achieves up to 2x faster training speeds and requires 30% less VRAM compared to standard methods. This efficiency is crucial for working with large models like GPT-OSS 20B on consumer-grade hardware or Colab's free tier.
  • Reasoning Effort Control: A unique feature of GPT-OSS models, the reasoning_effort parameter allows users to explicitly control the trade-off between the model's performance and response speed. You can set it to low for fast, direct answers, medium for a balanced approach, or high for complex tasks requiring extensive multi-step reasoning. This flexibility enables the model to adapt to various application requirements.
  • Multilingual Reasoning: By training on the Multilingual-Thinking dataset, the model has learned to apply reasoning processes in different languages. This enhances its utility for global applications where understanding and responding in various languages is essential.
  • PEFT (LoRA): Parameter-Efficient Fine-Tuning (PEFT) with LoRA means that only a tiny fraction of the model's parameters (around 1%) are updated during training. This drastically reduces computational cost, memory footprint, and storage requirements for the fine-tuned model while still achieving impressive performance gains.
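
The parameter savings that LoRA provides can be illustrated with a little arithmetic. The sketch below counts the parameters LoRA adds to a single weight matrix; the layer size and rank are hypothetical values chosen for illustration, not the configuration used for this model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA freezes the original matrix W and learns two small factors,
    A (rank x d_in) and B (d_out x rank), so the learned update is B @ A.
    """
    return rank * d_in + d_out * rank


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Ratio of LoRA parameters to the frozen matrix's parameters."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 8
print(lora_param_count(d_in, d_out, rank))             # 65536
print(f"{trainable_fraction(d_in, d_out, rank):.4%}")  # 0.3906%
```

The fraction for a whole model depends on which layers receive adapters and on the chosen rank, so the overall figure will differ from this single-matrix example.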

Installation

To reproduce this environment and run the notebook, install Unsloth and its dependencies with the notebook cells below (the leading ! runs each line as a shell command). It is recommended to use a free Tesla T4 GPU instance on Google Colab.

!pip install --upgrade -qqq uv
!uv pip install -qqq     "torch>=2.8.0" "triton>=3.4.0" numpy pillow torchvision bitsandbytes "transformers==4.56.2"     "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo"     "unsloth[base] @ git+https://github.com/unslothai/unsloth"     git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels
!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo

Quick Start: Inference

Here's how to load the fine-tuned model and perform inference, demonstrating the use of the reasoning_effort parameter. This example showcases how to solve a mathematical problem with 'high' reasoning effort in French.

from unsloth import FastLanguageModel
from transformers import TextStreamer

# Load the fine-tuned model from Hugging Face Hub
# The 'max_seq_length' determines the maximum token limit for input and output.
# 'load_in_4bit=True' uses 4-bit quantization for reduced memory usage.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "adityatiwari12/gpt_oss_lora", # Your fine-tuned model repository ID
    max_seq_length = 1024,
    dtype = None, # Auto detects the optimal data type based on your GPU
    load_in_4bit = True,
)

# Define the conversation messages. The 'system' message sets the model's persona and language.
# The 'user' message is the query for the model to solve.
messages = [
    {"role": "system", "content": "reasoning language: French\n\nYou are a helpful assistant that can solve mathematical problems."},
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]

# Apply the chat template and set generation parameters.
# 'add_generation_prompt=True' adds a prompt to guide the model's response.
# 'reasoning_effort' is set to 'high' for a more detailed problem-solving approach.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high", # Choose 'low', 'medium', or 'high' for different reasoning depths
).to("cuda")

# Generate the model's response. 'max_new_tokens' caps the output at 64 tokens, which may
# cut off a long chain of thought; raise it if the answer is truncated.
# 'streamer' prints tokens to stdout as they are generated.
_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))
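
When the output is captured as a string rather than streamed, the final answer can be separated from the reasoning that precedes it. The helper below is a minimal sketch, not part of Unsloth or transformers: `extract_final_answer` is a hypothetical name, the marker is the same final-channel string this project uses for response masking during training, and the sample strings are hand-written stand-ins for decoded output (assumed to be decoded with `skip_special_tokens=False` so the markers survive).

```python
from typing import Optional

# Same final-channel marker that the training code uses as `response_part`.
RESPONSE_MARKER = "<|start|>assistant<|channel|>final<|message|>"


def extract_final_answer(decoded: str) -> Optional[str]:
    """Return the text after the last final-channel marker, or None if absent."""
    _, found, tail = decoded.rpartition(RESPONSE_MARKER)
    if not found:
        return None  # generation stopped before the final answer started
    # Cut at the next special token, assumed here to begin with "<|".
    end = tail.find("<|")
    return (tail if end == -1 else tail[:end]).strip()


# Hand-written samples, not real model output.
sample = (
    "<|start|>assistant<|channel|>analysis<|message|>Réfléchissons.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Réponse finale.<|return|>"
)
truncated = "<|start|>assistant<|channel|>analysis<|message|>Réfléchissons"

print(extract_final_answer(sample))     # Réponse finale.
print(extract_final_answer(truncated))  # None
```

A `None` result is a hint that the token budget ran out during the reasoning phase, which is more likely with high reasoning effort and a small `max_new_tokens`.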

Training

The model was trained on the HuggingFaceH4/Multilingual-Thinking dataset, using LoRA adapters for efficiency and Unsloth's train_on_responses_only helper. This helper masks the prompt tokens so that the loss is computed only on the assistant's responses. The model is therefore optimized to produce the assistant's answers rather than to reproduce the system and user turns.
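
The idea of computing loss only on responses can be shown with a toy example. This is a sketch of the general technique, not Unsloth's implementation: `mask_non_response` is a hypothetical helper and the token ids are made up. It relies on the convention that labels equal to -100 are skipped by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_non_response(token_ids, response_spans):
    """Copy token_ids into labels, keeping only tokens inside response spans.

    response_spans is a list of (start, end) index pairs, end exclusive.
    Every position outside those spans is set to IGNORE_INDEX.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in response_spans:
        labels[start:end] = token_ids[start:end]
    return labels


# Toy sequence: positions 0-4 stand in for the prompt, positions 5-7 for the reply.
token_ids = [11, 12, 13, 14, 15, 21, 22, 23]
labels = mask_non_response(token_ids, [(5, 8)])
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23]
```

In the real pipeline the spans are not given by hand; they are located by searching the tokenized text for the instruction and response marker strings.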

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt, train_on_responses_only

# (Assuming model and tokenizer are already loaded and configured with LoRA adapters)
# Load and preprocess the dataset from Hugging Face. The 'train' split is used for training.
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split = "train")

# Define a formatting function to structure the dataset examples into a chat template.
def formatting_prompts_func(examples):
    convos = examples["messages"]
    # Apply the tokenizer's chat template to format conversations, without adding a generation prompt.
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

# Standardize the dataset format using Unsloth's utility and then map the formatting function.
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

# Configure and initialize the SFTTrainer for supervised fine-tuning.
trainer = SFTTrainer(
    model = model, # The pre-trained model with LoRA adapters
    tokenizer = tokenizer, # The tokenizer corresponding to the model
    train_dataset = dataset, # The prepared training dataset
    args = SFTConfig(
        per_device_train_batch_size = 1, # Number of samples per training device
        gradient_accumulation_steps = 4, # Accumulate gradients over multiple steps to simulate a larger batch size
        warmup_steps = 5, # Number of steps for learning rate warmup
        max_steps = 30, # Maximum number of training steps (set `num_train_epochs` for full run)
        learning_rate = 2e-4, # Initial learning rate for the optimizer
        logging_steps = 1, # Log training metrics every N steps
        optim = "adamw_8bit", # 8-bit AdamW optimizer for memory efficiency
        weight_decay = 0.001, # L2 regularization to prevent overfitting
        lr_scheduler_type = "linear", # Linear learning rate scheduler
        seed = 3407, # Random seed for reproducibility
        output_dir = "outputs", # Directory to save checkpoints and logs
        report_to = "none", # Disable reporting to external services like Weights & Biases
    ),
)

# Apply Unsloth's `train_on_responses_only` to mask out instruction tokens.
# This ensures the model only learns from the assistant's desired responses.
gpt_oss_kwargs = dict(instruction_part = "<|start|>user<|message|>", response_part = "<|start|>assistant<|channel|>final<|message|>")
trainer = train_on_responses_only(trainer, **gpt_oss_kwargs)

# Start the training process.
trainer_stats = trainer.train()
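
The configuration above implies a few numbers that are easy to check by hand. The sketch below assumes a single GPU and the usual shape of a linear schedule (linear warmup from zero, then linear decay to zero); `linear_schedule_lr` is a hypothetical helper written for illustration, not a function from TRL or transformers.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at an optimizer step under linear warmup then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values copied from the SFTConfig above.
per_device_batch, grad_accum, max_steps = 1, 4, 30
effective_batch = per_device_batch * grad_accum  # single-GPU assumption
sequences_seen = effective_batch * max_steps

print(effective_batch, sequences_seen)  # 4 120
for step in (0, 1, 5, 30):
    print(step, linear_schedule_lr(step, 2e-4, 5, 30))
```

With these settings each optimizer step averages gradients over 4 sequences, and the 30-step run processes 120 sequences in total, which is why the comment on max_steps points to num_train_epochs for a full run.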

Saving and Loading

The fine-tuned LoRA adapters can be saved locally for quick deployment or pushed to the Hugging Face Hub for public sharing and version control. This allows for flexible deployment and collaboration.

# Save LoRA adapters locally to a specified directory.
# This creates a 'gpt_oss_lora' directory containing the adapter weights.
model.save_pretrained("gpt_oss_lora")

# Push the fine-tuned LoRA adapters to your Hugging Face Hub repository.
# Replace 'YOUR_HF_TOKEN' with your actual Hugging Face write token.
# model.push_to_hub("adityatiwari12/gpt_oss_lora", token = "YOUR_HF_TOKEN")

# To load the model for inference in a new Colab instance or environment:
# Ensure Unsloth is installed and then load the model and tokenizer directly.
# The 'model_name' should be your Hugging Face repository ID or the local directory name.
# from unsloth import FastLanguageModel
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name = "adityatiwari12/gpt_oss_lora", # Your Hugging Face model repository or local path
#     max_seq_length = 1024,
#     dtype = None,
#     load_in_4bit = True,
# )

Acknowledgements

  • Unsloth AI for their incredible work on optimizing LLM fine-tuning, making advanced models accessible.
  • OpenAI GPT-OSS for the base model and its innovative reasoning capabilities and chat template.
  • Hugging Face for providing the datasets, libraries, and the platform for sharing models and datasets.