Fine-tuning GPT-OSS 20B with Unsloth for Enhanced Reasoning

This project demonstrates how to fine-tune the OpenAI GPT-OSS 20B model using Unsloth, an optimized library for fast and memory-efficient LLM fine-tuning. The fine-tuning focuses on improving the model's reasoning capabilities across multiple languages using the HuggingFaceH4/Multilingual-Thinking dataset, which contains chain-of-thought examples translated into different languages. This approach aims to equip the model with better analytical and problem-solving skills, particularly in diverse linguistic contexts.

Model Details

Base Model: unsloth/gpt-oss-20b - Unsloth's distribution of openai/gpt-oss-20b, a 20 billion parameter open-weight model developed by OpenAI with strong language understanding and generation capabilities.
Fine-tuned with: Unsloth LoRA adapters - Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that significantly reduces the number of trainable parameters, making fine-tuning more accessible and resource-friendly.
Dataset: HuggingFaceH4/Multilingual-Thinking - This dataset comprises reasoning chain-of-thought examples derived from user questions, translated from English into four other languages. It's an excellent resource for teaching models to reason in multiple linguistic settings.

Key Features

Unsloth Optimization: With Unsloth's optimizations, this fine-tuning process is reported to train up to 2x faster and use about 30% less VRAM than standard fine-tuning methods. This efficiency is what makes it practical to work with a model the size of GPT-OSS 20B on consumer-grade hardware or Colab's free tier.
Reasoning Effort Control: A unique feature of GPT-OSS models, the reasoning_effort parameter allows users to explicitly control the trade-off between the model's performance and response speed. You can set it to low for fast, direct answers, medium for a balanced approach, or high for complex tasks requiring extensive multi-step reasoning. This flexibility enables the model to adapt to various application requirements.
Multilingual Reasoning: By training on the Multilingual-Thinking dataset, the model has learned to apply reasoning processes in different languages. This enhances its utility for global applications where understanding and responding in various languages is essential.
PEFT (LoRA): Parameter-Efficient Fine-Tuning (PEFT) with LoRA keeps the base model's weights frozen and trains only small adapter matrices, which amount to roughly 1% of the model's parameter count. This drastically reduces computational cost, memory footprint, and storage requirements for the fine-tuned model while still delivering useful performance gains.
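
To make the LoRA saving concrete, here is a minimal arithmetic sketch. It assumes a single square projection matrix with illustrative dimensions and rank (these are not GPT-OSS 20B's real layer shapes, and the rank is not taken from this project's configuration), and counts the parameters in the two low-rank factors against the frozen dense weight.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix the adapter is attached to."""
    return d_in * d_out


# Illustrative sizes only (assumption): not the model's actual shapes or rank.
d_in, d_out, rank = 4096, 4096, 8

adapter = lora_trainable_params(d_in, d_out, rank)
dense = full_params(d_in, d_out)
fraction = adapter / dense

print(f"dense: {dense:,}  adapter: {adapter:,}  fraction: {fraction:.2%}")
```

The overall trainable fraction for a real model depends on which layers receive adapters and on the chosen rank, so the figure for one matrix will not match the whole-model percentage.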

Installation

To reproduce this environment and run the notebook, install Unsloth and other dependencies. It is recommended to use a free Tesla T4 GPU instance on Google Colab.

!pip install --upgrade -qqq uv
!uv pip install -qqq     "torch>=2.8.0" "triton>=3.4.0" numpy pillow torchvision bitsandbytes "transformers==4.56.2"     "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo"     "unsloth[base] @ git+https://github.com/unslothai/unsloth"     git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels
!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo
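
As a rough sanity check on why 4-bit loading matters for a 16 GB card such as the T4, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a nominal parameter count of 20 billion and ignores activations, LoRA and optimizer state, the KV cache, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Nominal "20B" parameter count (assumption); the exact count may differ.
NUM_PARAMS = 20e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```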

Quick Start: Inference

Here's how to load the fine-tuned model and perform inference, demonstrating the use of the reasoning_effort parameter. This example showcases how to solve a mathematical problem with 'high' reasoning effort in French.

from unsloth import FastLanguageModel
from transformers import TextStreamer

# Load the fine-tuned model from Hugging Face Hub
# The 'max_seq_length' determines the maximum token limit for input and output.
# 'load_in_4bit=True' uses 4-bit quantization for reduced memory usage.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "adityatiwari12/gpt_oss_lora", # Your fine-tuned model repository ID
    max_seq_length = 1024,
    dtype = None, # Auto detects the optimal data type based on your GPU
    load_in_4bit = True,
)

# Define the conversation messages. The 'system' message sets the model's persona and language.
# The 'user' message is the query for the model to solve.
messages = [
    {"role": "system", "content": "reasoning language: French\n\nYou are a helpful assistant that can solve mathematical problems."},
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]

# Apply the chat template and set generation parameters.
# 'add_generation_prompt=True' appends the start of an assistant turn so the model begins its reply.
# 'reasoning_effort' is set to 'high' for a more detailed problem-solving approach.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high", # Choose 'low', 'medium', or 'high' for different reasoning depths
).to("cuda")

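# Generate the model's response. 'max_new_tokens' limits the output length; 64 tokens will likely
# cut off a high-effort reasoning trace, so raise it (within 'max_seq_length') to see a full answer.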
# 'streamer' enables real-time token generation output.
_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))
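
To try other reasoning languages without retyping the system prompt, a small helper along these lines may be convenient. `build_messages` is a hypothetical name, not part of Unsloth or Transformers; it simply reproduces the "reasoning language: ..." system message format used in the example above.

```python
DEFAULT_PERSONA = "You are a helpful assistant that can solve mathematical problems."


def build_messages(question: str, reasoning_language: str,
                   persona: str = DEFAULT_PERSONA) -> list:
    """Build a chat in the same shape as the Quick Start example."""
    # The system message names the reasoning language, then a blank line, then the persona.
    system = f"reasoning language: {reasoning_language}\n\n{persona}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


messages = build_messages("Solve x^5 + 3x^4 - 10 = 3.", "French")
print(messages[0]["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template` exactly as in the Quick Start code.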

Training

The model was trained on the HuggingFaceH4/Multilingual-Thinking dataset using LoRA adapters for efficiency and Unsloth's train_on_responses_only method. This method computes the loss only on the assistant's responses, so training optimizes the tokens the model is expected to generate rather than its ability to reproduce the system and user prompts.

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt, train_on_responses_only

# (Assuming model and tokenizer are already loaded and configured with LoRA adapters)
# Load and preprocess the dataset from Hugging Face. The 'train' split is used for training.
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split = "train")

# Define a formatting function to structure the dataset examples into a chat template.
def formatting_prompts_func(examples):
    convos = examples["messages"]
    # Apply the tokenizer's chat template to format conversations, without adding a generation prompt.
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

# Standardize the dataset format using Unsloth's utility and then map the formatting function.
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

# Configure and initialize the SFTTrainer for supervised fine-tuning.
trainer = SFTTrainer(
    model = model, # The pre-trained model with LoRA adapters
    tokenizer = tokenizer, # The tokenizer corresponding to the model
    train_dataset = dataset, # The prepared training dataset
    args = SFTConfig(
        per_device_train_batch_size = 1, # Number of samples per training device
        gradient_accumulation_steps = 4, # Accumulate gradients over 4 steps (effective batch size 1 x 4 = 4)
        warmup_steps = 5, # Number of steps for learning rate warmup
        max_steps = 30, # Short demo run; for a full run, drop `max_steps` and set `num_train_epochs` instead
        learning_rate = 2e-4, # Peak learning rate reached after warmup
        logging_steps = 1, # Log training metrics every step
        optim = "adamw_8bit", # 8-bit AdamW optimizer for memory efficiency
        weight_decay = 0.001, # Decoupled weight decay (AdamW) to limit overfitting
        lr_scheduler_type = "linear", # Linear learning rate scheduler
        seed = 3407, # Random seed for reproducibility
        output_dir = "outputs", # Directory to save checkpoints and logs
        report_to = "none", # Disable reporting to external services like Weights & Biases
    ),
)

# Apply Unsloth's `train_on_responses_only` to mask out instruction tokens.
# This ensures the model only learns from the assistant's desired responses.
gpt_oss_kwargs = dict(instruction_part = "<|start|>user<|message|>", response_part = "<|start|>assistant<|channel|>final<|message|>")
trainer = train_on_responses_only(trainer, **gpt_oss_kwargs)

# Start the training process.
trainer_stats = trainer.train()
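
The idea behind response-only training can be illustrated with a minimal sketch. This is not Unsloth's implementation: it uses toy integer token ids, single-token markers, and hypothetical helper names, and only shows how labels outside assistant responses are replaced with -100, the value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def find_all(seq, pattern):
    """Return every start index where `pattern` occurs in `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]


def mask_non_response(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens inside assistant responses."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)  # first token after the response marker
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)  # stop at the next user turn, if any
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy ids (assumption): 1 stands for the user marker, 2 for the assistant marker.
input_ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
labels = mask_non_response(input_ids, instruction_ids=[1], response_ids=[2])
print(labels)
```

In the real setup the markers are the multi-token strings passed as `instruction_part` and `response_part`, and the masking is applied to the tokenized dataset before training.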

Saving and Loading

The fine-tuned LoRA adapters can be saved locally for quick deployment or pushed to the Hugging Face Hub for public sharing and version control. This allows for flexible deployment and collaboration.

# Save LoRA adapters locally to a specified directory.
# This creates a 'gpt_oss_lora' directory containing the adapter weights.
model.save_pretrained("gpt_oss_lora")

# Push the fine-tuned LoRA adapters to your Hugging Face Hub repository.
# Replace 'YOUR_HF_TOKEN' with your actual Hugging Face write token.
# model.push_to_hub("adityatiwari12/gpt_oss_lora", token = "YOUR_HF_TOKEN")

# To load the model for inference in a new Colab instance or environment:
# Ensure Unsloth is installed and then load the model and tokenizer directly.
# The 'model_name' should be your Hugging Face repository ID or the local directory name.
# from unsloth import FastLanguageModel
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name = "adityatiwari12/gpt_oss_lora", # Your Hugging Face model repository or local path
#     max_seq_length = 1024,
#     dtype = None,
#     load_in_4bit = True,
# )
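
Rather than pasting a write token into the notebook, one option is to read it from an environment variable. The sketch below assumes the token is stored under `HF_TOKEN`; `get_hf_token` is a hypothetical helper, not part of Unsloth or the Hugging Face libraries.

```python
import os


def get_hf_token(env=os.environ, var_name: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token from the environment, or fail with a clear message."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to a Hugging Face write token."
        )
    return token


# Usage (left commented out, like the push example above):
# model.push_to_hub("adityatiwari12/gpt_oss_lora", token = get_hf_token())
```

This keeps the token out of saved notebooks and version control.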

Acknowledgements

Unsloth AI for their incredible work on optimizing LLM fine-tuning, making advanced models accessible.
OpenAI for the GPT-OSS base model, its reasoning capabilities, and its chat template.
Hugging Face for providing the datasets, libraries, and the platform for sharing models and datasets.

