Built with Axolotl

See axolotl config

axolotl version: 0.13.0.dev0

base_model: mistralai/Mistral-Nemo-Instruct-2407

## Uncomment for 2 GPUs. More than two GPUs require additional settings.
#deepspeed: deepspeed_configs/zero1.json

# Model quantization for qLoRA
bnb_config_kwargs:
  bnb_4bit_compute_dtype: bfloat16
  bnb_4bit_quant_type: nf4
  bnb_4bit_use_double_quant: true

seed: 42 # do not change
val_set_size: 0.01 # Use 1% of the dataset for validation; no pre-split in dataset
## For other datasets, set the ratio based on dataset size: 100k -> 0.01, ..., 100 -> 0.05
datasets:
  - path: TeamPV/distractors-onr-v2
    split: train
    type: chat_template
    conversation: messages  # Your dataset has 'messages' field

chat_template: tokenizer_default # Use model's built-in chat template

eval_sample_packing: false # Only 70b model can handle this
eval_batch_size: 14 # TUNE THIS to achieve ~70+ GB VRAM usage on H100 (often same value as micro_batch_size in pre-trainer config)
evals_per_epoch: 5
# early_stopping_patience: 3


# Tokenization
sequence_len: 3000 # CRITICAL to check
pad_to_sequence_len: true
sample_packing: false # this will make small models go insane.

special_tokens:
  pad_token: "</s>"

# LoRA/DoRA
adapter: lora
lora_r: 32 # 70B will require 128. Memory cost, workarounds exist.
lora_alpha: 64 # 2x r
lora_dropout: 0.05
lora_target_modules: # This is basic full coverage. For LLAMA use unsloth.
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - up_proj
  - down_proj
  - gate_proj
peft_use_dora: false # DoRA trains about 2x slower, but allows dropping r by 4x
output_dir: /model_out/mistral-nemo-12b_sft # change this
use_tensorboard: true

# Training
micro_batch_size: 9 # TUNE THIS to achieve ~70+ GB VRAM usage on H100
gradient_accumulation_steps: 1 # Not worth it under 12B on H100. Mandatory for 70B.
num_epochs: 4 # SFT is 4-5
learning_rate: 0.00005
lr_scheduler: cosine
warmup_ratio: 0.10

# Optimizer
# optimizer: adamw_torch_fused
optimizer: adamw_bnb_8bit
bf16: true
fp16: false
tf32: true # H100 parameter

# Attention
flash_attention: true

# Memory
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# Checkpointing
save_first_step: true
saves_per_epoch: 2
save_total_limit: 10
load_best_model_at_end: true

# Logging
logging_steps: 50

# HuggingFace Hub upload
hub_model_id: TeamPV/mistral-nemo-onr-sft  # ALWAYS CHANGE
hub_strategy: every_save  # Options: end, every_save, checkpoint, all_checkpoints
hf_use_auth_token: true
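
To give a feel for what the LoRA settings in this config imply, the sketch below estimates the number of trainable adapter parameters and the LoRA scaling factor. Only `lora_r`, `lora_alpha`, and the target module names come from the config. The layer shapes (hidden size 5120, MLP width 14336, 32 query heads and 8 key/value heads of dimension 128, 40 layers) are assumptions about the base model's architecture and are not stated in this card, so treat the result as a rough estimate.

```python
# Rough count of trainable LoRA parameters for the config above.
# ASSUMPTION: the shapes below describe Mistral-Nemo-Instruct-2407;
# they are not taken from this model card.
HIDDEN = 5120         # assumed hidden size
INTERMEDIATE = 14336  # assumed MLP width
Q_OUT = 32 * 128      # assumed 32 attention heads x head_dim 128
KV_OUT = 8 * 128      # assumed 8 key/value heads x head_dim 128
NUM_LAYERS = 40       # assumed number of decoder layers

LORA_R = 32           # lora_r from the config
LORA_ALPHA = 64       # lora_alpha from the config

# (in_features, out_features) for every entry in lora_target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
}


def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in) and B (out x r) beside each frozen weight."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_R) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / LORA_R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumed shapes the adapters stay around one percent of a 12B-parameter base model, which is consistent with the config's choice to train adapters on a quantized base rather than full weights.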

mistral-nemo-onr-sft

This model is a fine-tuned version of mistralai/Mistral-Nemo-Instruct-2407 on the TeamPV/distractors-onr-v2 dataset. It achieves the following results on the evaluation set:

  • Loss: 1.0149
  • Memory/max Active (gib): 77.11
  • Memory/max Allocated (gib): 77.11
  • Memory/device Reserved (gib): 77.96

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 9
  • eval_batch_size: 14
  • seed: 42
  • optimizer: Use OptimizerNames.ADAMW_BNB with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 4393
  • training_steps: 43939
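
The reported step counts can be cross-checked against the config with a little arithmetic. The sketch below assumes that warmup steps are the truncated product of `warmup_ratio` and the total step count, and that evaluations are spaced evenly with the interval rounded up; both are assumptions chosen because they reproduce the numbers in this card, not a statement of how the trainer computes them. The `lr_at` helper is a hypothetical illustration of a linear warmup followed by a half-cosine decay, the common shape behind a `cosine` scheduler.

```python
import math

# Values reported in the hyperparameter list above.
TRAINING_STEPS = 43939
WARMUP_STEPS = 4393
PEAK_LR = 5e-05

# Values from the axolotl config.
WARMUP_RATIO = 0.10
NUM_EPOCHS = 4
EVALS_PER_EPOCH = 5

# ASSUMPTION: warmup steps = truncated (ratio * total steps).
derived_warmup = int(WARMUP_RATIO * TRAINING_STEPS)

# ASSUMPTION: evaluations are evenly spaced, interval rounded up.
eval_interval = math.ceil(TRAINING_STEPS / (NUM_EPOCHS * EVALS_PER_EPOCH))


def lr_at(step):
    """Hypothetical helper: linear warmup, then half-cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TRAINING_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print("derived warmup steps:", derived_warmup)
print("derived eval interval:", eval_interval)
for step in (0, WARMUP_STEPS, TRAINING_STEPS // 2, TRAINING_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.3e}")
```

The derived eval interval matches the spacing of the Step column in the training results, where each evaluation lands on a multiple of 2197.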

Training results

| Training Loss | Epoch | Step | Validation Loss | Active (gib) | Allocated (gib) | Reserved (gib) |
|---|---|---|---|---|---|---|
| No log | 0 | 0 | 1.9760 | 76.86 | 76.86 | 77.68 |
| 1.1451 | 0.2 | 2197 | 1.1194 | 77.11 | 77.11 | 77.96 |
| 1.0682 | 0.4 | 4394 | 1.0709 | 77.11 | 77.11 | 77.96 |
| 1.0512 | 0.6 | 6591 | 1.0371 | 77.11 | 77.11 | 77.96 |
| 1.0213 | 0.8 | 8788 | 1.0147 | 77.11 | 77.11 | 77.96 |
| 1.0041 | 1.0 | 10985 | 0.9990 | 77.11 | 77.11 | 77.96 |
| 0.9459 | 1.2 | 13182 | 0.9950 | 77.11 | 77.11 | 77.96 |
| 0.9329 | 1.4 | 15379 | 0.9897 | 77.11 | 77.11 | 77.96 |
| 0.9445 | 1.6 | 17576 | 0.9783 | 77.11 | 77.11 | 77.96 |
| 0.9434 | 1.8 | 19773 | 0.9706 | 77.11 | 77.11 | 77.96 |
| 0.88 | 2.0 | 21970 | 0.9620 | 77.11 | 77.11 | 77.96 |
| 0.8008 | 2.2 | 24167 | 0.9877 | 77.11 | 77.11 | 77.96 |
| 0.7725 | 2.4 | 26364 | 0.9867 | 77.11 | 77.11 | 77.96 |
| 0.781 | 2.6 | 28561 | 0.9801 | 77.11 | 77.11 | 77.96 |
| 0.7722 | 2.8 | 30758 | 0.9785 | 77.11 | 77.11 | 77.96 |
| 0.7704 | 3.0 | 32955 | 0.9736 | 77.11 | 77.11 | 77.96 |
| 0.6672 | 3.2 | 35152 | 1.0137 | 77.11 | 77.11 | 77.96 |
| 0.6657 | 3.4 | 37349 | 1.0155 | 77.11 | 77.11 | 77.96 |
| 0.6744 | 3.6 | 39546 | 1.0152 | 77.11 | 77.11 | 77.96 |
| 0.6398 | 3.8 | 41743 | 1.0149 | 77.11 | 77.11 | 77.96 |
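
The results above can be scanned for the evaluation with the lowest validation loss, which is the quantity a setting such as `load_best_model_at_end: true` would typically select on. The sketch below copies the table into a list and picks that row; it is an illustration only, and it does not claim to show which checkpoint was actually saved or uploaded, since the card does not say.

```python
# (epoch, step, training_loss, validation_loss) copied from the table above.
# The first row has no logged training loss ("No log"), stored as None.
ROWS = [
    (0.0, 0, None, 1.9760),
    (0.2, 2197, 1.1451, 1.1194),
    (0.4, 4394, 1.0682, 1.0709),
    (0.6, 6591, 1.0512, 1.0371),
    (0.8, 8788, 1.0213, 1.0147),
    (1.0, 10985, 1.0041, 0.9990),
    (1.2, 13182, 0.9459, 0.9950),
    (1.4, 15379, 0.9329, 0.9897),
    (1.6, 17576, 0.9445, 0.9783),
    (1.8, 19773, 0.9434, 0.9706),
    (2.0, 21970, 0.88, 0.9620),
    (2.2, 24167, 0.8008, 0.9877),
    (2.4, 26364, 0.7725, 0.9867),
    (2.6, 28561, 0.781, 0.9801),
    (2.8, 30758, 0.7722, 0.9785),
    (3.0, 32955, 0.7704, 0.9736),
    (3.2, 35152, 0.6672, 1.0137),
    (3.4, 37349, 0.6657, 1.0155),
    (3.6, 39546, 0.6744, 1.0152),
    (3.8, 41743, 0.6398, 1.0149),
]

# Row with the lowest validation loss, and the last logged evaluation.
best = min(ROWS, key=lambda row: row[3])
final = ROWS[-1]

print(f"best:  epoch {best[0]}, step {best[1]}, val loss {best[3]:.4f}")
print(f"final: epoch {final[0]}, step {final[1]}, val loss {final[3]:.4f}")
print(f"gap between final and best val loss: {final[3] - best[3]:.4f}")
```

In this table the lowest validation loss falls at epoch 2.0, while the training loss keeps decreasing through epoch 3.8 and the validation loss drifts back above 1.0. That pattern is commonly read as overfitting in the later epochs, so the headline evaluation loss of 1.0149 reflects the last evaluation rather than the best one.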

Framework versions

  • PEFT 0.17.1
  • Transformers 4.57.1
  • Pytorch 2.8.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.1
