SmolVLA UR7e Arrange Block 100epi (10 epochs)

This repository contains a SmolVLA policy checkpoint fine-tuned with LeRobot. The model card is intentionally detailed so the training run can be reproduced or debugged from the uploaded artifact.

Model Details

Policy: SmolVLA
Base checkpoint: lerobot/smolvla_base
Training dataset: CoRL2026-CSI/UR7e-CaP_arrange_block_100epi
Training script: lerobot/scripts/train_smolvla_ur7e.sh
Checkpoint: step 5520, approximately 10.00 epochs
Reported training loss at checkpoint: 0.009
Resolved config: train_config.json

Related checkpoints from the same run:

Dataset

Key	Value
`Robot`	UR7e
`Episodes`	100
`Frames`	141,253
`Tasks`	1
`FPS`	30
`Camera streams`	`observation.images.realsense_wrist`, `observation.images.realsense_topview`
`Dataset state/action shape`	[7] / [7]

Reproduction

The uploaded train_config.json is the authoritative serialized LeRobot config for this checkpoint. The table below mirrors the key values for quick inspection.

Key	Value
`script`	lerobot/scripts/train_smolvla_ur7e.sh
`job_name`	smolvla_ur7e_arrange_block_100epi_bs64_acc4_ep10_20260509_130552
`output_dir`	/home/work/hscho/corl_2026/AutoDataCollector/lerobot/outputs/train/smolvla_ur7e_arrange_block_100epi_bs64_acc4_ep10_20260509_130552
`seed`	1000
`launch`	single-process CUDA training via `python -m lerobot.scripts.lerobot_train`
`checkpoint_step`	5520
`checkpoint_epoch`	10.00
`checkpoint_train_loss`	0.009
`checkpoint_grad_norm`	0.095
`checkpoint_lr`	2.5e-06
`effective_batch`	64 (batch size) x 1 (GPU) x 4 (gradient accumulation steps) = 256
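
The epoch figure can be checked from the dataset size and the effective batch. The sketch below assumes that each of the 141,253 frames counts as one training sample and that one optimizer step consumes one effective batch; the variable and function names are illustrative.

```python
# Values copied from the tables in this card.
frames = 141_253
batch_size = 64
num_gpus = 1
grad_accum_steps = 4
checkpoint_step = 5520
save_freq = 2760

# Samples consumed per optimizer step (assumption: one sample per frame).
effective_batch = batch_size * num_gpus * grad_accum_steps


def epochs_at(step: int) -> float:
    """Number of passes over the dataset after `step` optimizer steps."""
    return step * effective_batch / frames


steps_per_epoch = frames / effective_batch

print(f"effective batch: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch:.1f}")
print(f"epochs at step {checkpoint_step}: {epochs_at(checkpoint_step):.2f}")
print(f"epochs at step {save_freq}: {epochs_at(save_freq):.2f}")
```

Under these assumptions one epoch is roughly 552 optimizer steps, so step 5520 lands at about 10 epochs and the intermediate save at step 2760 at about 5 epochs.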

Approximate script invocation:

cd /home/work/hscho/corl_2026/AutoDataCollector/lerobot
CONDA_ENV="lerobot" \
POLICY_TYPE="smolvla" \
POLICY_PATH="lerobot/smolvla_base" \
DATASET_REPO_ID="CoRL2026-CSI/UR7e-CaP_arrange_block_100epi" \
BATCH_SIZE="64" \
GRADIENT_ACCUMULATION_STEPS="4" \
STEPS="5520" \
NUM_WORKERS="4" \
DATALOADER_PREFETCH_FACTOR="1" \
CUDA_VISIBLE_DEVICES="0" \
NUM_GPUS="1" \
MIXED_PRECISION="bf16" \
SAVE_FREQ="2760" \
LOG_FREQ="10" \
EVAL_FREQ="0" \
WANDB_PROJECT="lerobot-smolvla-ur7e" \
OMP_NUM_THREADS="4" \
MKL_NUM_THREADS="4" \
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" \
bash train_smolvla_ur7e.sh

Detailed Hyperparameters

Script Defaults and Environment

Key	Value
`CONDA_ENV`	lerobot
`POLICY_TYPE`	smolvla
`POLICY_PATH`	lerobot/smolvla_base
`DATASET_REPO_ID`	CoRL2026-CSI/UR7e-CaP_arrange_block_100epi
`BATCH_SIZE`	64
`GRADIENT_ACCUMULATION_STEPS`	4
`STEPS`	5520
`NUM_WORKERS`	4
`DATALOADER_PREFETCH_FACTOR`	1
`CUDA_VISIBLE_DEVICES`	0
`NUM_GPUS`	1
`MIXED_PRECISION`	bf16
`SAVE_FREQ`	2760
`LOG_FREQ`	10
`EVAL_FREQ`	0
`WANDB_PROJECT`	lerobot-smolvla-ur7e
`OMP_NUM_THREADS`	4
`MKL_NUM_THREADS`	4
`PYTORCH_CUDA_ALLOC_CONF`	expandable_segments:True

Training Loop and Dataloader

Key	Value
`steps`	5520
`batch_size`	64
`gradient_accumulation_steps`	4
`num_workers`	4
`dataloader_prefetch_factor`	1
`dataloader_persistent_workers`	False
`dataloader_pin_memory`	True
`save_freq`	2760
`log_freq`	10
`eval_freq`	0
`cudnn_deterministic`	False
`use_policy_training_preset`	True
`ddp_find_unused_parameters`	True
`profile_timing`	False

Dataset Pipeline

Key	Value
`dataset.repo_id`	CoRL2026-CSI/UR7e-CaP_arrange_block_100epi
`dataset.root`	`null`
`dataset.episodes`	`null`
`dataset.revision`	`null`
`dataset.use_imagenet_stats`	True
`dataset.video_backend`	torchcodec
`dataset.streaming`	False

Image augmentation settings:

{
  "enable": true,
  "max_num_transforms": 2,
  "random_order": true,
  "tfs": {
    "brightness": {
      "weight": 1.0,
      "type": "ColorJitter",
      "kwargs": {
        "brightness": [
          0.8,
          1.2
        ]
      }
    },
    "contrast": {
      "weight": 1.0,
      "type": "ColorJitter",
      "kwargs": {
        "contrast": [
          0.8,
          1.2
        ]
      }
    },
    "saturation": {
      "weight": 1.0,
      "type": "ColorJitter",
      "kwargs": {
        "saturation": [
          0.5,
          1.5
        ]
      }
    },
    "hue": {
      "weight": 1.0,
      "type": "ColorJitter",
      "kwargs": {
        "hue": [
          -0.05,
          0.05
        ]
      }
    },
    "sharpness": {
      "weight": 1.0,
      "type": "SharpnessJitter",
      "kwargs": {
        "sharpness": [
          0.5,
          1.5
        ]
      }
    },
    "affine": {
      "weight": 1.0,
      "type": "RandomAffine",
      "kwargs": {
        "degrees": [
          -5.0,
          5.0
        ],
        "translate": [
          0.05,
          0.05
        ]
      }
    }
  }
}
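
One way to read `max_num_transforms`, `random_order`, and the per-transform `weight` values is as a weighted draw of a small subset of transforms for each image. The sketch below illustrates that reading with the standard library only. The function `sample_transform_names` is hypothetical and is not LeRobot's implementation, which may differ in detail.

```python
import random

# Transform names and weights copied from the config above.
WEIGHTS = {
    "brightness": 1.0,
    "contrast": 1.0,
    "saturation": 1.0,
    "hue": 1.0,
    "sharpness": 1.0,
    "affine": 1.0,
}
MAX_NUM_TRANSFORMS = 2
RANDOM_ORDER = True


def sample_transform_names(rng: random.Random) -> list[str]:
    """Draw up to MAX_NUM_TRANSFORMS distinct transforms, weighted, without replacement."""
    remaining = dict(WEIGHTS)
    chosen = []
    for _ in range(min(MAX_NUM_TRANSFORMS, len(remaining))):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    if not RANDOM_ORDER:
        # Assumption: without random_order, transforms run in config order.
        config_order = list(WEIGHTS)
        chosen.sort(key=config_order.index)
    return chosen


rng = random.Random(0)
for _ in range(3):
    print(sample_transform_names(rng))
```

Because all six weights are equal, every transform is equally likely to be drawn under this reading, and no image receives more than two of them.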

Camera rename map:

{
  "observation.images.realsense_wrist": "observation.images.camera1",
  "observation.images.realsense_topview": "observation.images.camera2"
}
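
Observations coming from the robot need the same key renaming before they reach the policy. A minimal sketch, assuming observations arrive as a flat dictionary keyed by the dataset's feature names; `rename_observation_keys` is a hypothetical helper, not a LeRobot function.

```python
# Mapping copied from the camera rename map above.
RENAME_MAP = {
    "observation.images.realsense_wrist": "observation.images.camera1",
    "observation.images.realsense_topview": "observation.images.camera2",
}


def rename_observation_keys(observation: dict) -> dict:
    """Return a copy of `observation` with dataset camera keys renamed."""
    return {RENAME_MAP.get(key, key): value for key, value in observation.items()}


# Placeholder strings stand in for image tensors.
raw = {
    "observation.state": [0.0] * 7,
    "observation.images.realsense_wrist": "<wrist frame>",
    "observation.images.realsense_topview": "<topview frame>",
}
renamed = rename_observation_keys(raw)
print(sorted(renamed))
```

Keys that are not in the map, such as `observation.state`, pass through unchanged.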

Policy Configuration

{
  "type": "smolvla",
  "pretrained_path": "lerobot/smolvla_base",
  "vlm_model_name": "HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
  "load_vlm_weights": true,
  "num_vlm_layers": 16,
  "freeze_vision_encoder": true,
  "train_expert_only": true,
  "train_state_proj": true,
  "use_peft": false,
  "use_amp": false,
  "chunk_size": 50,
  "n_action_steps": 50,
  "num_steps": 10,
  "max_state_dim": 32,
  "max_action_dim": 32,
  "resize_imgs_with_padding": [
    512,
    512
  ],
  "tokenizer_max_length": 48,
  "attention_mode": "cross_attn",
  "pad_language_to": "max_length",
  "use_cache": true,
  "num_expert_layers": 0,
  "expert_width_multiplier": 0.75,
  "self_attn_every_n_layers": 2,
  "min_period": 0.004,
  "max_period": 4.0,
  "compile_model": false,
  "compile_mode": "max-autotune",
  "normalization_mapping": {
    "VISUAL": "IDENTITY",
    "STATE": "MEAN_STD",
    "ACTION": "MEAN_STD"
  },
  "input_features": {
    "observation.state": {
      "type": "STATE",
      "shape": [
        6
      ]
    },
    "observation.images.camera1": {
      "type": "VISUAL",
      "shape": [
        3,
        256,
        256
      ]
    },
    "observation.images.camera2": {
      "type": "VISUAL",
      "shape": [
        3,
        256,
        256
      ]
    },
    "observation.images.camera3": {
      "type": "VISUAL",
      "shape": [
        3,
        256,
        256
      ]
    }
  },
  "output_features": {
    "action": {
      "type": "ACTION",
      "shape": [
        7
      ]
    }
  }
}
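
With `chunk_size` and `n_action_steps` both set to 50, the policy predicts 50 actions per inference call and, under the usual action-chunking reading, executes all of them before predicting again. The arithmetic below assumes the robot is controlled at the dataset's 30 FPS, which this card does not state.

```python
chunk_size = 50        # actions predicted per inference call (from the config)
n_action_steps = 50    # actions executed before the next call (from the config)
control_hz = 30        # assumption: control rate equals the dataset FPS

seconds_per_chunk = n_action_steps / control_hz
discarded_actions = chunk_size - n_action_steps

print(f"open-loop execution per inference call: {seconds_per_chunk:.2f} s")
print(f"predicted actions not executed per chunk: {discarded_actions}")
```

Under that assumption each inference call drives a little under two seconds of motion without new observations.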

Optimizer

{
  "type": "adamw",
  "lr": 0.0001,
  "weight_decay": 1e-10,
  "grad_clip_norm": 10.0,
  "betas": [
    0.9,
    0.95
  ],
  "eps": 1e-08
}

Scheduler

{
  "type": "cosine_decay_with_warmup",
  "num_warmup_steps": 1000,
  "num_decay_steps": 30000,
  "peak_lr": 0.0001,
  "decay_lr": 2.5e-06
}
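
The schedule can be sketched as a linear warmup followed by a cosine decay from `peak_lr` to `decay_lr`. The function below is a generic formulation written for illustration, not LeRobot's code. The reported `checkpoint_lr` of 2.5e-06 equals `decay_lr`, which would be consistent with the warmup and decay lengths having been rescaled to fit the 5520-step run; that rescaling is an assumption here, and train_config.json remains the authoritative record.

```python
import math


def cosine_decay_with_warmup(step, peak_lr, decay_lr, warmup_steps, decay_steps):
    """Generic schedule: linear warmup to peak_lr, then cosine decay to decay_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    span = max(decay_steps - warmup_steps, 1)
    progress = min(max(step - warmup_steps, 0) / span, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return decay_lr + (peak_lr - decay_lr) * cosine


peak_lr, decay_lr = 1e-4, 2.5e-6
total_steps = 5520

# Assumption: the configured 1000 warmup steps and 30000 decay steps are
# scaled down by total_steps / 30000 so the decay finishes with the run.
warmup_steps = 1000 * total_steps // 30000
decay_steps = total_steps

for step in (0, warmup_steps, total_steps // 2, total_steps):
    lr = cosine_decay_with_warmup(step, peak_lr, decay_lr, warmup_steps, decay_steps)
    print(f"step {step:>5}: lr = {lr:.3e}")
```

With the unscaled values (1000 warmup steps, 30000 decay steps) the same function would still be close to `peak_lr` at step 5520, which is why the rescaled reading is used in this sketch.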

Logging

{
  "enable": true,
  "disable_artifact": false,
  "project": "lerobot-smolvla-ur7e",
  "entity": null,
  "notes": null,
  "run_id": "e1h98rll",
  "mode": null
}

Usage

Use this model as a LeRobot policy checkpoint:

python -m lerobot.scripts.lerobot_eval \
  --policy.path=CoRL2026-CSI/smolvla_ur7e_arrange_block_100epi_10ep

For Python loading inside LeRobot code, use the SmolVLA policy loader with this repository id as the pretrained path.
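
A minimal loading sketch follows. The import path is an assumption that depends on the installed LeRobot version, and `load_policy` is a hypothetical wrapper introduced here for illustration.

```python
REPO_ID = "CoRL2026-CSI/smolvla_ur7e_arrange_block_100epi_10ep"


def load_policy(repo_id: str = REPO_ID):
    """Load the fine-tuned SmolVLA policy and switch it to inference mode."""
    # Assumption: this import path matches the installed LeRobot version;
    # older releases placed policies under `lerobot.common.policies`.
    from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

    policy = SmolVLAPolicy.from_pretrained(repo_id)
    policy.eval()
    return policy
```

Calling `load_policy()` requires LeRobot to be installed and fetches the weights from the Hub on first use. Inputs passed to the returned policy must follow the observation schema and camera renaming described in this card.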

Evaluation and Limitations

This model card reports training checkpoint information only. No rollout success rate or task-level evaluation metric is included in this repository.

The checkpoint assumes a compatible observation/action schema and the camera remapping shown above. The optimizer/RNG training_state files are not included; only the loadable pretrained_model artifact is uploaded.

Provenance

VLM backbone: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
Fine-tuning run: smolvla_ur7e_arrange_block_100epi_bs64_acc4_ep10_20260509_130552
Source training script: lerobot/scripts/train_smolvla_ur7e.sh
