CoRL2026-CSI/smolvla_ur7e_arrange_block_100epi_10ep

@@ -8,73 +8,353 @@ tags:
 - smolvla
 - robotics
 - ur7e
 - code-as-policies
 - imitation-learning
 - CoRL2026
 ---
-# SmolVLA UR7e Arrange Block 100epi (10 epochs)
-This repository contains a SmolVLA policy fine-tuned for the UR7e arrange-block task using the LeRobot dataset [`CoRL2026-CSI/UR7e-CaP_arrange_block_100epi`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP_arrange_block_100epi).
-## Model Details
-- **Policy:** SmolVLA
-- **Base checkpoint:** [`lerobot/smolvla_base`](https://huggingface.co/lerobot/smolvla_base)
-- **Training dataset:** [`CoRL2026-CSI/UR7e-CaP_arrange_block_100epi`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP_arrange_block_100epi)
-- **Robot:** UR7e
-- **Checkpoint:** step 5520, approximately 10 epochs
-- **Reported training loss at checkpoint:** 0.009
-## Dataset
-The policy was trained on 100 episodes with 141,253 frames at 30 FPS. The dataset contains two RGB camera streams:
-- `observation.images.realsense_wrist`
-- `observation.images.realsense_topview`
-The action space is 7-dimensional: six UR7e joint positions plus gripper position.
-## Training Configuration
-- **Micro batch size:** 64
-- **Gradient accumulation:** 4
-- **Effective batch size:** 256
-- **Total run length:** 5,520 optimizer steps for the 10 epoch run
-- **Optimizer:** AdamW
-- **Peak learning rate:** 1e-4
-- **Final logged learning rate for this checkpoint:** 2.5e-06
-- **Image augmentation:** enabled, up to 2 transforms per frame
-- **Final logged gradient norm for this checkpoint:** 0.095
-Camera keys were remapped during training:
-```json
 {
   "observation.images.realsense_wrist": "observation.images.camera1",
   "observation.images.realsense_topview": "observation.images.camera2"
 }
 ```
-## Usage
-Use this model as a LeRobot policy checkpoint:
-```bash
-python -m lerobot.scripts.lerobot_eval \
-  --policy.path=CoRL2026-CSI/smolvla_ur7e_arrange_block_100epi_10ep
 ```
-For Python loading inside LeRobot code, use the SmolVLA policy loader with this repository id as the pretrained path.
-## Evaluation and Limitations
-This model card reports training checkpoint information only. No rollout success rate or real-robot evaluation metric is included in this repository.
-The checkpoint is intended for the UR7e arrange-block setup and assumes a compatible observation/action schema, including the camera remapping described above.
-## Provenance
-- Dataset license: Apache-2.0, as declared by the dataset repository.
-- VLM backbone: [`HuggingFaceTB/SmolVLM2-500M-Video-Instruct`](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct).
-- Fine-tuning run: `smolvla_ur7e_arrange_block_100epi_bs64_acc4_ep10_20260509_130552`.

 - smolvla
 - robotics
 - ur7e
+- ur7e
 - code-as-policies
 - imitation-learning
 - CoRL2026
 ---
+    # SmolVLA UR7e Arrange Block 100epi (10 epochs)
+    This repository contains a SmolVLA policy checkpoint fine-tuned with LeRobot. The model card is intentionally detailed so the training run can be reproduced or debugged from the uploaded artifact.
+    ## Model Details
+    - **Policy:** SmolVLA
+    - **Base checkpoint:** [`lerobot/smolvla_base`](https://huggingface.co/lerobot/smolvla_base)
+    - **Training dataset:** [`CoRL2026-CSI/UR7e-CaP_arrange_block_100epi`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP_arrange_block_100epi)
+    - **Training script:** `lerobot/scripts/train_smolvla_ur7e.sh`
+    - **Checkpoint:** step `5520`, approximately `10.00` epochs
+    - **Reported training loss at checkpoint:** `0.009`
+    - **Resolved config:** [`train_config.json`](train_config.json)
+    Related checkpoints from the same run:
+    - [5ep checkpoint](https://huggingface.co/CoRL2026-CSI/smolvla_ur7e_arrange_block_100epi_5ep)
+- [10ep checkpoint](https://huggingface.co/CoRL2026-CSI/smolvla_ur7e_arrange_block_100epi_10ep)
+    ## Dataset
+    | Key | Value |
+|---|---|
+| `Robot` | UR7e |
+| `Episodes` | 100 |
+| `Frames` | 141,253 |
+| `Tasks` | 1 |
+| `FPS` | 30 |
+| `Camera streams` | `observation.images.realsense_wrist`, `observation.images.realsense_topview` |
+| `Dataset state/action shape` | [7] / [7] |
+    ## Reproduction
+    The uploaded [`train_config.json`](train_config.json) is the authoritative serialized LeRobot config for this checkpoint. The table below mirrors the key values for quick inspection.
+    | Key | Value |
+|---|---|
+| `script` | lerobot/scripts/train_smolvla_ur7e.sh |
+| `job_name` | smolvla_ur7e_arrange_block_100epi_bs64_acc4_ep10_20260509_130552 |
+| `output_dir` | /home/work/hscho/corl_2026/AutoDataCollector/lerobot/outputs/train/smolvla_ur7e_arrange_block_100epi_bs64_acc4_ep10_20260509_130552 |
+| `seed` | 1000 |
+| `launch` | single-process CUDA training via `python -m lerobot.scripts.lerobot_train` |
+| `checkpoint_step` | 5520 |
+| `checkpoint_epoch` | 10.00 |
+| `checkpoint_train_loss` | 0.009 |
+| `checkpoint_grad_norm` | 0.095 |
+| `checkpoint_lr` | 2.5e-06 |
+| `effective_batch` | 64 x 1 x 4 = 256 |
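The batch and epoch figures in the table above can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in this card (141,253 frames, micro batch 64, one GPU, 4 accumulation steps, 5,520 steps). It assumes that one optimizer step consumes one effective batch and that an epoch is one pass over all frames; the variable names are illustrative and are not LeRobot identifiers.

```python
# Cross-check of the reported effective batch size and epoch count.
# All numbers come from this model card; the variable names are illustrative.
frames = 141_253          # total frames in the dataset
micro_batch = 64          # BATCH_SIZE
num_gpus = 1              # NUM_GPUS
grad_accum = 4            # GRADIENT_ACCUMULATION_STEPS
optimizer_steps = 5_520   # STEPS

# Samples consumed per optimizer step.
effective_batch = micro_batch * num_gpus * grad_accum

# Optimizer steps needed for one pass over the dataset.
steps_per_epoch = frames / effective_batch

# Number of passes completed at the checkpoint step.
epochs = optimizer_steps * effective_batch / frames

print(f"effective batch: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch:.1f}")
print(f"epochs at step {optimizer_steps}: {epochs:.2f}")
```

Under the same assumptions, the `SAVE_FREQ` value of 2760 is half the run, which lines up with the 5ep checkpoint linked above.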
+    Approximate script invocation:
+    ```bash
+    cd /home/work/hscho/corl_2026/AutoDataCollector/lerobot
+CONDA_ENV="lerobot" \
+  POLICY_TYPE="smolvla" \
+  POLICY_PATH="lerobot/smolvla_base" \
+  DATASET_REPO_ID="CoRL2026-CSI/UR7e-CaP_arrange_block_100epi" \
+  BATCH_SIZE="64" \
+  GRADIENT_ACCUMULATION_STEPS="4" \
+  STEPS="5520" \
+  NUM_WORKERS="4" \
+  DATALOADER_PREFETCH_FACTOR="1" \
+  CUDA_VISIBLE_DEVICES="0" \
+  NUM_GPUS="1" \
+  MIXED_PRECISION="bf16" \
+  SAVE_FREQ="2760" \
+  LOG_FREQ="10" \
+  EVAL_FREQ="0" \
+  WANDB_PROJECT="lerobot-smolvla-ur7e" \
+  OMP_NUM_THREADS="4" \
+  MKL_NUM_THREADS="4" \
+  PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" \
+  bash train_smolvla_ur7e.sh
+    ```
+    ## Detailed Hyperparameters
+    ### Script Defaults and Environment
+    | Key | Value |
+|---|---|
+| `CONDA_ENV` | lerobot |
+| `POLICY_TYPE` | smolvla |
+| `POLICY_PATH` | lerobot/smolvla_base |
+| `DATASET_REPO_ID` | CoRL2026-CSI/UR7e-CaP_arrange_block_100epi |
+| `BATCH_SIZE` | 64 |
+| `GRADIENT_ACCUMULATION_STEPS` | 4 |
+| `STEPS` | 5520 |
+| `NUM_WORKERS` | 4 |
+| `DATALOADER_PREFETCH_FACTOR` | 1 |
+| `CUDA_VISIBLE_DEVICES` | 0 |
+| `NUM_GPUS` | 1 |
+| `MIXED_PRECISION` | bf16 |
+| `SAVE_FREQ` | 2760 |
+| `LOG_FREQ` | 10 |
+| `EVAL_FREQ` | 0 |
+| `WANDB_PROJECT` | lerobot-smolvla-ur7e |
+| `OMP_NUM_THREADS` | 4 |
+| `MKL_NUM_THREADS` | 4 |
+| `PYTORCH_CUDA_ALLOC_CONF` | expandable_segments:True |
+    ### Training Loop and Dataloader
+    | Key | Value |
+|---|---|
+| `steps` | 5520 |
+| `batch_size` | 64 |
+| `gradient_accumulation_steps` | 4 |
+| `num_workers` | 4 |
+| `dataloader_prefetch_factor` | 1 |
+| `dataloader_persistent_workers` | False |
+| `dataloader_pin_memory` | True |
+| `save_freq` | 2760 |
+| `log_freq` | 10 |
+| `eval_freq` | 0 |
+| `cudnn_deterministic` | False |
+| `use_policy_training_preset` | True |
+| `ddp_find_unused_parameters` | True |
+| `profile_timing` | False |
+    ### Dataset Pipeline
+    | Key | Value |
+|---|---|
+| `dataset.repo_id` | CoRL2026-CSI/UR7e-CaP_arrange_block_100epi |
+| `dataset.root` | `null` |
+| `dataset.episodes` | `null` |
+| `dataset.revision` | `null` |
+| `dataset.use_imagenet_stats` | True |
+| `dataset.video_backend` | torchcodec |
+| `dataset.streaming` | False |
+    Image augmentation settings:
+    ```json
+{
+  "enable": true,
+  "max_num_transforms": 2,
+  "random_order": true,
+  "tfs": {
+    "brightness": {
+      "weight": 1.0,
+      "type": "ColorJitter",
+      "kwargs": {
+        "brightness": [
+          0.8,
+          1.2
+        ]
+      }
+    },
+    "contrast": {
+      "weight": 1.0,
+      "type": "ColorJitter",
+      "kwargs": {
+        "contrast": [
+          0.8,
+          1.2
+        ]
+      }
+    },
+    "saturation": {
+      "weight": 1.0,
+      "type": "ColorJitter",
+      "kwargs": {
+        "saturation": [
+          0.5,
+          1.5
+        ]
+      }
+    },
+    "hue": {
+      "weight": 1.0,
+      "type": "ColorJitter",
+      "kwargs": {
+        "hue": [
+          -0.05,
+          0.05
+        ]
+      }
+    },
+    "sharpness": {
+      "weight": 1.0,
+      "type": "SharpnessJitter",
+      "kwargs": {
+        "sharpness": [
+          0.5,
+          1.5
+        ]
+      }
+    },
+    "affine": {
+      "weight": 1.0,
+      "type": "RandomAffine",
+      "kwargs": {
+        "degrees": [
+          -5.0,
+          5.0
+        ],
+        "translate": [
+          0.05,
+          0.05
+        ]
+      }
+    }
+  }
+}
+```
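To make the augmentation settings above more concrete, here is a minimal sketch of how a subset of transforms might be chosen for one frame. It assumes that `min(max_num_transforms, number of transforms)` distinct transforms are drawn per frame with probability proportional to `weight`, and that `random_order` controls whether the drawn order is kept. That reading of the config is an assumption, `sample_transform_names` is a hypothetical helper rather than a LeRobot function, and the sketch only picks names; it does not apply any image operation.

```python
import random

# Weights copied from the augmentation config; all six are 1.0.
WEIGHTS = {
    "brightness": 1.0,
    "contrast": 1.0,
    "saturation": 1.0,
    "hue": 1.0,
    "sharpness": 1.0,
    "affine": 1.0,
}
MAX_NUM_TRANSFORMS = 2
RANDOM_ORDER = True

def sample_transform_names(weights, max_num, random_order, rng):
    """Draw distinct transform names, weighted, without replacement (hypothetical helper)."""
    remaining = dict(weights)
    chosen = []
    for _ in range(min(max_num, len(remaining))):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    if not random_order:
        # Fall back to the order in which transforms are listed in the config.
        listed = list(weights)
        chosen.sort(key=listed.index)
    return chosen

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
picked = sample_transform_names(WEIGHTS, MAX_NUM_TRANSFORMS, RANDOM_ORDER, rng)
print(picked)
```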
+    Camera rename map:
+    ```json
 {
   "observation.images.realsense_wrist": "observation.images.camera1",
   "observation.images.realsense_topview": "observation.images.camera2"
 }
 ```
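As an illustration of what the rename map above implies for downstream code, the following sketch applies it to a single observation dictionary. `rename_observation_keys` is a hypothetical helper written for this note, not part of LeRobot, and the string values are placeholders standing in for image tensors and the state vector.

```python
# Minimal sketch: apply the camera rename map to one observation dictionary.
# `rename_observation_keys` is a hypothetical helper, not a LeRobot function.
RENAME_MAP = {
    "observation.images.realsense_wrist": "observation.images.camera1",
    "observation.images.realsense_topview": "observation.images.camera2",
}

def rename_observation_keys(observation: dict, rename_map: dict) -> dict:
    """Return a copy of `observation` with keys renamed according to `rename_map`."""
    return {rename_map.get(key, key): value for key, value in observation.items()}

# Placeholder strings stand in for image tensors and the state vector.
raw = {
    "observation.images.realsense_wrist": "<wrist image>",
    "observation.images.realsense_topview": "<topview image>",
    "observation.state": "<state vector>",
}
renamed = rename_observation_keys(raw, RENAME_MAP)
print(sorted(renamed))
```

Keys that are absent from the map, such as `observation.state`, pass through unchanged.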
+    ### Policy Configuration
+    ```json
+{
+  "type": "smolvla",
+  "pretrained_path": "lerobot/smolvla_base",
+  "vlm_model_name": "HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
+  "load_vlm_weights": true,
+  "num_vlm_layers": 16,
+  "freeze_vision_encoder": true,
+  "train_expert_only": true,
+  "train_state_proj": true,
+  "use_peft": false,
+  "use_amp": false,
+  "chunk_size": 50,
+  "n_action_steps": 50,
+  "num_steps": 10,
+  "max_state_dim": 32,
+  "max_action_dim": 32,
+  "resize_imgs_with_padding": [
+    512,
+    512
+  ],
+  "tokenizer_max_length": 48,
+  "attention_mode": "cross_attn",
+  "pad_language_to": "max_length",
+  "use_cache": true,
+  "num_expert_layers": 0,
+  "expert_width_multiplier": 0.75,
+  "self_attn_every_n_layers": 2,
+  "min_period": 0.004,
+  "max_period": 4.0,
+  "compile_model": false,
+  "compile_mode": "max-autotune",
+  "normalization_mapping": {
+    "VISUAL": "IDENTITY",
+    "STATE": "MEAN_STD",
+    "ACTION": "MEAN_STD"
+  },
+  "input_features": {
+    "observation.state": {
+      "type": "STATE",
+      "shape": [
+        6
+      ]
+    },
+    "observation.images.camera1": {
+      "type": "VISUAL",
+      "shape": [
+        3,
+        256,
+        256
+      ]
+    },
+    "observation.images.camera2": {
+      "type": "VISUAL",
+      "shape": [
+        3,
+        256,
+        256
+      ]
+    },
+    "observation.images.camera3": {
+      "type": "VISUAL",
+      "shape": [
+        3,
+        256,
+        256
+      ]
+    }
+  },
+  "output_features": {
+    "action": {
+      "type": "ACTION",
+      "shape": [
+        7
+      ]
+    }
+  }
+}
+```
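The policy config above sets `max_state_dim` and `max_action_dim` to 32, while the dataset table lists 7-dimensional state and action vectors. One common way to reconcile such sizes is to zero-pad the vector up to the maximum dimension and slice the prediction back afterwards. The sketch below illustrates that idea with plain Python lists; `pad_to_dim` is a hypothetical helper, the action values are made up, and whether this matches the padding inside the SmolVLA implementation exactly is an assumption.

```python
def pad_to_dim(vector: list, target_dim: int) -> list:
    """Right-pad `vector` with zeros up to `target_dim` (hypothetical helper)."""
    if len(vector) > target_dim:
        raise ValueError(f"vector has {len(vector)} dims, more than {target_dim}")
    return vector + [0.0] * (target_dim - len(vector))

MAX_ACTION_DIM = 32  # max_action_dim from the policy config

# Made-up 7-dimensional action: six joint positions plus gripper position.
action = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 1.0]

padded = pad_to_dim(action, MAX_ACTION_DIM)

# Slicing back to the original size recovers the unpadded action.
recovered = padded[: len(action)]
print(len(padded), recovered == action)
```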
+    ### Optimizer
+    ```json
+{
+  "type": "adamw",
+  "lr": 0.0001,
+  "weight_decay": 1e-10,
+  "grad_clip_norm": 10.0,
+  "betas": [
+    0.9,
+    0.95
+  ],
+  "eps": 1e-08
+}
+```
+    ### Scheduler
+    ```json
+{
+  "type": "cosine_decay_with_warmup",
+  "num_warmup_steps": 1000,
+  "num_decay_steps": 30000,
+  "peak_lr": 0.0001,
+  "decay_lr": 2.5e-06
+}
 ```
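The scheduler config above lists 30,000 decay steps, yet the run lasted 5,520 steps and the learning rate logged at the checkpoint equals `decay_lr` (2.5e-06). One reading that is consistent with these numbers is that the warmup and decay horizons were rescaled to the run length; this is an assumption, not something the card states. The sketch below is a generic linear-warmup plus cosine-decay schedule, not LeRobot's implementation, and compares the two readings at step 5520.

```python
import math

def cosine_with_warmup(step, peak_lr, final_lr, warmup_steps, decay_steps):
    """Generic schedule: linear warmup, then cosine decay indexed by the absolute step."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / (warmup_steps + 1)
    progress = min(step, decay_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine

PEAK_LR, FINAL_LR = 1e-4, 2.5e-6          # peak_lr and decay_lr from the config
WARMUP, DECAY, RUN_STEPS = 1000, 30000, 5520

# Reading 1: use the configured horizons as written.
nominal = cosine_with_warmup(RUN_STEPS, PEAK_LR, FINAL_LR, WARMUP, DECAY)

# Reading 2 (assumption): shrink both horizons by RUN_STEPS / DECAY.
scaled_warmup = WARMUP * RUN_STEPS // DECAY
rescaled = cosine_with_warmup(RUN_STEPS, PEAK_LR, FINAL_LR, scaled_warmup, RUN_STEPS)

print(f"nominal horizons:  {nominal:.3e}")
print(f"rescaled horizons: {rescaled:.3e}")
```

With the configured horizons the learning rate at step 5520 would still be close to the peak, whereas the rescaled variant lands on `decay_lr`, which matches the logged `checkpoint_lr`.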
+    ### Logging
+    ```json
+{
+  "enable": true,
+  "disable_artifact": false,
+  "project": "lerobot-smolvla-ur7e",
+  "entity": null,
+  "notes": null,
+  "run_id": "e1h98rll",
+  "mode": null
+}
+```
+    ## Usage
+    Use this model as a LeRobot policy checkpoint:
+    ```bash
+    python -m lerobot.scripts.lerobot_eval \
+      --policy.path=CoRL2026-CSI/smolvla_ur7e_arrange_block_100epi_10ep
+    ```
+    For Python loading inside LeRobot code, use the SmolVLA policy loader with this repository id as the pretrained path.
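A minimal sketch of the Python loading path mentioned above follows. It assumes LeRobot is installed and that the SmolVLA policy class is importable from one of the two module paths shown; the path has moved between LeRobot releases, so both are assumptions to check against the installed version. `load_policy` is a hypothetical wrapper written for this note.

```python
REPO_ID = "CoRL2026-CSI/smolvla_ur7e_arrange_block_100epi_10ep"

def load_policy(repo_id: str = REPO_ID):
    """Load the checkpoint as a SmolVLA policy in eval mode (requires lerobot)."""
    # Import lazily so this snippet can be imported without LeRobot installed.
    try:
        # Assumed module path for recent LeRobot releases.
        from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
    except ImportError:
        # Assumed module path for older LeRobot releases.
        from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

    policy = SmolVLAPolicy.from_pretrained(repo_id)
    policy.eval()  # disable training-only behavior such as dropout
    return policy
```

Calling `load_policy()` downloads the weights from the Hub on first use. Observations passed to the policy are expected to use the renamed camera keys (`observation.images.camera1`, `observation.images.camera2`); any further preprocessing depends on the LeRobot version in use.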
+    ## Evaluation and Limitations
+    This model card reports training checkpoint information only. No rollout success rate or task-level evaluation metric is included in this repository.
+    The checkpoint assumes a compatible observation/action schema and the camera remapping shown above. The optimizer/RNG `training_state` files are not included; only the loadable `pretrained_model` artifact is uploaded.
+    ## Provenance
+    - VLM backbone: [`HuggingFaceTB/SmolVLM2-500M-Video-Instruct`](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
+    - Fine-tuning run: `smolvla_ur7e_arrange_block_100epi_bs64_acc4_ep10_20260509_130552`
+    - Source training script: `lerobot/scripts/train_smolvla_ur7e.sh`