fireboy-minicpm-v-4-6-vla / docs /minicpm-vla-action-head-scaffold.md
sanjuhs's picture
Duplicate from build-small-hackathon/fireboy-minicpm-v-4-6-vla
5bd41a1
|
Raw
History Blame Contribute Delete
11.9 kB

MiniCPM-V 4.6 VLA Action-Head Scaffold

Backbone Smoke Test

RunPod smoke test passed:

model_id: openbmb/MiniCPM-V-4.6
processor_loaded: MiniCPMV4_6Processor
model_loaded: MiniCPMV4_6ForConditionalGeneration
cuda: true
device_map_ready

This means the MiniCPM-V 4.6 backbone can load on RunPod.

Official model notes:

  • MiniCPM-V 4.6 uses SigLIP2-400M plus a Qwen3.5-0.8B LLM.
  • It supports image/video understanding.
  • Official fine-tuning routes include LLaMA-Factory and ms-swift.

Source:

https://huggingface.co/openbmb/MiniCPM-V-4.6

What We Train First

Do not begin with full end-to-end MiniCPM fine-tuning.

First train:

robot state encoder + continuous action head

Then train:

MiniCPM-V LoRA adapters + action head

The output is an action chunk, not a sentence:

next 0.5 seconds of normalized joint targets

At 20 Hz, 0.5 seconds is 10 action steps.

Current MuJoCo action-chunk checkpoint status:

pick_up chunk:       20/20 on RunPod
go_eat_berry chunk: 20/20 on RunPod
chunk length used:   16 normalized joint-target actions

Artifacts:

Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/learned_chunk/faithful_chunk_pick_up_policy_ep000.mp4
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/learned_chunk/faithful_chunk_go_eat_berry_policy_ep000.mp4

Dataset Row

The VLA row should look like:

{
  "image_path": "/abs/path/to/frame.jpg",
  "instruction": "walk to the yellow target",
  "robot_state": {
    "qpos": [],
    "qvel": [],
    "ctrl": [],
    "previous_action": [],
    "right_hand_pos": [],
    "left_hand_pos": [],
    "mouth_pos": [],
    "ball_pos": [],
    "task_flags": [],
    "stage_flags": []
  },
  "action_type": "normalized_joint_targets",
  "action_chunk_steps": 10,
  "action_chunk": []
}

Manifest Builder

New script:

fireboy-vla-physics/src/build_vla_action_manifest.py

RunPod rollout-builder script:

fireboy-vla-physics/scripts/generate_vla_rollouts_runpod.sh

First generated manifest:

Fireboy-training-policy-vla/vla-rollouts/vla_manifests/fireboy_vla_action_chunks_20260615-021838.jsonl

Summary:

episodes: 64
images: 2368
manifest rows: 2368
chunk steps: 10
tasks: pick_up, go_eat_berry, run_around, go_to_point

First manifest action-head baseline:

checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_vla_manifest_action_head/vla_manifest_action_head.pt
pick_up:       2/8
go_eat_berry: 2/8
run_around:   8/8
go_to_point:  7/8

Meaning:

The VLA JSONL -> action-head path works.
The small mixed image dataset is not enough for reliable contact manipulation.
MiniCPM-V LoRA should not start on this tiny manifest alone.

Manipulation-heavy manifest result:

manifest: Fireboy-training-policy-vla/vla-rollouts/vla_manifests/fireboy_vla_action_chunks_manip-20260615-025016.jsonl
episodes: 144
images: 6192
rows: 6192
tasks: pick_up, go_eat_berry

Focused action-head eval from that manifest:

checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_vla_manifest_action_head_manip/vla_manifest_action_head.pt
pick_up:       12/12
go_eat_berry: 12/12

Meaning:

The VLA action-head path is now reliable for manipulation when the manifest is
large and task-focused enough. The next MiniCPM-V LoRA step should use this
manipulation-heavy manifest plus the locomotion/navigation manifest.

First MiniCPM-V frozen-encoder checkpoint:

checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_action_head_smoke/minicpm_vla_action_head.pt
model: openbmb/MiniCPM-V-4.6
rows: 64
VL embedding dim: 1024
state dim: 27
action chunk: 10 x 32 normalized joint targets
RunPod eval pick_up: 1/1

Proof:

Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_action_head_smoke/minicpm_vla_pick_up_policy_ep000.mp4

Meaning:

image + language + robot state -> action chunk is now proven end-to-end at
smoke scale with MiniCPM-V 4.6 frozen. This is the correct step immediately
before LoRA. LoRA has not been trained yet.

Scaled MiniCPM-V residual-fusion checkpoint:

checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_action_head_residual_2048/minicpm_vla_action_head.pt
model: openbmb/MiniCPM-V-4.6
MiniCPM-V: frozen
rows: 2048
train rows: 1802
val rows: 246
VL embedding dim: 1024
state dim: 27
action chunk: 10 x 32 normalized joint targets
head: state_residual_fusion_v1
vl_residual_scale: 0.12
action_std_floor: 0.01
RunPod GPU: NVIDIA RTX 6000 Ada Generation
RunPod eval pick_up: 3/3
RunPod eval go_eat_berry: 3/3
pod after run: deleted
runpodctl pod list --all -> []

Proof:

Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_action_head_residual_2048/minicpm_vla_pick_up_policy_ep000.mp4
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_action_head_residual_2048/minicpm_vla_go_eat_berry_policy_ep000.mp4
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_action_head_residual_2048/minicpm_vla_pick_up_policy_contact_sheet.jpg
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_action_head_residual_2048/minicpm_vla_go_eat_berry_policy_contact_sheet.jpg

Why the residual head matters:

The 256-row single-tower MiniCPM action head failed closed-loop eval:
pick_up 0/1, go_eat_berry 0/1.

The working version uses a state-dominant controller branch plus a smaller
MiniCPM vision-language residual branch. That preserves reliable robot-state
control while still satisfying the VLA form:

image + language + robot state -> action chunk

First MiniCPM-V LoRA adapter checkpoint:

checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_lora_residual_512/minicpm_vla_lora_action_head.pt
adapter: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_lora_residual_512/lora_adapter
seed checkpoint: fireboy_minicpm_vla_action_head_residual_2048
rows: 512
train rows: 461
val rows: 51
LoRA rank: 8
LoRA alpha: 16
state controller branch: frozen
RunPod eval pick_up: 1/1
RunPod eval go_eat_berry: 1/1
pod after run: deleted
runpodctl pod list --all -> []

LoRA proof:

Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_lora_residual_512/minicpm_vla_pick_up_policy_ep000.mp4
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_lora_residual_512/minicpm_vla_go_eat_berry_policy_ep000.mp4
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_lora_residual_512/minicpm_vla_lora_pick_up_policy_contact_sheet.jpg
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_lora_residual_512/minicpm_vla_lora_go_eat_berry_policy_contact_sheet.jpg

Important boundary:

This is now a real MiniCPM-V LoRA VLA checkpoint for manipulation rollouts.
It is not yet the final generalized pet model because movement commands
run_around/go_to_point still need to be added to the MiniCPM-V LoRA dataset and
closed-loop eval suite.

All-skill frozen-encoder attempt:

checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_action_head_allskill_3072/minicpm_vla_action_head.pt
rows: 3072
pick_up:       2/2
go_eat_berry: 0/2
run_around:   2/2
go_to_point:  0/2

Meaning:

One shared MiniCPM action head is not reliable enough yet. The current safest
pet architecture is:

language command router
  -> manipulation LoRA VLA head for pick_up/go_eat_berry
  -> movement policy/head for run_around/go_to_point

Then train a better unified all-skill LoRA once balancing and navigation data
are improved.

Movement-only MiniCPM action-head attempt:

checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_action_head_movement_992/minicpm_vla_action_head.pt
rows: 992
run_around:  3/3
go_to_point: 1/3

Current practical routing:

pick_up/go_eat_berry -> fireboy_minicpm_vla_lora_residual_512
run_around           -> fireboy_minicpm_vla_action_head_movement_992
go_to_point          -> existing state/action go_to_point policy for now

Usage after generating episodes with images:

python fireboy-vla-physics/src/generate_articulated_dataset.py \
  --task go_to_point \
  --num-episodes 200 \
  --out-dir fireboy-vla-physics/build/datasets/vla_go_to_point_images \
  --seed 12000 \
  --save-images

python fireboy-vla-physics/src/build_vla_action_manifest.py \
  --dataset-dir fireboy-vla-physics/build/datasets/vla_go_to_point_images \
  --out fireboy-vla-physics/build/vla_manifests/go_to_point_action_chunks.jsonl \
  --chunk-steps 10 \
  --stride 2

Use the same pattern for:

run_around
pick_up
go_eat_berry

For pickup/eat, use expert/controller traces first because the learned BC checkpoint is not reliable yet.

First Training Stage

Train a state+language action head:

instruction embedding + robot_state -> action_chunk

Purpose:

  • prove action chunks train better than single-step BC
  • avoid spending MiniCPM LoRA compute before the action representation works

Second Training Stage

Freeze most of MiniCPM-V:

MiniCPM-V(image, instruction) -> vision-language embedding
robot_state_encoder(robot_state) -> state embedding
concat -> action_head -> action_chunk

Train:

state encoder
action head
optionally LoRA adapters

Do not train:

full MiniCPM-V weights

until the action-head smoke test works.

LoRA Adapter Target

Use LoRA only after the action head produces stable closed-loop rollouts.

Recommended:

LoRA rank: 8 or 16
precision: bf16
batching: gradient accumulation
backbone: mostly frozen
GPU: L40S / RTX 6000 Ada / A100

How This Becomes A Pet

The MiniCPM-V VLA should not directly solve every motor detail at first.

It should learn:

image + command + robot_state
  -> skill/action chunk

Working low-level skills:

go_to_point_clock
run_around
pick_up_chunk
go_eat_berry_chunk

Controller/expert skills available for data:

pick_up
go_eat_berry

ProtoMotions/Newton/Kimodo lane:

human-like G1 walk/run/gesture policy
  -> Fire Boy visual costume
  -> richer pet motion prior

Blockers

Current blockers before full VLA:

1. Need larger MiniCPM-V frozen-encoder/action-head training, not just smoke scale.
2. Need MiniCPM-backed `go_eat_berry`, `run_around`, and `go_to_point` evals.
3. Need MiniCPM-V LoRA/action-head training after frozen encoder reliability.
4. Need HF token with gated Llama access or Kimodo text-encoder service for Kimodo generation.
5. Need Fire Boy mesh retargeted onto G1/ProtoMotions body if using the fastest human-like route.

Latest Navigation Result

Direct MiniCPM-V low-level navigation was tested further on RunPod:

absolute 1-step go_to_point: 0/5
root_velocity_v1 go_to_point: 0/5
root_velocity_v1 with recovery data: 0/5

The recovery dataset itself was generated correctly:

episodes: 96
rows/images: 884
root_y range: about -1.84 to +1.77
command_y labels include both signs

So the next VLA design should not keep asking MiniCPM to directly output root joint targets for navigation. The practical pet architecture is now:

MiniCPM-V / command router
  -> LoRA manipulation head for pick_up/go_eat_berry
  -> articulated movement policy for walk_to/go_to_point
  -> articulated movement policy for run_around

Toy V3 bridge/runtime status:

walk to the yellow marker: passes through pet_runtime and src/mujoco_policy_bridge.py
run around: passes through pet_runtime and src/mujoco_policy_bridge.py
MP4/GIF URLs are returned in debug.mujocoPolicy