MiniCPM-V 4.6 VLA Action-Head Scaffold
Backbone Smoke Test
RunPod smoke test passed:
model_id: openbmb/MiniCPM-V-4.6
processor_loaded: MiniCPMV4_6Processor
model_loaded: MiniCPMV4_6ForConditionalGeneration
cuda: true
device_map_ready
This confirms that the MiniCPM-V 4.6 processor and backbone load on RunPod with CUDA available.
Official model notes:
- MiniCPM-V 4.6 uses SigLIP2-400M plus a Qwen3.5-0.8B LLM.
- It supports image/video understanding.
- Official fine-tuning routes include LLaMA-Factory and ms-swift.
Source:
https://huggingface.co/openbmb/MiniCPM-V-4.6
What We Train First
Do not begin with full end-to-end MiniCPM fine-tuning.
First train:
robot state encoder + continuous action head
Then train:
MiniCPM-V LoRA adapters + action head
The output is an action chunk, not a sentence:
next 0.5 seconds of normalized joint targets
At 20 Hz, 0.5 seconds is 10 action steps.
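The step count follows directly from the control rate and the chunk horizon. The helper below is a minimal sketch of that arithmetic; `chunk_steps` is a hypothetical name, not a function from the repository.

```python
def chunk_steps(horizon_s, control_hz):
    """Number of action steps needed to cover horizon_s seconds at control_hz."""
    steps = round(horizon_s * control_hz)
    if steps < 1:
        raise ValueError("horizon is shorter than one control step")
    return steps


print(chunk_steps(0.5, 20))  # 10
```

By the same arithmetic, a 16-step chunk would span 0.8 seconds, assuming the 20 Hz rate also applies to that checkpoint.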
Current MuJoCo action-chunk checkpoint status:
pick_up chunk: 20/20 on RunPod
go_eat_berry chunk: 20/20 on RunPod
chunk length used: 16 normalized joint-target actions
Artifacts:
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/learned_chunk/faithful_chunk_pick_up_policy_ep000.mp4
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/learned_chunk/faithful_chunk_go_eat_berry_policy_ep000.mp4
Dataset Row
The VLA row should look like:
{
  "image_path": "/abs/path/to/frame.jpg",
  "instruction": "walk to the yellow target",
  "robot_state": {
    "qpos": [],
    "qvel": [],
    "ctrl": [],
    "previous_action": [],
    "right_hand_pos": [],
    "left_hand_pos": [],
    "mouth_pos": [],
    "ball_pos": [],
    "task_flags": [],
    "stage_flags": []
  },
  "action_type": "normalized_joint_targets",
  "action_chunk_steps": 10,
  "action_chunk": []
}
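A row in this shape can be checked before training. The validator below is a sketch only: `validate_row` is a hypothetical helper, and it assumes `action_chunk` is a list with one list of joint targets per step (steps x joints), which the row example above does not spell out.

```python
REQUIRED_STATE_KEYS = {
    "qpos", "qvel", "ctrl", "previous_action", "right_hand_pos",
    "left_hand_pos", "mouth_pos", "ball_pos", "task_flags", "stage_flags",
}
REQUIRED_ROW_KEYS = (
    "image_path", "instruction", "robot_state",
    "action_type", "action_chunk_steps", "action_chunk",
)


def validate_row(row):
    """Return a list of problems found in one manifest row (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_ROW_KEYS if k not in row]
    if problems:
        return problems
    missing_state = REQUIRED_STATE_KEYS - set(row["robot_state"])
    if missing_state:
        problems.append(f"robot_state missing: {sorted(missing_state)}")
    if len(row["action_chunk"]) != row["action_chunk_steps"]:
        problems.append("action_chunk length != action_chunk_steps")
    # Assumption: every step holds the same number of joint targets.
    if len({len(step) for step in row["action_chunk"]}) > 1:
        problems.append("action_chunk steps have inconsistent widths")
    return problems


example_row = {
    "image_path": "/abs/path/to/frame.jpg",
    "instruction": "walk to the yellow target",
    "robot_state": {key: [] for key in REQUIRED_STATE_KEYS},
    "action_type": "normalized_joint_targets",
    "action_chunk_steps": 10,
    "action_chunk": [[0.0] * 32 for _ in range(10)],
}
print(validate_row(example_row))  # []
```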
Manifest Builder
New script:
fireboy-vla-physics/src/build_vla_action_manifest.py
RunPod rollout-builder script:
fireboy-vla-physics/scripts/generate_vla_rollouts_runpod.sh
First generated manifest:
Fireboy-training-policy-vla/vla-rollouts/vla_manifests/fireboy_vla_action_chunks_20260615-021838.jsonl
Summary:
episodes: 64
images: 2368
manifest rows: 2368
chunk steps: 10
tasks: pick_up, go_eat_berry, run_around, go_to_point
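Row and image counts like the ones above can be recomputed from the JSONL itself. This is a sketch under the assumption that each line is one row in the Dataset Row shape; `summarize_manifest` is a hypothetical helper, and the episode count is not recoverable from these fields alone.

```python
import json


def summarize_manifest(lines):
    """Count rows, distinct images, chunk lengths, and instructions."""
    rows = 0
    images, chunk_steps, instructions = set(), set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        rows += 1
        images.add(row["image_path"])
        chunk_steps.add(row["action_chunk_steps"])
        instructions.add(row["instruction"])
    return {
        "rows": rows,
        "images": len(images),
        "chunk_steps": sorted(chunk_steps),
        "instructions": sorted(instructions),
    }


# In-memory demo; for a real manifest pass an open file handle instead.
demo = [
    json.dumps({"image_path": f"/tmp/f{i}.jpg", "instruction": "pick up the ball",
                "action_chunk_steps": 10})
    for i in range(3)
]
print(summarize_manifest(demo))
```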
First manifest action-head baseline:
checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_vla_manifest_action_head/vla_manifest_action_head.pt
pick_up: 2/8
go_eat_berry: 2/8
run_around: 8/8
go_to_point: 7/8
Meaning:
The VLA JSONL -> action-head path works.
The small mixed image dataset is not enough for reliable contact manipulation.
MiniCPM-V LoRA should not start on this tiny manifest alone.
Manipulation-heavy manifest result:
manifest: Fireboy-training-policy-vla/vla-rollouts/vla_manifests/fireboy_vla_action_chunks_manip-20260615-025016.jsonl
episodes: 144
images: 6192
rows: 6192
tasks: pick_up, go_eat_berry
Focused action-head eval from that manifest:
checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_vla_manifest_action_head_manip/vla_manifest_action_head.pt
pick_up: 12/12
go_eat_berry: 12/12
Meaning:
The VLA action-head path is now reliable for manipulation when the manifest is
large and task-focused enough. The next MiniCPM-V LoRA step should use this
manipulation-heavy manifest plus the locomotion/navigation manifest.
First MiniCPM-V frozen-encoder checkpoint:
checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_action_head_smoke/minicpm_vla_action_head.pt
model: openbmb/MiniCPM-V-4.6
rows: 64
VL embedding dim: 1024
state dim: 27
action chunk: 10 x 32 normalized joint targets
RunPod eval pick_up: 1/1
Proof:
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_action_head_smoke/minicpm_vla_pick_up_policy_ep000.mp4
Meaning:
image + language + robot state -> action chunk is now proven end-to-end at
smoke scale with MiniCPM-V 4.6 frozen. This is the correct step immediately
before LoRA. At this checkpoint, LoRA had not been trained yet.
Scaled MiniCPM-V residual-fusion checkpoint:
checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_action_head_residual_2048/minicpm_vla_action_head.pt
model: openbmb/MiniCPM-V-4.6
MiniCPM-V: frozen
rows: 2048
train rows: 1802
val rows: 246
VL embedding dim: 1024
state dim: 27
action chunk: 10 x 32 normalized joint targets
head: state_residual_fusion_v1
vl_residual_scale: 0.12
action_std_floor: 0.01
RunPod GPU: NVIDIA RTX 6000 Ada Generation
RunPod eval pick_up: 3/3
RunPod eval go_eat_berry: 3/3
pod after run: deleted
runpodctl pod list --all -> []
Proof:
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_action_head_residual_2048/minicpm_vla_pick_up_policy_ep000.mp4
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_action_head_residual_2048/minicpm_vla_go_eat_berry_policy_ep000.mp4
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_action_head_residual_2048/minicpm_vla_pick_up_policy_contact_sheet.jpg
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_action_head_residual_2048/minicpm_vla_go_eat_berry_policy_contact_sheet.jpg
Why the residual head matters:
The 256-row single-tower MiniCPM action head failed closed-loop eval:
pick_up 0/1, go_eat_berry 0/1.
The working version uses a state-dominant controller branch plus a smaller
MiniCPM vision-language residual branch. That preserves reliable robot-state
control while still satisfying the VLA form:
image + language + robot state -> action chunk
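The idea of a dominant state branch plus a small vision-language residual can be illustrated with plain arithmetic. The sketch below assumes the two branches are combined additively, with the residual multiplied by `vl_residual_scale`; the internals of `state_residual_fusion_v1` are not documented here, so treat `fuse_action_chunk` as a hypothetical stand-in. In a training script the inputs would be tensors rather than nested lists.

```python
VL_RESIDUAL_SCALE = 0.12  # vl_residual_scale from the checkpoint notes


def fuse_action_chunk(state_chunk, vl_chunk, scale=VL_RESIDUAL_SCALE):
    """Add a scaled vision-language residual to the state-branch chunk.

    Both inputs are steps x joints nested lists of the same shape.
    Assumption: the fusion is a plain weighted sum.
    """
    if len(state_chunk) != len(vl_chunk):
        raise ValueError("branches disagree on chunk length")
    fused = []
    for state_step, vl_step in zip(state_chunk, vl_chunk):
        if len(state_step) != len(vl_step):
            raise ValueError("branches disagree on joint count")
        fused.append([s + scale * v for s, v in zip(state_step, vl_step)])
    return fused


# 10 steps x 32 joint targets, matching the action chunk shape above.
state_out = [[0.5] * 32 for _ in range(10)]
vl_out = [[1.0] * 32 for _ in range(10)]
fused = fuse_action_chunk(state_out, vl_out)
print(len(fused), len(fused[0]))  # 10 32
```

With a small scale, the state branch sets most of each joint target and the vision-language branch can only nudge it, which is one way to read "state-dominant".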
First MiniCPM-V LoRA adapter checkpoint:
checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_lora_residual_512/minicpm_vla_lora_action_head.pt
adapter: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_lora_residual_512/lora_adapter
seed checkpoint: fireboy_minicpm_vla_action_head_residual_2048
rows: 512
train rows: 461
val rows: 51
LoRA rank: 8
LoRA alpha: 16
state controller branch: frozen
RunPod eval pick_up: 1/1
RunPod eval go_eat_berry: 1/1
pod after run: deleted
runpodctl pod list --all -> []
LoRA proof:
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_lora_residual_512/minicpm_vla_pick_up_policy_ep000.mp4
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_lora_residual_512/minicpm_vla_go_eat_berry_policy_ep000.mp4
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_lora_residual_512/minicpm_vla_lora_pick_up_policy_contact_sheet.jpg
Fireboy-training-policy-vla/runpod-artifacts/runpod_artifacts/fireboy_minicpm_vla_lora_residual_512/minicpm_vla_lora_go_eat_berry_policy_contact_sheet.jpg
Important boundary:
This is now a real MiniCPM-V LoRA VLA checkpoint for manipulation rollouts.
It is not yet the final generalized pet model because movement commands
run_around/go_to_point still need to be added to the MiniCPM-V LoRA dataset and
closed-loop eval suite.
All-skill frozen-encoder attempt:
checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_action_head_allskill_3072/minicpm_vla_action_head.pt
rows: 3072
pick_up: 2/2
go_eat_berry: 0/2
run_around: 2/2
go_to_point: 0/2
Meaning:
One shared MiniCPM action head is not reliable enough yet. The current safest
pet architecture is:
language command router
-> manipulation LoRA VLA head for pick_up/go_eat_berry
-> movement policy/head for run_around/go_to_point
Then train a better unified all-skill LoRA once balancing and navigation data
are improved.
Movement-only MiniCPM action-head attempt:
checkpoint: fireboy-vla-physics/build/checkpoints/fireboy_minicpm_vla_action_head_movement_992/minicpm_vla_action_head.pt
rows: 992
run_around: 3/3
go_to_point: 1/3
Current practical routing:
pick_up/go_eat_berry -> fireboy_minicpm_vla_lora_residual_512
run_around -> fireboy_minicpm_vla_action_head_movement_992
go_to_point -> existing state/action go_to_point policy for now
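The routing above amounts to a lookup from task name to policy. A minimal sketch follows; `route` is a hypothetical function, and the `go_to_point` value is a placeholder label because the existing state/action policy is not named in these notes.

```python
ROUTES = {
    "pick_up": "fireboy_minicpm_vla_lora_residual_512",
    "go_eat_berry": "fireboy_minicpm_vla_lora_residual_512",
    "run_around": "fireboy_minicpm_vla_action_head_movement_992",
    # Placeholder label: the existing state/action policy is not named here.
    "go_to_point": "existing_state_action_go_to_point_policy",
}


def route(task):
    """Return the policy identifier that should handle a task."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"no policy routed for task: {task!r}") from None


print(route("pick_up"))
```

Failing loudly on an unknown task keeps a new skill from silently falling through to the wrong policy.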
Usage after generating episodes with images:
python fireboy-vla-physics/src/generate_articulated_dataset.py \
    --task go_to_point \
    --num-episodes 200 \
    --out-dir fireboy-vla-physics/build/datasets/vla_go_to_point_images \
    --seed 12000 \
    --save-images
python fireboy-vla-physics/src/build_vla_action_manifest.py \
    --dataset-dir fireboy-vla-physics/build/datasets/vla_go_to_point_images \
    --out fireboy-vla-physics/build/vla_manifests/go_to_point_action_chunks.jsonl \
    --chunk-steps 10 \
    --stride 2
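To make `--chunk-steps` and `--stride` concrete, here is one plausible reading of how an episode is cut into rows: a window of `chunk_steps` actions slides forward `stride` steps at a time. This is an assumption about `build_vla_action_manifest.py`, not its actual code; in particular, the sketch drops windows that would run past the end of the episode, whereas the real script may pad them.

```python
def chunk_windows(actions, chunk_steps=10, stride=2):
    """Yield (start_index, chunk) for every full window in an episode."""
    last_start = len(actions) - chunk_steps
    for start in range(0, last_start + 1, stride):
        yield start, actions[start:start + chunk_steps]


episode = list(range(30))  # stand-in for 30 recorded actions
windows = list(chunk_windows(episode))
print(len(windows))  # 11
print(windows[-1][0])  # 20
```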
Use the same pattern for:
run_around
pick_up
go_eat_berry
For pick_up and go_eat_berry, generate data from expert/controller traces first, because the learned BC checkpoint is not reliable yet.
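The same two commands can be generated for every task. The sketch below only builds and prints the command lines. The flags are the ones shown above, while the per-task directory naming (`vla_<task>_images`, `<task>_action_chunks.jsonl`) and reusing seed 12000 for all tasks are assumptions extrapolated from the go_to_point example.

```python
import shlex

ROOT = "fireboy-vla-physics"
TASKS = ["go_to_point", "run_around", "pick_up", "go_eat_berry"]


def build_commands(task, num_episodes=200, seed=12000, chunk_steps=10, stride=2):
    """Return (generate_cmd, manifest_cmd) argument lists for one task."""
    dataset_dir = f"{ROOT}/build/datasets/vla_{task}_images"  # assumed naming
    manifest = f"{ROOT}/build/vla_manifests/{task}_action_chunks.jsonl"  # assumed
    generate = [
        "python", f"{ROOT}/src/generate_articulated_dataset.py",
        "--task", task,
        "--num-episodes", str(num_episodes),
        "--out-dir", dataset_dir,
        "--seed", str(seed),
        "--save-images",
    ]
    build = [
        "python", f"{ROOT}/src/build_vla_action_manifest.py",
        "--dataset-dir", dataset_dir,
        "--out", manifest,
        "--chunk-steps", str(chunk_steps),
        "--stride", str(stride),
    ]
    return generate, build


for task in TASKS:
    for cmd in build_commands(task):
        print(shlex.join(cmd))
```

To execute rather than print, each argument list can be passed to `subprocess.run(cmd, check=True)`.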
First Training Stage
Train a state+language action head:
instruction embedding + robot_state -> action_chunk
Purpose:
- prove action chunks train better than single-step BC
- avoid spending MiniCPM LoRA compute before the action representation works
Second Training Stage
Freeze most of MiniCPM-V:
MiniCPM-V(image, instruction) -> vision-language embedding
robot_state_encoder(robot_state) -> state embedding
concat -> action_head -> action_chunk
Train:
state encoder
action head
optionally LoRA adapters
Do not train the full MiniCPM-V weights until the action-head smoke test works.
LoRA Adapter Target
Use LoRA only after the action head produces stable closed-loop rollouts.
Recommended:
LoRA rank: 8 or 16
precision: bf16
batching: gradient accumulation
backbone: mostly frozen
GPU: L40S / RTX 6000 Ada / A100
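Two quantities follow from these settings. In standard LoRA the adapter update is scaled by alpha / rank, and gradient accumulation multiplies the per-step batch. The helpers below are a small sketch of that arithmetic; the alpha of 16 comes from the first LoRA checkpoint above, and the micro-batch and accumulation values are hypothetical examples, not settings from this project.

```python
def lora_scale(alpha, rank):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


def effective_batch(micro_batch, accum_steps):
    """Samples contributing to one optimizer step under gradient accumulation."""
    return micro_batch * accum_steps


print(lora_scale(16, 8))   # 2.0
print(lora_scale(16, 16))  # 1.0
print(effective_batch(4, 8))  # 32 (hypothetical micro-batch and accumulation)
```

Raising the rank from 8 to 16 with alpha fixed at 16 halves the scaling factor, so alpha may need to be revisited when the rank changes.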
How This Becomes A Pet
The MiniCPM-V VLA should not directly solve every motor detail at first.
It should learn:
image + command + robot_state
-> skill/action chunk
Working low-level skills:
go_to_point_clock
run_around
pick_up_chunk
go_eat_berry_chunk
Controller/expert skills available for data:
pick_up
go_eat_berry
ProtoMotions/Newton/Kimodo lane:
human-like G1 walk/run/gesture policy
-> Fire Boy visual costume
-> richer pet motion prior
Blockers
Current blockers before full VLA:
1. Need larger MiniCPM-V frozen-encoder/action-head training, not just smoke scale.
2. Need MiniCPM-backed `go_eat_berry`, `run_around`, and `go_to_point` evals.
3. Need MiniCPM-V LoRA/action-head training after frozen encoder reliability.
4. Need HF token with gated Llama access or Kimodo text-encoder service for Kimodo generation.
5. Need Fire Boy mesh retargeted onto G1/ProtoMotions body if using the fastest human-like route.
Latest Navigation Result
Direct MiniCPM-V low-level navigation was tested further on RunPod:
absolute 1-step go_to_point: 0/5
root_velocity_v1 go_to_point: 0/5
root_velocity_v1 with recovery data: 0/5
The recovery dataset itself was generated correctly:
episodes: 96
rows/images: 884
root_y range: about -1.84 to +1.77
command_y labels include both signs
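A check like the one described can be scripted. The sketch below assumes each recovery row exposes `root_y` and `command_y` as plain floats. That flat layout is a guess for illustration, and `label_coverage` is a hypothetical helper.

```python
def label_coverage(rows):
    """Report root_y range and whether command_y covers both signs."""
    root_y = [row["root_y"] for row in rows]
    command_y = [row["command_y"] for row in rows]
    return {
        "root_y_min": min(root_y),
        "root_y_max": max(root_y),
        "has_negative_command_y": any(v < 0 for v in command_y),
        "has_positive_command_y": any(v > 0 for v in command_y),
    }


# Made-up rows for illustration only.
demo_rows = [
    {"root_y": -1.2, "command_y": 0.4},
    {"root_y": 0.3, "command_y": -0.1},
    {"root_y": 1.1, "command_y": -0.5},
]
print(label_coverage(demo_rows))
```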
All three direct variants failed even though the recovery data was sound, so the next VLA design should stop asking MiniCPM-V to output root joint targets directly for navigation. The practical pet architecture is now:
MiniCPM-V / command router
-> LoRA manipulation head for pick_up/go_eat_berry
-> articulated movement policy for walk_to/go_to_point
-> articulated movement policy for run_around
Toy V3 bridge/runtime status:
walk to the yellow marker: passes through pet_runtime and src/mujoco_policy_bridge.py
run around: passes through pet_runtime and src/mujoco_policy_bridge.py
MP4/GIF URLs are returned in debug.mujocoPolicy