docs/minicpm-v-to-vla.md · sanjuhs/fireboy-minicpm-v-4-6-vla at main

	# How MiniCPM-V Becomes A Fire Boy VLA

	## The Target

	The true VLA target is:

	```text
	image + language + robot state -> action
	```

	For Fire Boy:

	```text
	what Fire Boy sees
	+ what the user says
	+ how Fire Boy's body is currently posed
	-> what Fire Boy's body should do next
	```

	This is different from a normal chatbot. A chatbot outputs text. A VLA outputs
	physical actions.

	## What MiniCPM-V Already Gives Us

	MiniCPM-V is useful because it can already process:

	```text
	image + text
	```

	For example:

	```text
	image: Toy Room camera frame
	text: "Fire Boy, pick up the yellow ball"
	```

	MiniCPM-V can help understand:

	```text
	there is a yellow ball in front of Fire Boy
	the user wants Fire Boy to pick it up
	the ball is the target object
	```

	But MiniCPM-V does not naturally output MuJoCo joint commands.

	So we do not just ask it to write a sentence. We attach a new action-producing
	module to it.

	## The VLA Architecture

	The Fire Boy VLA model should look like this:

	```text
	camera image
	|
	v
	MiniCPM-V vision encoder

	user command
	|
	v
	MiniCPM-V language encoder

	Fire Boy robot state
	|
	v
	robot state encoder

	|
	v
	fused vision-language-state features

	|
	v
	continuous action head

	|
	v
	joint targets / action chunk
	```

	In compact form:

	```text
	MiniCPM-V(image, text) + RobotStateEncoder(state) -> ActionHead -> action
	```

	## What Is Robot State?

	Robot state is the body information the model needs so it knows what action is
	physically possible right now.

	For Fire Boy in MuJoCo/Newton, robot state should include:

	```text
	root position
	root rotation
	joint angles
	joint velocities
	hand positions
	foot contacts
	body orientation
	held object state
	nearby object positions
	previous action
	```

	Example:

	```json
	{
	  "root_position": [0.1, 0.0, 0.6],
	  "root_rotation": [0.0, 0.0, 0.1],
	  "joint_angles": [0.03, -0.12, 0.44],
	  "joint_velocities": [0.1, 0.0, -0.2],
	  "left_hand_position": [0.35, 0.18, 0.52],
	  "right_hand_position": [0.34, -0.18, 0.52],
	  "target_object": "yellow_ball",
	  "target_position": [0.8, 0.0, 0.2],
	  "is_holding_object": false
	}
	```

	The image tells the model what the room looks like. The language tells it what
	the user wants. The robot state tells it what Fire Boy's body is currently
	doing.
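A state encoder needs a fixed-length numeric vector rather than a nested dictionary. The sketch below shows one way to flatten the example state above. The field order in `STATE_FIELDS` and the function name `flatten_robot_state` are assumptions made for illustration, not part of the Fire Boy code. The string field `target_object` is skipped here because it is not numeric; a real encoder would need an embedding or an ID for it.

```python
from typing import List

# Hypothetical field order (an assumption); it is fixed so the vector layout
# is identical for every sample.
STATE_FIELDS = [
    "root_position",
    "root_rotation",
    "joint_angles",
    "joint_velocities",
    "left_hand_position",
    "right_hand_position",
    "target_position",
]


def flatten_robot_state(state: dict) -> List[float]:
    """Concatenate the numeric state fields into one flat vector."""
    vector: List[float] = []
    for name in STATE_FIELDS:
        vector.extend(float(value) for value in state[name])
    # Booleans become 0.0 / 1.0 so they fit in the same vector.
    vector.append(1.0 if state["is_holding_object"] else 0.0)
    return vector


example_state = {
    "root_position": [0.1, 0.0, 0.6],
    "root_rotation": [0.0, 0.0, 0.1],
    "joint_angles": [0.03, -0.12, 0.44],
    "joint_velocities": [0.1, 0.0, -0.2],
    "left_hand_position": [0.35, 0.18, 0.52],
    "right_hand_position": [0.34, -0.18, 0.52],
    "target_object": "yellow_ball",
    "target_position": [0.8, 0.0, 0.2],
    "is_holding_object": False,
}

state_vector = flatten_robot_state(example_state)
print(len(state_vector))  # 22: seven 3-element fields plus one flag
```

Keeping the field order in one constant means the training data and the runtime controller cannot silently disagree about which slot holds which value.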

	## What Is The Action?

	The action is not text like:

	```text
	"I will pick up the ball"
	```

	The action is numeric control output.

	Possible action formats:

	```text
	joint target positions
	joint target deltas
	joint torques
	end-effector target deltas
	short action chunks
	```

	For Fire Boy, the safest first version is usually:

	```text
	joint target deltas or joint target positions
	```

	Example output (only the first step of the chunk is shown):

	```json
	{
	  "next_10_steps": [
	    {
	      "shoulder_L_pitch": 0.12,
	      "elbow_L": 0.18,
	      "wrist_L_pitch": -0.04,
	      "shoulder_R_pitch": 0.11,
	      "elbow_R": 0.17,
	      "wrist_R_pitch": -0.05,
	      "hip_L_pitch": 0.03,
	      "hip_R_pitch": -0.03
	    }
	  ]
	}
	```

	This is why we call it:

	```text
	continuous action
	```

	The output numbers are continuous values, not words or categories.
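If the model outputs joint target deltas, something has to turn them into absolute joint targets before they reach the simulator. The sketch below is one minimal way to do that, with clamping so a bad prediction cannot request an impossible pose. The limits in `JOINT_LIMITS` are made-up placeholder values; real limits would come from the Fire Boy physics model.

```python
from typing import Dict, Tuple

# Hypothetical joint limits in radians (placeholders, not Fire Boy's real limits).
JOINT_LIMITS: Dict[str, Tuple[float, float]] = {
    "shoulder_L_pitch": (-1.5, 1.5),
    "elbow_L": (0.0, 2.0),
}


def apply_joint_deltas(
    current: Dict[str, float],
    deltas: Dict[str, float],
    limits: Dict[str, Tuple[float, float]] = JOINT_LIMITS,
) -> Dict[str, float]:
    """Add predicted deltas to current angles and clamp to joint limits."""
    targets = {}
    for joint, angle in current.items():
        low, high = limits[joint]
        target = angle + deltas.get(joint, 0.0)  # missing delta means "hold"
        targets[joint] = min(max(target, low), high)
    return targets


current_angles = {"shoulder_L_pitch": 0.30, "elbow_L": 1.90}
predicted_deltas = {"shoulder_L_pitch": 0.12, "elbow_L": 0.18}

targets = apply_joint_deltas(current_angles, predicted_deltas)
print(targets)  # elbow_L is clamped to its upper limit of 2.0
```

Deltas tend to be a forgiving first choice because a small prediction error produces a small motion error, and the clamp gives a hard safety bound.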

	## What Is An Action Head?

	The action head is a small neural network attached after the MiniCPM-V features.

	MiniCPM-V produces an internal feature vector that represents the image and
	language. The robot state encoder produces another feature vector. We combine
	them, then the action head maps that combined vector to actions.

	Mathematically:

	```text
	z_vl = MiniCPM_V(image, text)
	z_state = StateEncoder(robot_state)
	z = concat(z_vl, z_state)
	action = ActionHead(z)
	```

	Where:

	```text
	z_vl = vision/language features
	z_state = body/proprioception features
	z = combined features
	action = joint commands or action chunk
	```

	The action head can be a simple MLP at first:

	```text
	Linear -> activation -> Linear -> activation -> Linear -> action vector
	```

	Later it can be a diffusion action head or transformer action head, similar in
	spirit to modern VLA systems.
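The concat-then-MLP idea above can be sketched in a few lines of NumPy. This is only an illustration of the data flow, not the project's implementation: the dimensions are arbitrary assumptions, the weights are random and untrained, ReLU is one possible activation, and random vectors stand in for the real MiniCPM-V and state encoder features.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions.
VL_DIM = 64       # size of z_vl (the real size depends on MiniCPM-V)
STATE_DIM = 16    # size of z_state
HIDDEN_DIM = 32
NUM_JOINTS = 8    # matches the 8 joints in the example output
CHUNK_LEN = 10    # actions per chunk


def init_linear(in_dim: int, out_dim: int):
    """Create one Linear layer as a (weight, bias) pair."""
    weight = rng.normal(0.0, 0.02, size=(in_dim, out_dim))
    bias = np.zeros(out_dim)
    return weight, bias


# Linear -> activation -> Linear -> activation -> Linear
layers = [
    init_linear(VL_DIM + STATE_DIM, HIDDEN_DIM),
    init_linear(HIDDEN_DIM, HIDDEN_DIM),
    init_linear(HIDDEN_DIM, CHUNK_LEN * NUM_JOINTS),
]


def action_head(z_vl: np.ndarray, z_state: np.ndarray) -> np.ndarray:
    """Map fused features to an action chunk of shape (CHUNK_LEN, NUM_JOINTS)."""
    z = np.concatenate([z_vl, z_state])
    for index, (weight, bias) in enumerate(layers):
        z = z @ weight + bias
        if index < len(layers) - 1:
            z = np.maximum(z, 0.0)  # ReLU on hidden layers only
    return z.reshape(CHUNK_LEN, NUM_JOINTS)


# Random stand-ins for real encoder outputs.
z_vl = rng.normal(size=VL_DIM)
z_state = rng.normal(size=STATE_DIM)

chunk = action_head(z_vl, z_state)
print(chunk.shape)  # (10, 8)
```

The last layer has no activation so the outputs can be negative as well as positive, which joint deltas need.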

	## What Does "Action Chunk" Mean?

	Instead of predicting only one tiny action for the next physics timestep, the
	model predicts a short sequence:

	```text
	next 0.5 seconds of actions
	```

	For example, if control runs at 20 Hz:

	```text
	10 future actions = 0.5 seconds
	```

	This is useful because body motion is continuous. Fire Boy should not decide
	from scratch at every control step. He should produce a smooth short movement:

	```text
	reach toward ball
	close hands
	lift slightly
	stabilize
	```

	Then the model replans again from the new image and body state.
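The predict, execute, replan cycle can be sketched as a small loop. Everything here is a stand-in: `observe`, `predict_chunk`, and `apply_action` are hypothetical callables that a real system would back with the camera, the VLA model, and the simulator. The stubs below only count steps so the loop can run on its own.

```python
CONTROL_HZ = 20   # control rate from the example above
CHUNK_LEN = 10    # actions per chunk


def chunk_duration_seconds(chunk_len: int, control_hz: float) -> float:
    """Time covered by one chunk, in seconds."""
    return chunk_len / control_hz


def run_receding_horizon(observe, predict_chunk, apply_action, num_replans: int) -> int:
    """Observe, predict a chunk, execute it, then replan. Returns steps executed."""
    steps = 0
    for _ in range(num_replans):
        observation = observe()                # fresh image + body state
        for action in predict_chunk(observation):
            apply_action(action)               # one control step
            steps += 1
    return steps


# Stub environment and policy so the loop runs without a simulator.
executed = []
total_steps = run_receding_horizon(
    observe=lambda: {"step": len(executed)},
    predict_chunk=lambda obs: [[0.0] * 8 for _ in range(CHUNK_LEN)],
    apply_action=executed.append,
    num_replans=3,
)

print(chunk_duration_seconds(CHUNK_LEN, CONTROL_HZ))  # 0.5
print(total_steps)  # 30
```

A common variation is to execute only the first part of each chunk before replanning, which trades extra model calls for faster reaction to surprises.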

	## How Training Works

	To train the VLA, we need examples like:

	```text
	input:
	image
	command
	robot state

	target:
	action that worked
	```

	This is supervised learning over successful behavior, often called behavior
	cloning.

	The dataset row looks like:

	```json
	{
	  "image": "frame_000123.png",
	  "language": "Fire Boy, pick up the yellow ball",
	  "robot_state": {
	    "joint_angles": "...",
	    "joint_velocities": "...",
	    "contacts": "..."
	  },
	  "action": {
	    "joint_target_delta_chunk": "..."
	  }
	}
	```

	The model predicts an action. We compare it to the correct action from the
	dataset.

	Loss:

	```text
	loss = distance(predicted_action, successful_action)
	```

	Usually this is an L1 or L2 loss:

	```text
	L2 loss = mean((predicted_action - target_action)^2)
	```

	Then backpropagation changes the action head, state encoder, and optionally part
	of MiniCPM-V so the next prediction is closer.
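To make the two loss choices concrete, here is a small pure-Python sketch that computes L1 and L2 loss over a tiny action chunk. The numbers are invented for illustration, and a real training loop would use a tensor library with automatic differentiation rather than nested lists.

```python
from typing import List

Chunk = List[List[float]]  # shape: (steps, joints)


def _differences(predicted: Chunk, target: Chunk) -> List[float]:
    """Element-wise predicted - target, flattened over steps and joints."""
    return [
        p - t
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    ]


def l1_loss(predicted: Chunk, target: Chunk) -> float:
    """Mean absolute error."""
    diffs = _differences(predicted, target)
    return sum(abs(d) for d in diffs) / len(diffs)


def l2_loss(predicted: Chunk, target: Chunk) -> float:
    """Mean squared error."""
    diffs = _differences(predicted, target)
    return sum(d * d for d in diffs) / len(diffs)


# Invented example: 2 steps x 2 joints.
predicted_action = [[0.10, 0.20], [0.00, -0.10]]
target_action = [[0.12, 0.18], [0.03, -0.05]]

print(round(l1_loss(predicted_action, target_action), 5))  # 0.03
print(round(l2_loss(predicted_action, target_action), 5))  # 0.00105
```

L2 punishes large errors more heavily, while L1 is less sensitive to occasional outlier labels in the rollout data.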

	## Do We Train All Of MiniCPM-V?

	Not at first.

	The practical staged approach:

	```text
	1. Freeze most of MiniCPM-V.
	2. Train the robot state encoder and action head.
	3. Add LoRA adapters to MiniCPM-V if needed.
	4. Fine-tune only small parts of MiniCPM-V.
	5. Keep full-model fine-tuning as a later expensive option.
	```

	This is better because MiniCPM-V already understands images and text. We mainly
	need to teach it how those features connect to Fire Boy's body actions.
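One way to picture the staged approach is as a rule for which parameters receive gradient updates at each stage. The sketch below selects parameters by name prefix. The stage names, the prefixes, and the example parameter names are all hypothetical; real names depend on how the model and any LoRA adapters are wired together.

```python
from typing import Dict, List, Tuple

# Hypothetical parameter-name prefixes per training stage (assumptions).
STAGE_TRAINABLE_PREFIXES: Dict[str, Tuple[str, ...]] = {
    "heads_only": ("state_encoder.", "action_head."),
    "lora": ("state_encoder.", "action_head.", "minicpm_v.lora."),
    "full": ("state_encoder.", "action_head.", "minicpm_v."),
}


def trainable_parameters(names: List[str], stage: str) -> List[str]:
    """Return the parameter names that should be updated in this stage."""
    prefixes = STAGE_TRAINABLE_PREFIXES[stage]
    # str.startswith accepts a tuple and matches any of its entries.
    return [name for name in names if name.startswith(prefixes)]


# Invented parameter names for illustration.
all_names = [
    "minicpm_v.vision.patch_embed.weight",
    "minicpm_v.lora.q_proj.A",
    "state_encoder.fc1.weight",
    "action_head.out.bias",
]

for stage in ("heads_only", "lora", "full"):
    print(stage, len(trainable_parameters(all_names, stage)))
```

In a PyTorch setup, the same selection would typically be applied by setting `requires_grad` to `False` on every parameter outside the returned list.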

	## Why We Must Fix Fire Boy Physics First

	The VLA needs action labels. Those labels must control the actual Fire Boy body.

	If the physics body is wrong, the dataset is wrong.

	Bad pipeline:

	```text
	wrong MuJoCo body
	-> wrong actions
	-> VLA learns wrong body behavior
	-> Toy Room Fire Boy still looks broken
	```

	Correct pipeline:

	```text
	Fire Boy GLB-matched physics body
	-> successful physics rollouts
	-> image/state/action dataset
	-> MiniCPM-V action fine-tuning
	-> Toy Room Fire Boy performs grounded actions
	```

	So the first real implementation milestone is not VLA training. It is:

	```text
	make Fire Boy's physics body match fire-boy-rig/fire-boy-rigged-full.glb
	```

	## Final Mental Model

	MiniCPM-V gives Fire Boy perception and language understanding.

	The robot state encoder gives Fire Boy body awareness.

	The action head gives Fire Boy physical control.

	Together:

	```text
	MiniCPM-V + robot state encoder + action head = Fire Boy VLA
	```

	But the model can only learn good physical action after the physics body is
	correct.