Modal Inference Results: Fire Boy MiniCPM-V Router

Date: 2026-06-15

Live Endpoint

Modal app: fireboy-vla-router
URL: https://sanjuhs123--fireboy-vla-router.modal.run
GPU: L40S
idle scaledown window: 60 seconds
checkpoint: Fireboy-training-policy-vla/runpod-artifacts/checkpoints/fireboy_minicpm_vla_skill_param_head/minicpm_vla_skill_param_head.pt
model: openbmb/MiniCPM-V-4.6
policy_kind: minicpm_vla_frozen_encoder_skill_param_head_v1

This endpoint serves the promoted frozen MiniCPM-V skill/parameter router with a custom PyTorch action head. It is not served through vLLM: vLLM is built for token generation, whereas the router reads MiniCPM-V hidden states and passes them to a custom continuous head.
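
As a rough illustration of why hidden states matter here, the sketch below shows a skill/parameter head reading encoder features directly instead of decoding tokens. It is a dependency-free toy, not the served PyTorch head: the mean pooling, the single linear layer per output, the two-dimensional features, and all weights are assumptions made for the example. Only the four skill names come from this document.

```python
import math
from typing import List, Sequence, Tuple

# Skill names taken from the verification matrix in this document.
SKILLS = ["walk_to", "run_around", "pick_up", "find_and_eat_berry"]


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token hidden vectors into one feature vector (assumed pooling)."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[i] for tok in hidden_states) / n for i in range(dim)]


def linear(x: Sequence[float], weight, bias) -> List[float]:
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_head(hidden_states, skill_w, skill_b, param_w, param_b) -> Tuple[str, List[float], List[float]]:
    """Return (skill, skill probabilities, continuous params) from frozen-encoder features."""
    pooled = mean_pool(hidden_states)
    probs = softmax(linear(pooled, skill_w, skill_b))   # discrete skill choice
    params = linear(pooled, param_w, param_b)           # continuous regression output
    return SKILLS[probs.index(max(probs))], probs, params


# Toy inputs: two tokens with 2-dimensional hidden states (made-up numbers).
skill, probs, params = action_head(
    [[1.0, 0.0], [1.0, 2.0]],
    skill_w=[[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 0.0]],
    skill_b=[0.0, 0.0, 0.0, 0.0],
    param_w=[[0.5, 0.5], [1.0, -1.0]],
    param_b=[0.0, 0.0],
)
print(skill, params)  # pick_up [1.0, 0.0]
```

Both outputs come from one forward pass over pooled features, so no token sampling loop is involved.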

Local Website Wiring

TOYBOX_VLA_ROUTER_URL=https://sanjuhs123--fireboy-vla-router.modal.run
TOYBOX_VLA_ROUTER_ACTION=1
local app: http://127.0.0.1:65373
policy gallery: http://127.0.0.1:65373/fireboy-policy-gallery

The Toy Room path is:

browser command -> /api/pet-action
  -> Modal /route
  -> MiniCPM-V frozen encoder + skill/parameter head
  -> MuJoCo policy registry dispatch
  -> Toy Room animation/result JSON
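
The path above can be sketched as a small dispatcher. This is a minimal illustration, not the project's code: POLICY_REGISTRY, route_command, and handle_pet_action are hypothetical names, the Modal /route call is replaced by a keyword stub so the example runs offline, and the result fields are assumptions apart from the skill names, animations, and registry:<skill> dispatch labels that appear in this document.

```python
from typing import Callable, Dict

# Hypothetical registry mapping skill names to policy callables.
# Skill names and animations are copied from the verification matrix.
POLICY_REGISTRY: Dict[str, Callable[[dict], dict]] = {
    "walk_to": lambda params: {"animation": "walk"},
    "run_around": lambda params: {"animation": "run"},
    "pick_up": lambda params: {"animation": "hold"},
    "find_and_eat_berry": lambda params: {"animation": "hold"},
}


def route_command(command: str) -> dict:
    """Stand-in for the Modal /route call (keyword stub, no network access)."""
    text = command.lower()
    if "eat" in text:
        skill = "find_and_eat_berry"
    elif "pick" in text:
        skill = "pick_up"
    elif "run" in text:
        skill = "run_around"
    else:
        skill = "walk_to"
    return {"skill": skill, "params": {}}


def handle_pet_action(command: str) -> dict:
    """Route a command, dispatch it through the registry, and build result JSON."""
    routed = route_command(command)
    skill = routed["skill"]
    policy = POLICY_REGISTRY.get(skill)
    if policy is None:
        # Unknown skills fail loudly instead of silently picking a default.
        return {"success": False, "error": f"unknown skill: {skill}"}
    result = policy(routed["params"])
    return {"success": True, "skill": skill, "dispatch": f"registry:{skill}", **result}


print(handle_pet_action("pick up the berry"))
```

Keeping the registry lookup separate from routing means the router only has to name a skill; which MuJoCo policy runs is decided in one place.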

Verification Matrix

All commands below were tested through the local website API on 2026-06-15. The VLA router ran on Modal and reported device: cuda.

walk to the yellow marker
  served skill: walk_to
  dispatch: registry:walk_to
  /api/pet-action: success true
  animation: walk

run around
  served skill: run_around
  dispatch: registry:run_around
  /api/pet-action: success true
  animation: run

pick up the berry
  served skill: pick_up
  dispatch: registry:pick_up
  /api/pet-action: success true
  animation: hold

go find berry and eat it
  served skill: find_and_eat_berry
  dispatch: registry:find_and_eat_berry
  /api/pet-action: success true
  animation: hold
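
The matrix above could be re-checked with a short script along these lines. It is a sketch under stated assumptions: the {"command": ...} request body and the success, skill, and animation response fields are guesses at the /api/pet-action schema, and post_pet_action and check_matrix are hypothetical helper names. The expected values are copied from the matrix.

```python
import json
import urllib.request
from typing import Callable, List

# Expected (served skill, animation) per command, copied from the matrix above.
EXPECTED = {
    "walk to the yellow marker": ("walk_to", "walk"),
    "run around": ("run_around", "run"),
    "pick up the berry": ("pick_up", "hold"),
    "go find berry and eat it": ("find_and_eat_berry", "hold"),
}


def post_pet_action(command: str, base_url: str = "http://127.0.0.1:65373") -> dict:
    """POST one command to /api/pet-action (the request body shape is an assumption)."""
    req = urllib.request.Request(
        f"{base_url}/api/pet-action",
        data=json.dumps({"command": command}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_matrix(send: Callable[[str], dict]) -> List[str]:
    """Return mismatch descriptions; an empty list means every row passed."""
    failures = []
    for command, (skill, animation) in EXPECTED.items():
        result = send(command)
        if result.get("success") is not True:
            failures.append(f"{command!r}: success was {result.get('success')!r}")
        if result.get("skill") != skill:
            failures.append(f"{command!r}: skill {result.get('skill')!r} != {skill!r}")
        if result.get("animation") != animation:
            failures.append(f"{command!r}: animation {result.get('animation')!r} != {animation!r}")
    return failures
```

With the local app running, check_matrix(post_pet_action) would exercise the live path; passing a fake send function instead allows the comparison logic to be tested offline.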

Important Runtime Guard

When the camera frame is blank or generated, the raw neural skill head can become overconfident toward find_and_eat_berry. The live endpoint therefore returns both the raw and the served values:

neural_skill: raw MiniCPM-V head prediction
skill: command/scene-stabilized served skill
raw_params: raw continuous head output
params: scene-grounded served parameters

Serving the stabilized skill and params keeps the demo reliable, while the raw fields keep the neural head's actual output visible. If the browser sends a real camera frame and full robot state, the same endpoint can be called with force_neural_skill: true to inspect the pure neural decision.
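
One way such a guard could work is sketched below. The actual stabilization rules are not described in this document, so the keyword table, the target-snapping step, and the function names stabilize_skill and build_response are all assumptions for illustration. Only the four response field names and the force_neural_skill flag come from the text above.

```python
from typing import List, Optional, Tuple

# Hypothetical keyword hints; the real stabilization rules are not documented here.
COMMAND_HINTS: List[Tuple[str, str]] = [
    ("eat", "find_and_eat_berry"),
    ("pick", "pick_up"),
    ("run", "run_around"),
    ("walk", "walk_to"),
]


def stabilize_skill(command: str, neural_skill: str, force_neural_skill: bool = False) -> str:
    """Return the served skill: the raw prediction if forced, else a command-based override."""
    if force_neural_skill:
        return neural_skill
    text = command.lower()
    for keyword, skill in COMMAND_HINTS:
        if keyword in text:
            return skill
    return neural_skill  # no hint matched, so fall back to the head's prediction


def build_response(
    command: str,
    neural_skill: str,
    raw_params: dict,
    scene_target: Optional[list] = None,
    force_neural_skill: bool = False,
) -> dict:
    """Assemble a response that exposes both raw and served values."""
    params = dict(raw_params)
    if scene_target is not None:
        # Assumed grounding step: snap the target to a known scene object position.
        params["target"] = scene_target
    return {
        "neural_skill": neural_skill,
        "skill": stabilize_skill(command, neural_skill, force_neural_skill),
        "raw_params": dict(raw_params),
        "params": params,
    }


resp = build_response(
    "walk to the yellow marker",
    neural_skill="find_and_eat_berry",  # simulated overconfident head output
    raw_params={"target": [0.9, 0.9]},
    scene_target=[1.0, 2.0],
)
print(resp["neural_skill"], "->", resp["skill"])
```

Returning the raw and served values side by side makes any override visible in the response itself, so a mismatch between neural_skill and skill can be logged or inspected without a second request.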

Proof Screenshot

Fireboy-training-policy-vla/proofs/modal-vla-router-policy-gallery.png

Repeatable Final Smoke Gate

Run this before submission:

PYTHONPATH=fireboy-vla-physics/src \
fireboy-vla-physics/.venv/bin/python \
fireboy-vla-physics/src/final_vla_demo_smoke.py \
  --out Fireboy-training-policy-vla/proofs/final-vla-demo-smoke.json