Can Predicted Dynamics Exist in the Physical World?

Community Article Published May 30, 2026

Predictive Physical AI systems increasingly produce more than a single action. They output state rollouts, action chunks, latent plans, and world-model forecasts that are then consumed by a planner, controller, or downstream policy. A central difficulty is that low prediction error does not guarantee that a particular decoded proposal is physically executable.

This post summarizes a runtime verification layer for predictive Physical AI: a physical admissibility gate. The gate evaluates a proposed rollout or action chunk before execution and asks whether it is compatible with a specified monitored physical envelope. The method does not claim to certify task success or full-system safety. It rejects proposals that violate necessary or calibrated physical admissibility conditions under the monitored state-action envelope.

Why this interface matters

Modern robot learning systems often learn powerful prediction interfaces:

World models forecast future states under candidate actions;
Diffusion and action-chunking policies generate short action sequences;
Vision-language-action (VLA) policies map observations and language into robot commands;
Latent planning systems produce internal future trajectories.

These outputs can be statistically plausible and still fail as candidate dynamics. A rollout may be smooth but inconsistent with its paired actions. An action chunk may remain inside action bounds while requiring impossible state transitions. A direct multi-horizon forecast may disagree with its own recursively composed predictions. The proposal in our work is to place a runtime gate between the learned predictor and the downstream controller.

The monitored setting: LeRobot PushT

The experiments use the Hugging Face LeRobot PushT dataset. PushT is a planar pushing task with synchronized image observations, a two-dimensional monitored state, and two-dimensional continuous actions. The state used here should be interpreted as a monitored coordinate, not a complete physical state of the full system. The evaluation is therefore envelope-relative: it tests whether a proposal is compatible with the empirical monitored-state and action envelope induced by the data.
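To make "envelope-relative" concrete, the following is a minimal sketch of one way an empirical envelope could be built from pooled demonstration data: per-dimension quantile bounds on monitored states and actions. The function names `fit_envelope` and `inside_envelope`, the `coverage` parameter, and the synthetic arrays are assumptions for illustration, not the construction used in the experiments.

```python
import numpy as np

def fit_envelope(states, actions, coverage=0.99):
    """Per-dimension quantile bounds over pooled states (N, ds) and actions (N, da)."""
    tail = (1.0 - coverage) / 2.0
    qs = [tail, 1.0 - tail]
    return {
        "state": np.quantile(states, qs, axis=0),    # rows: lower bound, upper bound
        "action": np.quantile(actions, qs, axis=0),
    }

def inside_envelope(envelope, states, actions):
    """True if every proposed state and action lies within the fitted bounds."""
    s_lo, s_hi = envelope["state"]
    a_lo, a_hi = envelope["action"]
    ok_states = np.all((states >= s_lo) & (states <= s_hi))
    ok_actions = np.all((actions >= a_lo) & (actions <= a_hi))
    return bool(ok_states and ok_actions)

# Synthetic stand-in for pooled demonstration data (not PushT).
rng = np.random.default_rng(0)
envelope = fit_envelope(rng.uniform(0, 1, (1000, 2)), rng.uniform(0, 1, (1000, 2)))

nominal_ok = inside_envelope(envelope, np.full((8, 2), 0.5), np.full((8, 2), 0.5))
outlier_ok = inside_envelope(envelope, np.full((8, 2), 5.0), np.full((8, 2), 0.5))
```

A box envelope like this only bounds each coordinate independently; it says nothing about variables that are not monitored.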

The same episode can be viewed in the monitored state space.

For a nominal episode, the runtime score remains below the rejection threshold.

What the gate evaluates

The gate decomposes a decoded proposal into three interfaces.

Kinematic admissibility checks the state curve itself: recursive reachability and finite-difference growth.
Dynamic consistency checks action-conditioned transitions: whether the paired states and actions agree with a learned one-step dynamics relation.
Predictor-interface consistency checks direct-to-composed horizon agreement: whether a direct forecast agrees with the recursively composed forecast.
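The three interfaces above can be sketched as three normalized residuals. This is an illustrative sketch under stated assumptions: `step_fn` stands in for a learned one-step dynamics model, `max_step` and `tol` are hypothetical normalization constants, and states have shape `(H + 1, d)` while actions have shape `(H, d)`. The reachability part of the kinematic check is omitted.

```python
import numpy as np

def kinematic_residual(states, max_step):
    """Largest one-step displacement of the state curve, relative to an assumed bound."""
    steps = np.linalg.norm(np.diff(states, axis=0), axis=1)
    return float(steps.max() / max_step)

def dynamic_residual(states, actions, step_fn, tol):
    """Largest gap between each proposed next state and the one-step prediction."""
    pred = np.array([step_fn(s, a) for s, a in zip(states[:-1], actions)])
    err = np.linalg.norm(states[1:] - pred, axis=1)
    return float(err.max() / tol)

def interface_residual(direct, start, actions, step_fn, tol):
    """Largest gap between a direct forecast and the recursively composed rollout."""
    s, composed = start, []
    for a in actions:
        s = step_fn(s, a)
        composed.append(s)
    err = np.linalg.norm(direct - np.array(composed), axis=1)
    return float(err.max() / tol)

# Toy one-step model (an assumption, not PushT dynamics).
step_fn = lambda s, a: s + 0.1 * a

actions = np.ones((5, 2))
states = np.vstack([np.zeros(2), 0.1 * np.cumsum(actions, axis=0)])

r_kin = kinematic_residual(states, max_step=0.2)
r_dyn = dynamic_residual(states, actions, step_fn, tol=0.05)
r_int = interface_residual(states[1:], states[0], actions, step_fn, tol=0.05)

# Pairing the same states with unrelated actions breaks dynamic consistency
# while leaving the state curve, and hence the kinematic residual, unchanged.
r_dyn_bad = dynamic_residual(states, -actions, step_fn, tol=0.05)
```

In this toy case the mismatched pairing is invisible to the kinematic residual and visible only to the action-conditioned one.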

The scalar score $S$ is the maximum normalized residual across available conditions. A proposal is rejected when $S>\eta$, where $\eta$ is a certified or calibrated threshold. This maximum reflects a conjunction: if any required condition is violated, the proposal should not be passed downstream under the specified monitored envelope.
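A minimal sketch of this decision rule, assuming each residual has already been normalized so that a value of 1 sits at its own tolerance. The function name `admissibility_gate`, the dictionary keys, and the default `eta=1.0` are illustrative assumptions.

```python
def admissibility_gate(residuals, eta=1.0):
    """Take the max over available normalized residuals and reject when it exceeds eta."""
    available = {name: r for name, r in residuals.items() if r is not None}
    active = max(available, key=available.get)   # condition with the largest residual
    score = available[active]
    return {"score": score, "accept": score <= eta, "active_condition": active}

# A proposal with no direct multi-horizon forecast: the interface check is unavailable.
decision = admissibility_gate({"kinematic": 0.4, "dynamic": 1.7, "interface": None})
```

Returning the name of the maximizing condition is what makes condition-level attribution possible: the same call that rejects a proposal also reports which check drove the rejection.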

World-model baselines

To evaluate the method on predictive dynamics, compact world-model interfaces were trained on PushT:

a five-member state-only one-step ensemble;
a history-conditioned one-step model using recent state-action context;
a direct multi-horizon predictor that outputs the full future state sequence.

These are not introduced as new world-model architectures. They serve as controlled predictive interfaces for testing the admissibility gate.

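The monitored PushT state is only a partial observation of the full system. The history-conditioned model achieves substantially lower rollout RMSE than the state-only and direct multi-horizon models, which is consistent with recent state-action history carrying information that the current monitored state lacks, and which supports the use of history-conditioned dynamics as a diagnostic baseline.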
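The effect of partial observability can be reproduced on a toy system. The sketch below is not PushT and not the models trained in this work: it simulates a monitored position driven by a hidden velocity, then fits two linear one-step predictors by least squares. All names and constants are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
a = rng.normal(size=T)
pos, vel = np.zeros(T + 1), np.zeros(T + 1)
for t in range(T):
    vel[t + 1] = 0.9 * vel[t] + 0.1 * a[t]   # hidden variable, never observed
    pos[t + 1] = pos[t] + vel[t + 1]          # monitored coordinate

t_idx = np.arange(1, T)
ones = np.ones(len(t_idx))
X_state = np.column_stack([pos[t_idx], a[t_idx], ones])                  # state-only
X_hist = np.column_stack([pos[t_idx], pos[t_idx - 1], a[t_idx], ones])   # with history
y = pos[t_idx + 1]

def heldout_rmse(X, y, n_train=1500):
    """Fit a linear one-step model on the first n_train rows, score on the rest."""
    w, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    return float(np.sqrt(np.mean((X[n_train:] @ w - y[n_train:]) ** 2)))

rmse_state = heldout_rmse(X_state, y)
rmse_hist = heldout_rmse(X_hist, y)
```

Here the previous position lets the model recover the hidden velocity as a finite difference, so the history-conditioned fit is essentially exact while the state-only fit is not.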

Dynamic violation detection

The main falsification study compares nominal held-out trajectories to structured dynamic violations. The perturbations include smooth impulses, actuator lag, local time warping, contact-like mode changes, action-state mismatch, and action saturation. The key result is that kinematic checks alone are weak for structured dynamic failures, while action-conditioned residuals are much stronger.
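The post does not spell out how each perturbation is generated, so the following is only one plausible way to construct three of them on an action sequence of shape `(H, d)`. The function names and the parameters `alpha`, `limit`, and `shift` are hypothetical.

```python
import numpy as np

def actuator_lag(actions, alpha=0.5):
    """First-order low-pass filter: the executed action trails the commanded one."""
    out = np.empty_like(actions, dtype=float)
    out[0] = actions[0]
    for t in range(1, len(actions)):
        out[t] = alpha * actions[t] + (1.0 - alpha) * out[t - 1]
    return out

def action_saturation(actions, limit):
    """Clip every action component to [-limit, limit]."""
    return np.clip(actions, -limit, limit)

def action_state_mismatch(actions, shift=1):
    """Circularly shift actions in time so they no longer match their paired states."""
    return np.roll(actions, shift, axis=0)

commanded = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
lagged = actuator_lag(commanded, alpha=0.5)
saturated = action_saturation(3.0 * commanded, limit=2.0)
shifted = action_state_mismatch(commanded, shift=1)
```

All three leave the action sequence inside or near its nominal bounds, which is why bound checks alone would not be expected to flag them.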

In this run:

transition-RMSE residual reaches AUC 0.982 and AP 0.997;
standardized dynamics residual reaches AUC 0.972 and AP 0.995;
kinematic-only monitoring reaches AUC 0.592;
the full admissibility gate reaches AUC 0.957 and AP 0.993.
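For readers less familiar with the two metrics, here is a small NumPy sketch of how AUC and AP can be computed from detector scores on nominal windows and on violations. It is a generic reference implementation on made-up scores, not the evaluation code behind the numbers above, and its tie handling for AP may differ from library implementations.

```python
import numpy as np

def roc_auc(scores_nominal, scores_violation):
    """Probability that a random violation outscores a random nominal window (ties count half)."""
    diff = scores_violation[:, None] - scores_nominal[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def average_precision(scores_nominal, scores_violation):
    """Mean of the precision values measured at the rank of each violation."""
    scores = np.concatenate([scores_nominal, scores_violation])
    labels = np.concatenate([np.zeros(len(scores_nominal)), np.ones(len(scores_violation))])
    hits = labels[np.argsort(-scores, kind="stable")]      # labels sorted by descending score
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision * hits) / hits.sum())

# Made-up scores for illustration only.
nominal = np.array([0.1, 0.2, 0.3])
violation = np.array([0.25, 0.9, 1.0])
auc = roc_auc(nominal, violation)
ap = average_precision(nominal, violation)
```

An AUC near 0.5, as reported for kinematic-only monitoring, means the score ranks violations above nominal windows only slightly more often than chance.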

The full gate is not presented as the best scalar detector. Its value is that it integrates rejection, condition-level attribution, and fallback routing. The score distributions show the separation between nominal windows and structured violations.

Which condition activates?

The ablation study separates the monitor components. It shows that the dynamic residual is the dominant condition for action-conditioned violations, while purely kinematic growth and reachability conditions are less discriminative in this PushT monitored-state setting. The direct-to-composed flow-consistency test evaluates a different failure mode: disagreement between a direct multi-horizon forecast and its recursively composed counterpart.

This condition is not a claim that the trajectory is physically impossible. It is a predictor-interface check: if a forecasting interface claims both direct and recursive predictions, those predictions should agree within tolerance.

The replay experiment evaluates how the gate changes downstream decisions. Each candidate action chunk is either accepted or routed to a fallback nominal chunk. In replay, residual-based filters and the full physical-admissibility gate reject most labeled invalid proposals before replay execution. The full gate rejects 87.7% of labeled invalid proposals with an 8.5% false-intervention rate on nominal chunks, while preserving mean progress near 0.998. The intervention tradeoff highlights the operating point.

The replay results should not be interpreted as demonstrated prevention of physical hardware failures. They show that, in a controlled replay setting, the runtime gate can reject labeled invalid proposals while preserving progress on nominal chunks.
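The accept-or-fallback routing and the two reported rates can be sketched as follows. The names `route` and `intervention_rates`, the string placeholders for chunks, and the example scores are assumptions for illustration; they are not the replay harness used in the experiments.

```python
import numpy as np

def route(chunks, scores, fallback, eta=1.0):
    """Pass each chunk through if its gate score is within eta, else substitute the fallback."""
    return [chunk if s <= eta else fallback for chunk, s in zip(chunks, scores)]

def intervention_rates(scores, is_invalid, eta=1.0):
    """Rejection rate on labeled invalid chunks and false-intervention rate on nominal ones."""
    rejected = scores > eta
    return {
        "invalid_rejection_rate": float(rejected[is_invalid].mean()),
        "false_intervention_rate": float(rejected[~is_invalid].mean()),
    }

# Made-up gate scores: three nominal chunks followed by three labeled invalid chunks.
scores = np.array([0.2, 0.8, 1.3, 2.0, 0.9, 1.1])
is_invalid = np.array([False, False, False, True, True, True])

executed = route(["c0", "c1", "c2", "c3", "c4", "c5"], scores, fallback="nominal")
rates = intervention_rates(scores, is_invalid)
```

The two rates move together as `eta` changes, which is the tradeoff the operating point summarizes.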

Sensitivity to envelope and threshold design

The method depends on the selected monitored envelope and rejection threshold. This is expected: certified envelopes, simulator-derived envelopes, and demonstration-derived envelopes provide different operating regimes. The calibration sweep reports the tradeoff between false rejection on nominal windows and detection of controlled dynamic violations.
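One common way to obtain a calibrated threshold, sketched here as an assumption rather than as the procedure used in this work, is to set $\eta$ at a quantile of gate scores on held-out nominal windows and then sweep the target false-rejection rate. The score distributions below are synthetic.

```python
import numpy as np

def calibrate_eta(nominal_scores, target_false_rejection=0.05):
    """Threshold at the (1 - target) quantile of scores on nominal windows."""
    return float(np.quantile(nominal_scores, 1.0 - target_false_rejection))

def sweep(nominal_scores, violation_scores, targets):
    """For each target, report (target, eta, false-rejection rate, detection rate)."""
    rows = []
    for target in targets:
        eta = calibrate_eta(nominal_scores, target)
        false_rejection = float(np.mean(nominal_scores > eta))
        detection = float(np.mean(violation_scores > eta))
        rows.append((target, eta, false_rejection, detection))
    return rows

# Synthetic score distributions for illustration only.
rng = np.random.default_rng(0)
nominal_scores = rng.uniform(0.0, 1.0, 1000)
violation_scores = rng.uniform(0.5, 2.0, 1000)

rows = sweep(nominal_scores, violation_scores, targets=[0.01, 0.05, 0.10])
```

Allowing more false rejections lowers the threshold and raises detection; a certified envelope would instead fix the threshold from physical limits rather than from data.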

What this does and does not claim

The gate is a runtime guardrail for decoded predictive proposals. It is useful when a learned policy, VLA, planner, or world model emits candidate future dynamics that should be checked before execution.

It does not prove:

task success;
full robot safety;
correctness of the learned model;
feasibility in unmonitored hidden variables;
physical certification beyond the specified envelope.

It does provide:

a model-agnostic interface for checking decoded proposals;
a separation between kinematic, dynamic, and predictor-interface failures;
component-level attribution for rejection;
a practical bridge between predictive robot learning and runtime verification.

Reproducibility notes

The experiments are built around the Hugging Face LeRobot ecosystem and the public lerobot/pusht dataset.

Citation

@misc{or2026physicaladmissibility,
  title        = {Can Predicted Dynamics Exist in the Physical World?},
  author       = {Or, Barak},
  year         = {2026},
  note         = {Preprint}
}


Community

agentlans

6 days ago

I'm no expert, but this reminds me of traditional control theory like PID controllers and Kalman filters. Seems like the real benefit of modern AI is learning and adapting the complex, high-level policies you mentioned in the blog post. For autonomous robots and drones, combining classical control with reinforcement learning seems to be the key.
