Cosmos-3-H-Surgical-Simulator-alpha

Cosmos-3-H-Surgical-Simulator-alpha is a pre-release, action-conditioned fine-tune of nvidia/Cosmos3-Super on surgical robotics sequences from nvidia/PhysicalAI-Robotics-Open-H-Embodiment. It is being released for early comment, evaluation, and integration feedback while training continues.

This checkpoint is intended to explore a narrow but important question: can the Cosmos 3 Super omnimodal world model be adapted into a surgical robotics world simulator that is conditioned not only on a starting endoscopic image, but also on an explicit future robot action trajectory?

We believe this is the first publicly released action-conditioned fine-tune of Cosmos 3 Super on Open-H-Embodiment. It should be treated as an alpha artifact. The current state is useful for research inspection, offline rollouts, and integration prototyping, but it is not a finished product, not a clinical device, not a robot control policy, and not validated for patient care or autonomous surgical use.

Status

Release stage: alpha/pre-release.
Base model: nvidia/Cosmos3-Super.
Fine-tuning data: Open-H-Embodiment surgical robotics branches.
Fine-tuning method: LoRA on Cosmos 3 Super.
Primary task: short-horizon forward dynamics from an initial endoscopic frame plus a 12-step surgical robot action chunk.
Current checkpoint: checkpoints/latest_checkpoint.txt points to iter_000000060.
Checkpoint format: PyTorch Distributed Checkpoint (DCP).
Full inference package: forthcoming. The model card below documents the current conditioning contract and the intended integration path so early evaluators can reason about the artifact before the polished inference code lands.

What this model does

The model takes a first visual observation and an action trajectory, then generates a short future video consistent with the requested action sequence. The intended use case is surgical robotics world modeling: a robot or simulator provides the current endoscopic view and candidate future kinematics, and the model predicts how the scene may evolve under those kinematics.

This release is centered on the forward dynamics setting:

current endoscopic image + action[0:12] + task/context metadata
    -> video frames 0..12 (frame 0 is the conditioning image; frames 1..12 are predicted)

The current alpha was trained and evaluated on short Open-H windows with a 12-action horizon and 13 generated/conditioned video frames. The first frame is the visual condition; the following frames are synthesized under action conditioning. The common evaluation configuration used during early rollouts is 30 FPS, Cosmos image size 256, 16 sampling steps, guidance 3.0, and shift 5.0.

Action conditioning in detail

Cosmos 3 supports action as a native modality. This fine-tune adds a surgical robotics action bridge that maps heterogeneous Open-H action/state layouts into a single action contract before the sequence is passed to Cosmos 3.

The model-facing action contract

The model-facing surgical simulator action is a 44-dimensional vector per timestep, registered under the embodiment name:

open_h_surgical_sim

The training adapter registers this embodiment with a 44D raw action dimension. Cosmos 3 internally pads/embeds action channels according to the model configuration (max_action_dim = 64), but user-provided action files for this alpha should be written as 44D rows.

The first 20 dimensions carry a dVRK-style dual-arm relative pose command:

Index range	Field
0:3	left arm relative translation `(dx, dy, dz)`
3:9	left arm relative rotation in 6D rotation representation
9	left jaw/gripper target
10:13	right arm relative translation `(dx, dy, dz)`
13:19	right arm relative rotation in 6D rotation representation
19	right jaw/gripper target
20:44	reserved bridge channels, currently zero-padded for supported dVRK-style layouts

The 6D rotation representation is the first two columns of the relative rotation matrix, flattened in the convention used by the training adapter. The relative pose is computed as:

T_relative = inverse(T_current) * T_target

for each patient-side manipulator arm. Translations are therefore expressed in the current tool frame, not as arbitrary image-space offsets.
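As a minimal sketch of the contract above, the following NumPy code builds one 44D row from per-arm current and target poses. The function names (`rot6d`, `arm_command`, `bridge_row`) are hypothetical, and the 6D flattening order (first column followed by second column) is an assumption that should be checked against the training adapter before writing real action files.

```python
import numpy as np

BRIDGE_DIM = 44  # model-facing row width from the contract above


def rot6d(rotation: np.ndarray) -> np.ndarray:
    """First two columns of a 3x3 rotation matrix, one after the other.

    ASSUMPTION: this flattening order is illustrative only; the training
    adapter's convention is not specified here and may differ.
    """
    return np.concatenate([rotation[:, 0], rotation[:, 1]])


def arm_command(t_current: np.ndarray, t_target: np.ndarray, jaw: float) -> np.ndarray:
    """10D per-arm command: translation (3), 6D rotation (6), jaw target (1)."""
    # T_relative = inverse(T_current) * T_target, so the translation is
    # expressed in the current tool frame.
    t_relative = np.linalg.inv(t_current) @ t_target
    return np.concatenate([t_relative[:3, 3], rot6d(t_relative[:3, :3]), [jaw]])


def bridge_row(left, right) -> np.ndarray:
    """Build one 44D row; left and right are (t_current, t_target, jaw) tuples."""
    row = np.zeros(BRIDGE_DIM, dtype=np.float32)
    row[0:10] = arm_command(*left)    # left arm occupies indices 0:10
    row[10:20] = arm_command(*right)  # right arm occupies indices 10:20
    return row                        # indices 20:44 stay zero-padded
```

Poses are taken as 4x4 homogeneous transforms. Because the relative transform is left-multiplied by the inverse of the current pose, a target offset along the world y axis shows up on a different axis whenever the tool is rotated, which is the tool-frame behavior described above.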

Supported source layouts during training

Open-H-Embodiment is intentionally heterogeneous: different contributing roots store kinematics with different feature names, action/state keys, and robot embodiments. The training adapter only accepted roots that could be mapped safely into the contract above. Unsupported layouts were skipped explicitly rather than sliced blindly.

Supported conversion paths include:

native 44D surgical simulator bridge actions;
native 20D dVRK dual-arm relative pose actions, padded into the 44D bridge;
16D dual-arm absolute Cartesian pose rows of the form xyz(3), quaternion_xyzw(4), jaw(1) for each arm;
23D variants that include dual-arm Cartesian pose plus ECM pose, where the dual PSM pose is used for the action bridge;
named 14D dual-arm rows such as jaw, ee_x, ee_y, ee_z, roll, pitch, yaw per arm when paired with a supported Cartesian state;
named absolute left/right arm action rows in quaternion or roll/pitch/yaw form, converted to a relative command against the first row of the chunk.

Joint-only 14D rows and other unrecognized layouts were not used for this alpha training path. This is deliberate: a plausible-looking tensor with the wrong semantics is worse than no tensor at all for action-conditioned world modeling.
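The "skip explicitly rather than slice blindly" policy can be illustrated with a small dispatcher. This is a sketch, not the training adapter: `to_bridge` and `UnsupportedLayoutError` are hypothetical names, and only the two native paths (44D and 20D) are handled. The Cartesian and named-row paths need pose conversions that are not shown here.

```python
import numpy as np

BRIDGE_DIM = 44  # open_h_surgical_sim row width
DVRK_DIM = 20    # dVRK dual-arm relative pose prefix


class UnsupportedLayoutError(ValueError):
    """Raised when a layout has no known-safe mapping into the bridge."""


def to_bridge(rows) -> np.ndarray:
    """Map [T, 44] or [T, 20] action rows into [T, 44]; refuse anything else."""
    rows = np.asarray(rows, dtype=np.float32)
    if rows.ndim != 2:
        raise UnsupportedLayoutError(f"expected [T, D] rows, got shape {rows.shape}")
    width = rows.shape[1]
    if width == BRIDGE_DIM:
        return rows.copy()
    if width == DVRK_DIM:
        padded = np.zeros((rows.shape[0], BRIDGE_DIM), dtype=np.float32)
        padded[:, :DVRK_DIM] = rows  # channels 20:44 remain zero
        return padded
    # Width alone cannot distinguish, for example, joint-only 14D rows from
    # named Cartesian 14D rows, so refuse instead of slicing or padding.
    raise UnsupportedLayoutError(f"no safe mapping for {width}D rows")
```

Raising on unknown widths keeps a mislabeled tensor from silently entering a rollout, at the cost of requiring an explicit converter for every additional layout.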

Action JSON shape

For inference, an action file should contain 12 rows of 44 floating point values:

[
  [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
]

The snippet above is shortened to two rows for display; the standard alpha rollout horizon uses 12 rows. The JSON schema accepted by the current internal scripts is deliberately simple: either a nested numeric list or an object containing an action array can be adapted into the Cosmos 3 sample format. The forthcoming public inference package will include validated loaders, conversion helpers for common Open-H layouts, and examples.
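Until the validated loaders are published, a local shape check can catch malformed action files early. The sketch below is an assumption-laden stand-in, not the forthcoming loader: `load_action_chunk` is a hypothetical name, and the object key `"action"` is a guess at what "an object containing an action array" looks like.

```python
import json
import math


def load_action_chunk(text: str, horizon: int = 12, dim: int = 44, key: str = "action"):
    """Parse a JSON action chunk and verify it is `horizon` rows of `dim` finite numbers."""
    data = json.loads(text)
    if isinstance(data, dict):
        # ASSUMPTION: the array lives under `key`; the real schema may differ.
        if key not in data:
            raise ValueError(f"object form must contain a '{key}' array")
        data = data[key]
    if not isinstance(data, list) or len(data) != horizon:
        raise ValueError(f"expected {horizon} rows")
    rows = []
    for i, row in enumerate(data):
        if not isinstance(row, list) or len(row) != dim:
            raise ValueError(f"row {i}: expected {dim} values")
        for value in row:
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not math.isfinite(value):
                raise ValueError(f"row {i}: non-numeric or non-finite value {value!r}")
        rows.append([float(value) for value in row])
    return rows
```

The two-row display snippet above would be rejected by this check with the default horizon, which is the intended behavior for a standard 12-step rollout.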

How to run the alpha checkpoint today

The current checkpoint is a Cosmos 3 DCP checkpoint, not a standalone Diffusers pipeline. Use it with a matching Cosmos 3 checkout plus the surgical simulator adapter that registers the open_h_surgical_sim embodiment and experiment.

At a high level:

Download or clone this repository with Git LFS enabled.
Download the Cosmos 3 codebase and the tokenizer/VAE assets required by the Cosmos 3 Super inference stack.
Use checkpoints/iter_000000060 as the checkpoint path.
Use the surgical simulator experiment config that enables the Open-H action bridge and LoRA modules.
Run inference with regular weights, not EMA weights. This alpha was trained with EMA disabled, so inference should pass --no-use-ema-weights.

An illustrative sample input looks like this:

{
  "name": "example_open_h_rollout",
  "model_mode": "forward_dynamics",
  "vision_path": "first_frame.png",
  "action_path": "actions.json",
  "domain_name": "open_h_surgical_sim",
  "view_point": "ego_view",
  "num_frames": 13,
  "action_chunk_size": 12,
  "fps": 30,
  "image_size": 256,
  "num_steps": 16,
  "guidance": 3.0,
  "shift": 5.0,
  "seed": 3407
}
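For batch preparation, the sample above can be generated programmatically. The helper below is a hypothetical convenience (`build_sample` is not part of Cosmos 3); it reuses the field names and default values from the illustrative sample and adds one consistency check, namely that the frame count equals the action chunk size plus the conditioning frame.

```python
import json


def build_sample(name: str, vision_path: str, action_path: str, **overrides) -> dict:
    """Build a forward-dynamics sample dict using the early-rollout defaults."""
    sample = {
        "name": name,
        "model_mode": "forward_dynamics",
        "vision_path": vision_path,
        "action_path": action_path,
        "domain_name": "open_h_surgical_sim",
        "view_point": "ego_view",
        "num_frames": 13,
        "action_chunk_size": 12,
        "fps": 30,
        "image_size": 256,
        "num_steps": 16,
        "guidance": 3.0,
        "shift": 5.0,
        "seed": 3407,
    }
    sample.update(overrides)
    # One conditioning frame plus one generated frame per action.
    if sample["num_frames"] != sample["action_chunk_size"] + 1:
        raise ValueError("num_frames must equal action_chunk_size + 1")
    return sample


if __name__ == "__main__":
    print(json.dumps(build_sample("example_open_h_rollout", "first_frame.png", "actions.json"), indent=2))
```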

An illustrative Cosmos 3 invocation is:

torchrun --nproc-per-node=8 \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  --dp-shard-size=8 \
  --dp-replicate-size=1 \
  --cp-size=1 \
  --cfgp-size=1 \
  --max-num-seqs=1 \
  -i "samples/*/cosmos3_input.json" \
  -o outputs/rollouts \
  --checkpoint-path checkpoints/iter_000000060 \
  --config-file cosmos3_h_surgical_simulator/experiment.py \
  --experiment cosmos3_super_openh_surgical_lora \
  --seed 3407 \
  --no-use-ema-weights \
  --no-guardrails \
  --no-use-torch-compile \
  --no-use-cuda-graphs \
  --experiment-overrides model.config.tokenizer.vae_path=/path/to/Wan2.2_VAE.pth

This command is shown to document the contract, not as a polished public API. Full inference code, conversion utilities, tested examples, and a smaller developer-facing wrapper are forthcoming.

Hardware note

Cosmos 3 Super is super... large. The current alpha has been run successfully on a single 8x80GB GPU node using 8-way data-parallel sharding. A single large GPU can be insufficient for model instantiation depending on the exact memory allocator state, CUDA graph settings, and checkpoint load path. Early evaluators should plan for an 8xA100 80GB or 8xH100 80GB class node until the public inference wrapper and memory notes are finalized.

FlashDreams-style live simulator integration

The intended downstream shape is a FlashDreams-style interactive simulator runner rather than a one-off batch script. This model belongs in a runner-plugin or serving-adapter lane, not a config-only lane:

the backbone is Cosmos 3 Super, a mixture-of-transformers omnimodal world model rather than a Wan-family DiT, so it will need some love, care and adaptation;
the checkpoint is a native Cosmos 3 PyTorch DCP tree with LoRA weights;
the model-specific conditioner is the 44D open_h_surgical_sim action trajectory plus surgical view/domain metadata; and
the runtime must coordinate a live frame buffer, action chunking, model sampling, decode, and simulator/display feedback.

A practical FlashDreams-style integration would have the following pieces.

1. Runner config

Define a runner slug such as:

cosmos3-h-surgical-simulator-alpha

The runner config should expose:

ckpt_path: path or Hub snapshot for checkpoints/iter_000000060;
vae_path: local Wan/Cosmos VAE path used by the Cosmos 3 tokenizer;
input_frame: current endoscopic RGB frame;
action_source: a live teleoperation, robot API, simulation, or logged Open-H action stream;
output_path or stream_sink: mp4 writer, shared-memory texture, WebRTC stream, or simulator viewport sink;
sampling knobs: steps, guidance, shift, seed, FPS, horizon length;
safety flags: disable clinical use, mark outputs as simulated/predicted, and prevent direct closed-loop robot actuation without a separate safety layer.
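One way to express the static part of this config is a frozen dataclass. This is a sketch under assumptions: `RunnerConfig` and its field names are hypothetical, the default output path is a placeholder, and the live inputs (`input_frame`, `action_source`) are omitted because they are runtime objects rather than static settings. No FlashDreams API is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnerConfig:
    """Static settings for a hypothetical cosmos3-h-surgical-simulator-alpha runner."""

    ckpt_path: str                             # e.g. checkpoints/iter_000000060
    vae_path: str                              # local VAE used by the tokenizer
    output_path: str = "outputs/rollout.mp4"   # placeholder sink
    num_steps: int = 16
    guidance: float = 3.0
    shift: float = 5.0
    seed: int = 3407
    fps: int = 30
    horizon: int = 12
    # Safety flags: the defaults are the only values this sketch accepts.
    clinical_use: bool = False
    allow_direct_actuation: bool = False
    label_outputs_as_predicted: bool = True

    def __post_init__(self) -> None:
        if self.clinical_use or self.allow_direct_actuation:
            raise ValueError("clinical use and direct actuation are out of scope")
        if not self.label_outputs_as_predicted:
            raise ValueError("outputs must be marked as simulated/predicted")
        if self.horizon < 1 or self.fps < 1:
            raise ValueError("horizon and fps must be positive")

    @property
    def num_frames(self) -> int:
        """Conditioning frame plus one generated frame per action."""
        return self.horizon + 1
```

Rejecting unsafe flag values at construction time means a misconfigured runner fails before any checkpoint is loaded.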

2. Action conditioner

The runner should normalize all supported robot-side sources into the 44D bridge before handing the sample to Cosmos 3:

robot/sim state + target command
    -> dual-arm relative pose command
    -> 20D dVRK prefix
    -> 44D open_h_surgical_sim row
    -> 12-row action chunk

For a live teleoperation system, this action conditioner should run at the same temporal cadence as the video stream. If the controller produces higher-rate servo targets, the integration should resample or window those targets to the model FPS, preserving the intended physical horizon rather than merely slicing the first 12 commands.
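The difference between windowing and slicing can be sketched as follows. At 30 FPS, a 12-action chunk spans 12 / 30 = 0.4 s, whereas the first 12 commands of a 1 kHz servo stream span only about 12 ms. The helper `window_targets` is hypothetical; it uses a zero-order hold (latest command at or before each query time), and the choice to place action k at time (k + 1) / fps after the start is an assumption.

```python
import numpy as np


def window_targets(timestamps, targets, t_start: float, fps: float = 30.0, horizon: int = 12):
    """Pick one servo target per model step so the chunk spans horizon / fps seconds."""
    timestamps = np.asarray(timestamps, dtype=np.float64)  # must be sorted ascending
    targets = np.asarray(targets)
    # ASSUMPTION: action k describes the motion ending at t_start + (k + 1) / fps.
    query = t_start + np.arange(1, horizon + 1) / fps
    if query[-1] > timestamps[-1]:
        raise ValueError("servo stream does not cover the full physical horizon")
    # Zero-order hold: index of the latest command at or before each query time.
    idx = np.searchsorted(timestamps, query, side="right") - 1
    if idx[0] < 0:
        raise ValueError("no command available before the first query time")
    return targets[idx]
```

Selecting existing samples avoids interpolating rotations, which would need a proper scheme such as slerp. The selected absolute targets would then be converted to relative commands by the action conditioner.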

3. Live rollout loop

A simple interactive loop is:

Capture or receive the current endoscopic frame.
Read the next 12 target actions from the teleoperation, planner, or simulator timeline.
Convert the action chunk to the 44D bridge.
Build one Cosmos 3 sample with model_mode = forward_dynamics, domain_name = open_h_surgical_sim, and the current frame as vision_path.
Run the Cosmos 3 sampler for a 13-frame prediction.
Decode and publish the predicted video to the simulator UI.
Advance by a configurable stride and repeat.
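The steps above can be sketched as a loop over injected callables. Every callable here is a hypothetical stand-in: `sample_video` represents the Cosmos 3 sampler plus decode, `to_bridge` the action conditioner, and `publish` the simulator UI sink. None of them is a real Cosmos 3 or FlashDreams API.

```python
from typing import Any, Callable, Sequence


def run_rollouts(
    get_frame: Callable[[], Any],
    get_actions: Callable[[int], Sequence[Any]],
    to_bridge: Callable[[Sequence[Any]], Sequence[Sequence[float]]],
    sample_video: Callable[[Any, Sequence[Sequence[float]]], Sequence[Any]],
    publish: Callable[[Sequence[Any]], None],
    advance: Callable[[int], None],
    num_chunks: int,
    horizon: int = 12,
    stride: int = 12,
) -> int:
    """Run `num_chunks` predict-and-publish iterations; return how many were published."""
    published = 0
    for _ in range(num_chunks):
        frame = get_frame()                       # current endoscopic frame
        chunk = to_bridge(get_actions(horizon))   # next actions as 44D rows
        if len(chunk) != horizon or any(len(row) != 44 for row in chunk):
            raise ValueError("action conditioner must return [horizon, 44] rows")
        video = sample_video(frame, chunk)        # forward_dynamics prediction
        if len(video) != horizon + 1:
            raise ValueError("expected one conditioning frame plus one frame per action")
        publish(video)                            # present as predicted, not ground truth
        advance(stride)                           # move the timeline forward
        published += 1
    return published
```

Injecting the callables keeps the loop testable on CPU with stubs before any checkpoint is loaded.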

For a true interactive simulator, the output should be presented as a predicted future, not as ground truth. The most natural UI is a split or overlay view: live endoscope frame, model-predicted rollout, and optional comparison against logged video when replay data is available.

4. Verification before calling it integrated

The FlashDreams integration should be verified in the same order recommended by the FlashDreams model-integration workflow:

CPU smoke test: runner imports, static config is discoverable, and the action conditioner returns [T, 44] tensors for known Open-H fixtures.
Checkpoint key/shape check: prove the DCP/LoRA checkpoint loads into the selected Cosmos 3 model without dropping LoRA keys. For this alpha, inference should skip only keys under the EMA prefix (net_ema.) and should not skip any lora_ keys.
GPU rollout smoke: one frame plus one 12-step action chunk produces a valid 13-frame mp4.
Upstream parity: compare against the reference Cosmos 3 inference path on the same input, seed, and checkpoint.
Interactive smoke: connect the runner to a live or replayed action source, stream predictions to the UI, and verify that timing, action windowing, and visual output remain stable over repeated chunks.
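The checkpoint key check in particular is easy to automate on CPU. The sketch below works on plain key-name lists and assumes only what is stated above: EMA keys start with `net_ema.` and LoRA keys contain `lora_`. The function names and the example key names in the usage note are hypothetical.

```python
from typing import Iterable, List

EMA_PREFIX = "net_ema."
LORA_MARKER = "lora_"


def keys_to_load(checkpoint_keys: Iterable[str]) -> List[str]:
    """Drop only EMA keys; everything else, including LoRA keys, is kept."""
    return [key for key in checkpoint_keys if not key.startswith(EMA_PREFIX)]


def assert_lora_loaded(checkpoint_keys: Iterable[str], loaded_keys: Iterable[str]) -> int:
    """Verify every non-EMA LoRA key in the checkpoint was actually loaded.

    Returns the number of LoRA keys checked.
    """
    loaded = set(loaded_keys)
    expected = [
        key for key in checkpoint_keys
        if LORA_MARKER in key and not key.startswith(EMA_PREFIX)
    ]
    if not expected:
        raise ValueError("checkpoint contains no LoRA keys; wrong checkpoint or marker")
    missing = [key for key in expected if key not in loaded]
    if missing:
        raise ValueError(f"{len(missing)} LoRA keys were dropped, e.g. {missing[0]}")
    return len(expected)
```

In use, `loaded_keys` would be the set of keys the model reported as restored, for example a made-up name like `net.blocks.0.q_proj_moe_gen.lora_A.weight`, compared against the full key list read from the checkpoint metadata.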

No official FlashDreams plugin for this alpha is included in the current Hub repo. The section above is the planned integration contract and a guide for early reviewers who want to prototype the runner while the full release code is being prepared.

Training configuration summary

The current alpha was trained as a LoRA fine-tune on top of Cosmos 3 Super:

LoRA enabled on generated-modality attention projections: q_proj_moe_gen, k_proj_moe_gen, v_proj_moe_gen, o_proj_moe_gen.
LoRA rank: 16.
LoRA alpha: 32.
Precision: bfloat16.
Action generation enabled.
Maximum model action dimension: 64.
Surgical simulator raw action dimension: 44.
Open-H chunk length: 12 actions.
Primary mode: forward_dynamics.
Cosmos resolution: 256.
EMA: disabled.
Checkpoint format: PyTorch Distributed Checkpoint.

This is an early release. Future checkpoints may change behavior, action coverage, and quality. We are releasing this alpha now because early feedback on the action interface, simulation loop, and evaluation protocol is more valuable than waiting for a fully polished checkpoint.

Intended uses

This model is intended for:

research on action-conditioned world modeling for surgical robotics;
offline rollout generation from logged Open-H-style action trajectories;
evaluation of Cosmos 3 as a base for robot-conditioned video simulation; and
studying how heterogeneous medical robotics kinematics can be mapped into a shared world-model action space.

Out-of-scope uses

Do not use this alpha for:

patient care;
clinical decision support;
autonomous surgical control;
direct robot actuation;
evaluation as a validated medical device; and
training or validating a policy without independent safety, regulatory, and dataset-governance review.

Generated videos may be visually plausible while still being physically wrong. The model should not be used as a source of clinical truth.

Limitations

This is an alpha checkpoint from an ongoing training run.
The public inference wrapper has not yet been released.
The model has not undergone systematic benchmark evaluation.
The current action bridge prioritizes safely mapped dVRK-style and dual-arm Cartesian layouts; unsupported Open-H layouts were skipped.
The model is short-horizon and should not be assumed to maintain physical consistency over long rollouts.
Surgical data is heterogeneous across institutions, embodiments, procedures, cameras, and annotation conventions.
The model may copy visual biases, camera artifacts, occlusions, and workflow distributions present in the training data.
The current checkpoint is large and requires substantial GPU memory.

Acknowledgements

This work depends directly on the NVIDIA Cosmos 3 release and on the Open-H-Embodiment community dataset. We thank the Cosmos 3 authors for releasing an open omnimodal world-model stack and the Open-H-Embodiment contributors for making paired surgical video and kinematics available for research.

Open-H-Embodiment is maintained by NVIDIA with contributions from a broad medical robotics community.

Citation

If you use this alpha checkpoint, please cite this model, Cosmos 3, and Open-H-Embodiment:

@misc{voncsefalvay2026cosmos3hsurgicalsimulatoralpha,
  title        = {Cosmos-3-H-Surgical-Simulator-alpha: An Action-Conditioned Cosmos 3 Super Fine-Tune for Surgical Robotics World Modeling},
  author       = {von Csefalvay, Chris},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/chrisvoncsefalvay/cosmos3-h-surgical-simulator}},
  note         = {Pre-release action-conditioned fine-tune of nvidia/Cosmos3-Super on Open-H-Embodiment}
}

@misc{nvidia2026cosmos3,
  title         = {Cosmos 3: Omnimodal World Models for Physical AI},
  author        = {{NVIDIA} and Aditi and Niket Agarwal and others},
  year          = {2026},
  eprint        = {2606.02800},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2606.02800},
  url           = {https://arxiv.org/abs/2606.02800}
}

@misc{openh2026embodiment,
  title         = {Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics},
  author        = {{Open-H-Embodiment Consortium}},
  year          = {2026},
  eprint        = {2604.21017},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2604.21017}
}