---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-action
- inverse-dynamics-model
- embodied-ai
- game-ai
- internvl
datasets:
- open-world-agents/D2E-480p
- open-world-agents/D2E-Original
arxiv: 2510.05684
---

# Generalist-IDM-1B

**Generalist Inverse Dynamics Model** for predicting keyboard and mouse actions from gameplay video.

[Project Page](https://worv-ai.github.io/d2e/) · [Paper (arXiv)](https://arxiv.org/abs/2510.05684) · [GitHub](https://github.com/worv-ai/D2E) · [Demo](https://huggingface.co/spaces/lastdefiance20/Generalist-IDM)

## Model Description

Generalist-IDM-1B is a vision-action model trained on the [D2E dataset](https://huggingface.co/datasets/open-world-agents/D2E-480p), 267 hours of synchronized gameplay video and input events from 29 PC games. Given a trajectory of screen frames and actions, the model predicts the missing actions between observations (Inverse Dynamics Model).

- **Architecture**: Based on InternVL with 0.9B parameters
- **Input**: Trajectory containing screen frames (448×448) and keyboard/mouse events with timestamps
- **Output**: Predicted keyboard and mouse events for gaps in the trajectory
- **Training Data**: 29 PC games across diverse genres (FPS, open-world, sandbox, roguelike, etc.)
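To make the inverse-dynamics framing concrete, the sketch below groups timestamped input events by the gap between the two frames that bracket them; each group is what a model of this kind would be asked to predict for that gap. The `Event` class, its fields, and `group_events_by_frame_gap` are hypothetical names chosen for illustration. They are not the D2E schema or the model's actual input encoding.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical input event (illustration only, not the D2E schema)."""
    timestamp_ns: int
    device: str  # "keyboard" or "mouse"
    detail: str  # free-form description, e.g. "press W"


def group_events_by_frame_gap(frame_ts_ns, events):
    """Assign each event to the gap between the frames that bracket it.

    frame_ts_ns must be sorted ascending. Entry i of the result holds the
    events with frame_ts_ns[i] <= timestamp_ns < frame_ts_ns[i + 1].
    Events before the first frame or at/after the last frame are dropped.
    """
    gaps = [[] for _ in range(max(len(frame_ts_ns) - 1, 0))]
    for event in sorted(events, key=lambda e: e.timestamp_ns):
        i = bisect_right(frame_ts_ns, event.timestamp_ns) - 1
        if 0 <= i < len(gaps):
            gaps[i].append(event)
    return gaps


# Three frames 50 ms apart and two events, one in each gap.
frames = [0, 50_000_000, 100_000_000]
events = [
    Event(60_000_000, "mouse", "move dx=4 dy=-1"),
    Event(10_000_000, "keyboard", "press W"),
]
targets = group_events_by_frame_gap(frames, events)
```

Here `targets[0]` contains the key press and `targets[1]` the mouse move.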

## Quick Start

The easiest way to run inference is with the standalone script from the [D2E repository](https://github.com/worv-ai/D2E):

```bash
# Clone the repository
git clone https://github.com/worv-ai/D2E.git
cd D2E

# Run inference (dependencies auto-installed by uv)
uv run inference.py input_video.mp4 output.mcap
```

### Prerequisites

- [uv](https://docs.astral.sh/uv/)
- FFmpeg
- CUDA-capable GPU (~8GB+ VRAM)

### Options

```bash
uv run inference.py input_video.mp4 output.mcap --device cuda       # GPU inference (default)
uv run inference.py input_video.mp4 output.mcap --device cpu        # CPU inference
uv run inference.py input_video.mp4 output.mcap --max-duration 30   # Limit to 30 seconds
```

> ⏱️ **Inference Time**: On an H100, processing 1 second of video takes ~6 seconds. For a 1-minute video, expect ~6 minutes of inference time.

## Output Format

The output is an [MCAP](https://mcap.dev/) file containing predicted keyboard and mouse events with nanosecond timestamps synchronized to the input video.

You can visualize the output using the [Dataset Visualizer](https://huggingface.co/spaces/open-world-agents/visualize_dataset).

## Programmatic Usage

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model weights in bfloat16 onto the GPU
model = AutoModelForImageTextToText.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Load the matching processor
processor = AutoProcessor.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    trust_remote_code=True,
)
```

For the full inference pipeline with video preprocessing and MCAP output, see [`inference.py`](https://github.com/worv-ai/D2E/blob/main/inference.py).
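
The throughput figure quoted under Quick Start (about 6 seconds of compute per second of video on an H100) can be used to budget a run before starting it. The helper below is a minimal sketch of that arithmetic. `estimate_inference_seconds` is a hypothetical function, not part of the D2E repository, and it assumes that `--max-duration` simply caps how many seconds of the clip are processed.

```python
def estimate_inference_seconds(video_seconds, seconds_per_video_second=6.0, max_duration=None):
    """Rough wall-clock estimate for processing one clip.

    seconds_per_video_second defaults to the ~6x figure quoted for an H100;
    other hardware will differ, so calibrate it on a short clip first.
    max_duration mirrors the --max-duration flag (assumption: only that many
    seconds of the clip are processed).
    """
    if video_seconds < 0:
        raise ValueError("video_seconds must be non-negative")
    processed = video_seconds if max_duration is None else min(video_seconds, max_duration)
    return processed * seconds_per_video_second


one_minute = estimate_inference_seconds(60)                # 360.0 s, about 6 minutes
capped = estimate_inference_seconds(600, max_duration=30)  # 180.0 s
```

Treat the result as an order-of-magnitude guide: the quoted ratio is approximate, and video decoding and model loading add overhead that this estimate ignores.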

## Training Data

This model was trained on the D2E dataset:

| Dataset | Resolution | Description |
|---------|------------|-------------|
| [D2E-480p](https://huggingface.co/datasets/open-world-agents/D2E-480p) | 480p 60fps | 267 hours from 29 PC games |
| [D2E-Original](https://huggingface.co/datasets/open-world-agents/D2E-Original) | FHD/QHD | Original resolution recordings |

## Citation

```bibtex
@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
```

## License

Apache 2.0