---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - vision-action
  - inverse-dynamics-model
  - embodied-ai
  - game-ai
  - internvl
datasets:
  - open-world-agents/D2E-480p
  - open-world-agents/D2E-Original
arxiv: 2510.05684
---

# Generalist-IDM-1B

**Generalist Inverse Dynamics Model** for predicting keyboard and mouse actions from gameplay video.

[Project Page](https://worv-ai.github.io/d2e/) · [Paper (arXiv)](https://arxiv.org/abs/2510.05684) · [GitHub](https://github.com/worv-ai/D2E) · [Demo](https://huggingface.co/spaces/lastdefiance20/Generalist-IDM)

## Model Description

Generalist-IDM-1B is a vision-action model trained on the [D2E dataset](https://huggingface.co/datasets/open-world-agents/D2E-480p), which contains 267 hours of synchronized gameplay video and input events from 29 PC games. It works as an inverse dynamics model (IDM): given a trajectory of screen frames and actions, it predicts the missing actions between observations.

- **Architecture**: Based on InternVL with 0.9B parameters
- **Input**: Trajectory containing screen frames (448×448) and keyboard/mouse events with timestamps
- **Output**: Predicted keyboard and mouse events for gaps in the trajectory
- **Training Data**: 29 PC games across diverse genres (FPS, open-world, sandbox, roguelike, etc.)

## Quick Start

The easiest way to run inference is to use the standalone script from the [D2E repository](https://github.com/worv-ai/D2E):

```bash
# Clone the repository
git clone https://github.com/worv-ai/D2E.git
cd D2E

# Run inference (dependencies auto-installed by uv)
uv run inference.py input_video.mp4 output.mcap
```

### Prerequisites

- [uv](https://docs.astral.sh/uv/)
- FFmpeg
- CUDA-capable GPU with roughly 8 GB of VRAM or more

### Options

```bash
uv run inference.py input_video.mp4 output.mcap --device cuda        # GPU inference (default)
uv run inference.py input_video.mp4 output.mcap --device cpu         # CPU inference
uv run inference.py input_video.mp4 output.mcap --max-duration 30    # Limit to 30 seconds
```
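To process several videos with the same options, the command line above can be wrapped in a small script. The following is a minimal sketch, not part of the D2E repository: `build_command` and `run_batch` are hypothetical helper names, and the only flags used are the `--device` and `--max-duration` options shown above. It assumes it is run from the root of the cloned D2E repository with `uv` on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_command(video, output, device="cuda", max_duration=None):
    """Return the argument list for one `uv run inference.py` invocation."""
    cmd = ["uv", "run", "inference.py", str(video), str(output), "--device", device]
    if max_duration is not None:
        # Only pass the flag when a limit (in seconds) was requested.
        cmd += ["--max-duration", str(max_duration)]
    return cmd


def run_batch(video_dir, output_dir, **options):
    """Run inference on every .mp4 in video_dir, writing one .mcap per video."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        output = output_dir / f"{video.stem}.mcap"
        # check=True stops the batch at the first failing run.
        subprocess.run(build_command(video, output, **options), check=True)
```

For example, `run_batch("videos", "predictions", max_duration=30)` would process each file in a hypothetical `videos/` directory, limited to its first 30 seconds.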

> ⏱️ **Inference Time**: On an H100, processing 1 second of video takes ~6 seconds, so a 1-minute video needs about 6 minutes of inference time.
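The ratio quoted above can be used to budget longer runs. This is a rough sketch that assumes the cost scales linearly with video length; the 6x factor is the approximate H100 figure from the note, and other GPUs (or CPU inference) will need a different, measured value. The function name is illustrative.

```python
# Approximate seconds of compute per second of video, as quoted for an H100.
H100_SECONDS_PER_VIDEO_SECOND = 6.0


def estimate_inference_seconds(video_seconds, ratio=H100_SECONDS_PER_VIDEO_SECOND):
    """Estimate wall-clock inference time, assuming linear scaling with length."""
    if video_seconds < 0:
        raise ValueError("video_seconds must be non-negative")
    return video_seconds * ratio


if __name__ == "__main__":
    for seconds in (30, 60, 600):
        minutes = estimate_inference_seconds(seconds) / 60
        print(f"{seconds:>4} s of video -> about {minutes:.1f} min of inference")
```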

## Output Format

The output is an [MCAP](https://mcap.dev/) file containing predicted keyboard and mouse events with nanosecond timestamps synchronized to the input video. You can visualize the output using the [Dataset Visualizer](https://huggingface.co/spaces/open-world-agents/visualize_dataset).

<img src="https://github.com/open-world-agents/owa-dataset-visualizer/blob/main/.github/assets/viewer.png?raw=true" alt="Dataset Visualizer Preview" width="600">
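The output file can also be inspected in code. The sketch below uses the generic `mcap` Python package (`pip install mcap`) to count messages per topic and measure the time span covered. It makes no assumption about the topic names or message encoding that `inference.py` writes, so payloads are not decoded; `iter_mcap_records` and `summarize` are illustrative helper names.

```python
from collections import Counter


def iter_mcap_records(path):
    """Yield (topic, log_time_ns) for every message in an MCAP file."""
    from mcap.reader import make_reader  # imported lazily; needs `pip install mcap`

    with open(path, "rb") as f:
        for _schema, channel, message in make_reader(f).iter_messages():
            # log_time is the message timestamp in nanoseconds.
            yield channel.topic, message.log_time


def summarize(records):
    """Return (messages per topic, covered time span in seconds)."""
    counts = Counter()
    first = last = None
    for topic, t_ns in records:
        counts[topic] += 1
        first = t_ns if first is None else min(first, t_ns)
        last = t_ns if last is None else max(last, t_ns)
    span_s = 0.0 if first is None else (last - first) / 1e9
    return dict(counts), span_s
```

Calling `summarize(iter_mcap_records("output.mcap"))` on a prediction file gives a quick sanity check that events were written and that their timestamps cover roughly the duration of the input video.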

## Programmatic Usage

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    trust_remote_code=True,
)
```

For the full inference pipeline, including video preprocessing and MCAP output, see [`inference.py`](https://github.com/worv-ai/D2E/blob/main/inference.py).

## Training Data

This model was trained on the D2E dataset:

| Dataset | Resolution | Description |
|---------|------------|-------------|
| [D2E-480p](https://huggingface.co/datasets/open-world-agents/D2E-480p) | 480p 60fps | 267 hours from 29 PC games |
| [D2E-Original](https://huggingface.co/datasets/open-world-agents/D2E-Original) | FHD/QHD | Original resolution recordings |

## Citation

```bibtex
@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
```

## License

Apache 2.0