DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Chen Shi*, Jinrui Xu*, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang†

The Chinese University of Hong Kong, Shenzhen & Voyager Research, Didi Chuxing

*Equal Contribution, †Corresponding Author

DriveWAM is a joint video generation and action prediction model for autonomous driving. It adapts a pretrained video diffusion transformer into an autoregressive video-action policy by organizing the video and action streams into a unified temporal token sequence trained under a joint flow-matching objective. This preserves the video generation priors while extending the model to ego-motion action prediction.

Highlights

NavSim

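Comparison on NAVSIM v1. *: results with imitation learning. †: trained with multiple trajectory anchors. MV: multi-view cameras; SV: single-view camera; L: LiDAR. NC: no at-fault collision; DAC: drivable area compliance; TTC: time-to-collision; C.: comfort; EP: ego progress; PDMS: PDM score.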

Method	Ref	Sensors	NC ↑	DAC ↑	TTC ↑	C. ↑	EP ↑	PDMS ↑
Human	–	–	100.0	100.0	100.0	99.9	87.5	94.8
UniAD	CVPR'23	MV	97.8	91.9	92.9	100.0	78.8	83.4
TransFuser	TPAMI'23	MV & L	97.7	92.8	92.8	100.0	79.2	84.0
PARA-Drive	CVPR'24	MV	97.9	92.4	93.0	99.8	79.3	84.0
LAW	ICLR'25	SV	96.4	95.4	88.7	99.9	81.7	84.6
DiffusionDrive	CVPR'25	MV & L	98.2	96.2	94.7	100.0	82.2	88.1
WoTE	ICCV'25	MV & L	98.5	96.8	94.4	99.9	81.9	88.3
VLA-based Methods
ReCogDrive*	ICLR'26	MV	98.1	94.7	94.2	100.0	80.9	86.5
DriveVLA-W0	ICLR'26	SV	98.7	96.2	95.5	100.0	82.2	88.4
AutoVLA	NeurIPS'25	MV	98.4	95.6	98.0	99.9	81.9	89.1
DriveDreamer-Policy	arXiv'26	MV	98.4	97.1	95.1	100.0	83.5	89.2
DriveVLA-W0†	ICLR'26	SV	98.7	99.1	95.3	99.3	83.3	90.2
WA-based Methods
Epona	ICCV'25	SV	97.9	95.1	93.8	99.9	80.4	86.2
WorldDrive	arXiv'26	SV	98.4	95.8	95.2	99.8	83.3	89.0
DriveWAM (Ours)	–	SV	98.3	98.1	95.2	100.0	84.3	90.1
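
For reference, NAVSIM v1 combines the sub-metrics above into the PDM score as PDMS = NC × DAC × (5·EP + 5·TTC + 2·C) / 12, computed per scenario and then averaged over the split. The snippet below is a minimal sketch of that formula, not the official scorer; the function name `pdm_score` is hypothetical, and inputs are fractions in [0, 1] rather than percentages.

```python
def pdm_score(nc, dac, ttc, comfort, ep):
    """Combine NAVSIM v1 sub-metrics (fractions in [0, 1]) into a PDM score."""
    metrics = {"nc": nc, "dac": dac, "ttc": ttc, "comfort": comfort, "ep": ep}
    for name, value in metrics.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # NC and DAC act as multiplicative penalties; the remaining terms
    # form a weighted average with weights 5 (EP), 5 (TTC), and 2 (comfort).
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    return nc * dac * weighted


# Human row of the table, converted from percentages to fractions.
human = pdm_score(nc=1.0, dac=1.0, ttc=1.0, comfort=0.999, ep=0.875)
print(f"{100 * human:.1f}")  # 94.8
```

Because the benchmark averages per-scenario scores, plugging a method's aggregate column values into the formula will generally not reproduce its reported PDMS exactly.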

PhysicalAI-AV

Comparison on PhysicalAI-Autonomous-Vehicles.

Method	Source	ADE@3s ↓	FDE@3s ↓	ADE@4s ↓	FDE@4s ↓
VaVAM	Valeo	2.31	4.32	–	–
Alpamayo-1.5	NVIDIA	0.80	2.31	1.44	4.18
DriveWAM (Ours)	–	0.47	1.35	0.83	2.47
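
ADE (average displacement error) is the mean Euclidean distance between predicted and ground-truth waypoints over the horizon, and FDE (final displacement error) is the distance at the last waypoint of the horizon. The snippet below is a minimal sketch of these two metrics, not the repository's evaluation code. The function name `ade_fde`, the (T, 2) waypoint layout, and the 10 Hz waypoint rate are assumptions made for illustration.

```python
import numpy as np


def ade_fde(pred, gt, horizon_s, hz=10.0):
    """Return (ADE, FDE) over the first `horizon_s` seconds, in the input units.

    pred, gt: arrays of shape (T, 2) holding future (x, y) waypoints,
    where row i is the position (i + 1) / hz seconds ahead (assumed layout).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    steps = int(round(horizon_s * hz))
    if pred.shape != gt.shape or pred.shape[0] < steps:
        raise ValueError("trajectories must share a shape and cover the horizon")
    # Per-waypoint Euclidean distance between prediction and ground truth.
    dist = np.linalg.norm(pred[:steps] - gt[:steps], axis=-1)
    return float(dist.mean()), float(dist[-1])


# Toy example: the prediction drifts sideways by 0.1 units per waypoint.
t = np.arange(1, 41) / 10.0
gt = np.stack([5.0 * t, np.zeros_like(t)], axis=1)
pred = gt + np.stack([np.zeros_like(t), 0.1 * np.arange(1, 41)], axis=1)
ade3, fde3 = ade_fde(pred, gt, horizon_s=3.0)
ade4, fde4 = ade_fde(pred, gt, horizon_s=4.0)
```

In the toy example the error grows linearly with time, so both metrics are larger at the 4 s horizon than at 3 s, which matches the pattern in the table.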

Qualitative Results

Data Scaling

DriveWAM's action prediction error decreases consistently as the training data scales from 4k to 100k clips at a fixed budget of 50k iterations. Scene-evolving (SE) guidance lowers both ADE and FDE at every scale.

# Clips	# Iters	SE Guidance	ADE@4s ↓	FDE@4s ↓
4k	50k	✗	1.21	3.65
4k	50k	✓	1.01	2.95
20k	50k	✗	0.95	2.94
20k	50k	✓	0.94	2.65
100k	50k	✗	0.92	2.75
100k	50k	✓	0.83	2.47

News

[Jun 7, 2026] We open-source all code and model weights.
[May 27, 2026] We release the paper and project page.

Getting Started

Installation

First, clone this repository and set up the environment.

git clone <repo-url>
cd DriveWAM

# 1. Create conda environment
conda env create -f environment.yml
conda activate drivewam

# 2. Install PyTorch (CUDA 12.6)
pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu126

# 3. Install Flash Attention
pip install flash-attn==2.8.3 --no-build-isolation

Two optional extras can be installed when you need the corresponding feature:

# NavSim evaluation extras (for the NavSim benchmark)
pip install -r requirements-navsim.txt

# VLM preprocessing extras (to generate navigation guidance)
pip install vllm qwen-vl-utils

Data Preparation

DriveWAM trains and evaluates on two benchmarks. Prepare whichever you need.

NavSim

Follow the NavSim installation guide to download the nuPlan-based dataset and cache metric files, and export the environment variables from that guide (OPENSCENE_DATA_ROOT, NUPLAN_MAPS_ROOT, NUPLAN_MAP_VERSION). Then extract per-scene samples:

# navtrain split (training)
python -m src.navsim.process_data --output-path ./data/navsim/trainval

# navtest split (evaluation)
python -m src.navsim.process_data \
    --navsim-log-path $OPENSCENE_DATA_ROOT/navsim_logs/test \
    --sensor-blobs-path $OPENSCENE_DATA_ROOT/sensor_blobs/test \
    --scene-filter-config navsim/planning/script/config/common/train_test_split/scene_filter/navtest.yaml \
    --output-path ./data/navsim/test

Each scene becomes one pkl file, which is what the training and evaluation scripts read by default:

./data/navsim/trainval/
    sample_000000.pkl
    sample_000001.pkl
    ...
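
To check the extracted samples before training, the files can be listed and opened with the standard library. This is a minimal sketch: the helper names `list_samples` and `load_sample` are hypothetical, and the structure of each unpickled object depends on `src.navsim.process_data`, so it is not assumed here.

```python
import pickle
from pathlib import Path


def list_samples(root):
    """Return the sample_*.pkl files directly under `root`, sorted by name."""
    return sorted(Path(root).glob("sample_*.pkl"))


def load_sample(path):
    """Unpickle one sample. Only load pickle files you generated or trust."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `len(list_samples("./data/navsim/trainval"))` gives the number of extracted scenes, and `type(load_sample(path))` shows what a single sample contains.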

PhysicalAI-Autonomous-Vehicles

The raw dataset is hosted on Hugging Face and accessed through the physical_ai_av devkit. The devkit requires Python ≥ 3.11, so install it in a separate environment from drivewam:

pip install physical_ai_av

Accept the dataset license on the Hugging Face page, then download the dataset (or a subset of chunks) to ./data/physicalai. DriveWAM only needs the camera_front_wide_120fov camera plus the egomotion and calibration features. Extract 10 Hz clips from the download:

python -m src.physicalai.process_data \
    --dataset_root ./data/physicalai \
    --output_dir ./data/physicalai/front \
    --num_workers 16

This writes one directory per clip:

./data/physicalai/
├── clip_index.parquet          # official train/test split; keep it even if you prune the raw chunks
└── front/
    └── <clip_id>/
        ├── camera_front_wide_120fov.mp4
        └── camera_front_wide_120fov_ego.pkl
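
After extraction, it can be useful to confirm that every clip directory contains both files from the layout above. The check below is a minimal sketch using only the standard library; the helper name `find_incomplete_clips` is hypothetical and not part of the repository.

```python
from pathlib import Path

CAMERA = "camera_front_wide_120fov"
# File names taken from the directory layout shown above.
EXPECTED = (f"{CAMERA}.mp4", f"{CAMERA}_ego.pkl")


def find_incomplete_clips(front_dir):
    """Map each clip directory name to the expected files it is missing."""
    missing = {}
    clip_dirs = sorted(p for p in Path(front_dir).iterdir() if p.is_dir())
    for clip_dir in clip_dirs:
        absent = [name for name in EXPECTED if not (clip_dir / name).is_file()]
        if absent:
            missing[clip_dir.name] = absent
    return missing
```

Calling `find_incomplete_clips("./data/physicalai/front")` returns an empty dict when every clip is complete.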

VLM navigation prompts are used as conditioning during training and inference. We provide pregenerated prompts: training-split prompts are available on Hugging Face; the 1k-sample test-split prompts used for evaluation are included in the repo at src/physicalai/eval_data/prompts_test_sample_1k.json. To regenerate them yourself:

# Step 1 – generate route / BEV / scene-evolving guidance
bash scripts/drivewam_physicalai_vlm_preprocess.sh

# Step 2 – VLM-based clip quality filtering and sub-sampling
SPLIT=train \
bash scripts/drivewam_physicalai_vlm_data_sample.sh

Training

DriveWAM model checkpoints are available on Hugging Face. DriveWAM is trained on top of LingBot-VA Base, a pretrained autoregressive diffusion transformer. Download the base model weights before training.

Key training hyperparameters (see configs for full details):

Hyperparameter	NavSim / PhysicalAI
Training steps	50 000
Learning rate	1e-5
Optimizer	AdamW (β₁=0.9, β₂=0.95, wd=0.1)
Warmup steps	10
Batch size (per GPU)	1
Precision	bfloat16
Input resolution	256×448
SNR shift (video / action)	5.0 / 1.0

All experiments are conducted on 48 × NVIDIA H20 GPUs.
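
The SNR shift values can be read as timestep-shift factors, assuming DriveWAM follows the convention common in flow-matching diffusion models, where a sampled time t in [0, 1] (with t = 1 as pure noise) is warped to shift · t / (1 + (shift − 1) · t). The snippet below is a sketch of that convention only; the function name `shift_timestep` is hypothetical, and the repository configs remain the reference for the exact schedule.

```python
def shift_timestep(t, shift):
    """Warp a flow-matching time t in [0, 1] by a shift factor.

    Assumed convention: t = 1 is pure noise, so shift > 1 moves
    intermediate times toward the high-noise end; shift = 1 is the identity.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    if shift <= 0.0:
        raise ValueError("shift must be positive")
    return shift * t / (1.0 + (shift - 1.0) * t)


# Values from the hyperparameter table: 5.0 for video, 1.0 for action.
video_t = shift_timestep(0.5, shift=5.0)   # 2.5 / 3.0, about 0.833
action_t = shift_timestep(0.5, shift=1.0)  # unchanged: 0.5
```

Under this reading, video tokens are trained at noisier times than action tokens for the same sampled t, while the endpoints t = 0 and t = 1 are left unchanged.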

Edit the config (src/configs/navsim_cfg.py or src/configs/physicalai_cfg.py) to set your paths and hyperparameters, then launch with the matching script:

Benchmark	Config	Launch script
NavSim	src/configs/navsim_cfg.py	scripts/drivewam_navsim_train.sh
PhysicalAI	src/configs/physicalai_cfg.py	scripts/drivewam_physicalai_train.sh

# NavSim
bash scripts/drivewam_navsim_train.sh

# PhysicalAI
bash scripts/drivewam_physicalai_train.sh

For PhysicalAI, we provide three CSV files listing clip IDs at different training data scales (4k / 20k / 100k clips), available on Hugging Face. Set the clip_csv field in src/configs/physicalai_cfg.py to the desired scale before training.

Evaluation

NavSim (PDM Score)

PDM score evaluation requires a metric cache: a set of per-scenario .pkl files that store the precomputed map and route information needed to score each predicted trajectory. The precomputed metric cache for the navtest split is available for download on Hugging Face. To generate it yourself, run:

python navsim/planning/script/run_metric_caching.py \
    train_test_split=navtest \
    cache.cache_path=./data/navsim/metric_cache

This writes one metric_cache.pkl per scenario token under ./data/navsim/metric_cache/. Pass the resulting directory to the evaluation script via --metric-cache-path:

python -m src.navsim.eval \
    --checkpoint-path /path/to/checkpoint \
    --config-name navsim_cfg \
    --dataset-path ./data/navsim/test \
    --metric-cache-path ./data/navsim/metric_cache

PhysicalAI

python -m src.physicalai.eval \
    --checkpoint-path /path/to/checkpoint \
    --config-name physicalai_cfg

Citation

If you find DriveWAM useful, please cite:

@article{shi2026drivewam,
  title={DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving},
  author={Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
  journal={arXiv preprint arXiv:2605.28544},
  year={2026}
}

Acknowledgements

We gratefully acknowledge the following open-source projects that DriveWAM builds upon: Wan2.2, LingBot-VA, NavSim, NVIDIA PhysicalAI-Autonomous-Vehicles.
