DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving
Chen Shi*, Jinrui Xu*, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang†
The Chinese University of Hong Kong, Shenzhen & Voyager Research, Didi Chuxing
*Equal Contribution, † Corresponding Author
DriveWAM is a joint video generation and action prediction model for autonomous driving. It adapts a pretrained video diffusion transformer into an autoregressive video-action policy, organizing video and action streams into a unified temporal token sequence trained under a joint flow-matching objective. This preserves the video generation priors while extending the model to ego-motion action prediction.
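As a rough illustration of what a joint flow-matching objective over two streams can look like, here is a minimal toy sketch. It is not the DriveWAM training code: the linear noise-to-data path, the time convention (t=0 noise, t=1 data), the independent time draw per stream, the equal loss weighting, and the names `flow_matching_pair`, `joint_flow_matching_loss`, and `model` are all assumptions made for the example.

```python
import random


def flow_matching_pair(clean, t, rng):
    """Build the noisy input x_t and the velocity target for one token stream.

    Assumed convention: t=0 is pure noise, t=1 is clean data.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # Point on the straight line between the noise sample and the clean data.
    x_t = [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]
    # The velocity along that straight line is constant.
    velocity = [c - n for n, c in zip(noise, clean)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def joint_flow_matching_loss(video, action, model, rng):
    """Sum of per-stream flow-matching losses (equal weighting is an assumption)."""
    total = 0.0
    for name, clean in (("video", video), ("action", action)):
        t = rng.random()  # hypothetical choice: each stream draws its own time
        x_t, target = flow_matching_pair(clean, t, rng)
        total += mse(model(name, x_t, t), target)
    return total


# Toy usage: a stand-in "model" that always predicts zero velocity.
rng = random.Random(0)
zero_model = lambda name, x_t, t: [0.0] * len(x_t)
loss = joint_flow_matching_loss([0.5, -0.2, 0.1], [1.0, 0.0], zero_model, rng)
```

In the real model both streams are tokens in one sequence handled by the same transformer; the sketch only shows how a single loss can cover both.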
Highlights
NavSim
Comparison on NAVSIM v1. *: results with imitation learning. †: trained with multiple trajectory anchors. MV: multi-view cameras; SV: single-view camera; L: LiDAR.
| Method | Ref | Sensors | NC ↑ | DAC ↑ | TTC ↑ | C. ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 100.0 | 100.0 | 100.0 | 99.9 | 87.5 | 94.8 |
| UniAD | CVPR'23 | MV | 97.8 | 91.9 | 92.9 | 100.0 | 78.8 | 83.4 |
| TransFuser | TPAMI'23 | MV & L | 97.7 | 92.8 | 92.8 | 100.0 | 79.2 | 84.0 |
| PARA-Drive | CVPR'24 | MV | 97.9 | 92.4 | 93.0 | 99.8 | 79.3 | 84.0 |
| LAW | ICLR'25 | SV | 96.4 | 95.4 | 88.7 | 99.9 | 81.7 | 84.6 |
| DiffusionDrive | CVPR'25 | MV & L | 98.2 | 96.2 | 94.7 | 100.0 | 82.2 | 88.1 |
| WoTE | ICCV'25 | MV & L | 98.5 | 96.8 | 94.4 | 99.9 | 81.9 | 88.3 |
| VLA-based Methods | | | | | | | | |
| ReCogDrive* | ICLR'26 | MV | 98.1 | 94.7 | 94.2 | 100.0 | 80.9 | 86.5 |
| DriveVLA-W0 | ICLR'26 | SV | 98.7 | 96.2 | 95.5 | 100.0 | 82.2 | 88.4 |
| AutoVLA | NeurIPS'25 | MV | 98.4 | 95.6 | 98.0 | 99.9 | 81.9 | 89.1 |
| DriveDreamer-Policy | arXiv'26 | MV | 98.4 | 97.1 | 95.1 | 100.0 | 83.5 | 89.2 |
| DriveVLA-W0† | ICLR'26 | SV | 98.7 | 99.1 | 95.3 | 99.3 | 83.3 | 90.2 |
| WA-based Methods | | | | | | | | |
| Epona | ICCV'25 | SV | 97.9 | 95.1 | 93.8 | 99.9 | 80.4 | 86.2 |
| WorldDrive | arXiv'26 | SV | 98.4 | 95.8 | 95.2 | 99.8 | 83.3 | 89.0 |
| DriveWAM (Ours) | - | SV | 98.3 | 98.1 | 95.2 | 100.0 | 84.3 | 90.1 |
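For readers unfamiliar with the PDMS column, the sketch below shows how NAVSIM v1 combines the sub-scores for a single scenario: NC and DAC act as multiplicative penalties, while TTC, comfort (C.), and EP enter a weighted average with weights 5, 2, and 5. The function name `pdms` is ours. The benchmark averages per-scenario scores, so applying the formula to the column averages in the table only approximates the reported PDMS; the Human row happens to match because NC and DAC are both 100.

```python
def pdms(nc, dac, ttc, comfort, ep):
    """NAVSIM v1 PDM Score for one scenario; all inputs are fractions in [0, 1]."""
    weighted = (5.0 * ttc + 2.0 * comfort + 5.0 * ep) / 12.0
    return nc * dac * weighted


# Human row of the table, with percentages converted to fractions.
human = pdms(nc=1.000, dac=1.000, ttc=1.000, comfort=0.999, ep=0.875)
print(f"{100 * human:.1f}")  # prints 94.8
```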
PhysicalAI-AV
Comparison on PhysicalAI-Autonomous-Vehicles.
| Method | Source | ADE@3s ↓ | FDE@3s ↓ | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|---|---|
| VaVAM | Valeo | 2.31 | 4.32 | - | - |
| Alpamayo-1.5 | NVIDIA | 0.80 | 2.31 | 1.44 | 4.18 |
| DriveWAM (Ours) | - | 0.47 | 1.35 | 0.83 | 2.47 |
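ADE and FDE are the average and final displacement errors between a predicted and a ground-truth trajectory over a fixed horizon. A minimal sketch follows, assuming trajectories are lists of (x, y) waypoints in metres sampled at 10 Hz, with the first entry one step into the future. The function name `ade_fde` and the sampling rate are assumptions for illustration, not taken from the evaluation code.

```python
import math


def ade_fde(pred, gt, horizon_s, hz=10):
    """Return (ADE, FDE) in metres over the first `horizon_s` seconds."""
    steps = int(round(horizon_s * hz))
    if len(pred) < steps or len(gt) < steps:
        raise ValueError("trajectories are shorter than the requested horizon")
    # Per-waypoint Euclidean distance between prediction and ground truth.
    errors = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred[:steps], gt[:steps])
    ]
    return sum(errors) / steps, errors[-1]


# Toy usage: ground truth drives straight ahead; prediction is offset 0.5 m sideways.
gt = [(0.5 * (i + 1), 0.0) for i in range(40)]
pred = [(x, 0.5) for x, _ in gt]
ade3, fde3 = ade_fde(pred, gt, horizon_s=3)
ade4, fde4 = ade_fde(pred, gt, horizon_s=4)
```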
Qualitative Results
Data Scaling
DriveWAM's action prediction error improves consistently as training data scales from 4k to 100k clips. Scene-evolving (SE) guidance provides complementary benefit at every scale.
| # Clips | # Iters | SE Guidance | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|---|
| 4k | 50k | ✗ | 1.21 | 3.65 |
| 4k | 50k | ✓ | 1.01 | 2.95 |
| 20k | 50k | ✗ | 0.95 | 2.94 |
| 20k | 50k | ✓ | 0.94 | 2.65 |
| 100k | 50k | ✗ | 0.92 | 2.75 |
| 100k | 50k | ✓ | 0.83 | 2.47 |
News
- [Jun 7, 2026] We open-source all code and model weights.
- [May 27, 2026] We release the paper and project page.
Getting Started
Installation
First, clone this repository and set up the environment.
git clone <repo-url>
cd DriveWAM
# 1. Create conda environment
conda env create -f environment.yml
conda activate drivewam
# 2. Install PyTorch (CUDA 12.6)
pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu126
# 3. Install Flash Attention
pip install flash-attn==2.8.3 --no-build-isolation
Two optional extras can be installed when you need the corresponding feature:
# NavSim evaluation extras (for the NavSim benchmark)
pip install -r requirements-navsim.txt
# VLM preprocessing extras (to generate navigation guidance)
pip install vllm qwen-vl-utils
Data Preparation
DriveWAM trains and evaluates on two benchmarks. Prepare whichever you need.
NavSim
Follow the NavSim installation guide to download the nuPlan-based dataset and cache metric files, and export the environment variables from that guide (OPENSCENE_DATA_ROOT, NUPLAN_MAPS_ROOT, NUPLAN_MAP_VERSION). Then extract per-scene samples:
# navtrain split (training)
python -m src.navsim.process_data --output-path ./data/navsim/trainval
# navtest split (evaluation)
python -m src.navsim.process_data \
--navsim-log-path $OPENSCENE_DATA_ROOT/navsim_logs/test \
--sensor-blobs-path $OPENSCENE_DATA_ROOT/sensor_blobs/test \
--scene-filter-config navsim/planning/script/config/common/train_test_split/scene_filter/navtest.yaml \
--output-path ./data/navsim/test
Each scene becomes one pkl file, which is what the training and evaluation scripts read by default:
./data/navsim/trainval/
sample_000000.pkl
sample_000001.pkl
...
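To sanity-check the extraction output, the per-scene files can be enumerated and loaded with the standard library. This is a minimal sketch: the helper name `iter_samples` is ours, and the structure of each unpickled object is not documented here, so the example only inspects its type. Only unpickle files you produced yourself, since `pickle` can execute arbitrary code.

```python
import pickle
from pathlib import Path


def iter_samples(root):
    """Yield (file name, unpickled object) for every sample_*.pkl under `root`."""
    for path in sorted(Path(root).glob("sample_*.pkl")):
        with path.open("rb") as f:
            yield path.name, pickle.load(f)


def summarize(root, limit=3):
    """Return the names and Python types of the first few samples."""
    summary = []
    for i, (name, sample) in enumerate(iter_samples(root)):
        if i >= limit:
            break
        summary.append((name, type(sample).__name__))
    return summary
```

For example, `summarize("./data/navsim/trainval")` returns a short list of file names with the type of object stored in each.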
PhysicalAI-Autonomous-Vehicles
The raw dataset is hosted on Hugging Face and accessed through the physical_ai_av devkit. The devkit requires Python ≥ 3.11, so install it in a separate environment from drivewam:
pip install physical_ai_av
Accept the dataset license on the Hugging Face page, then download the dataset (or a subset of chunks) to ./data/physicalai. DriveWAM only needs the camera_front_wide_120fov camera plus the egomotion and calibration features. Extract 10 Hz clips from the download:
python -m src.physicalai.process_data \
--dataset_root ./data/physicalai \
--output_dir ./data/physicalai/front \
--num_workers 16
This writes one directory per clip:
./data/physicalai/
├── clip_index.parquet    # official train/test split; keep it even if you prune the raw chunks
└── front/
    └── <clip_id>/
        ├── camera_front_wide_120fov.mp4
        └── camera_front_wide_120fov_ego.pkl
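After extraction, it can be useful to confirm that every clip directory contains both expected files. The sketch below assumes the layout shown above; the helper name `find_incomplete_clips` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the directory layout above.
EXPECTED_FILES = (
    "camera_front_wide_120fov.mp4",
    "camera_front_wide_120fov_ego.pkl",
)


def find_incomplete_clips(front_dir):
    """Return {clip_id: [missing file names]} for clips lacking an expected file."""
    incomplete = {}
    for clip_dir in sorted(Path(front_dir).iterdir()):
        if not clip_dir.is_dir():
            continue
        missing = [name for name in EXPECTED_FILES if not (clip_dir / name).is_file()]
        if missing:
            incomplete[clip_dir.name] = missing
    return incomplete
```

Calling `find_incomplete_clips("./data/physicalai/front")` returns an empty dictionary when every clip is complete.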
VLM navigation prompts are used as conditioning during training and inference. We provide pregenerated prompts: training-split prompts are available on Hugging Face; the 1k-sample test-split prompts used for evaluation are included in the repo at src/physicalai/eval_data/prompts_test_sample_1k.json. To regenerate them yourself:
# Step 1: generate route / BEV / scene-evolving guidance
bash scripts/drivewam_physicalai_vlm_preprocess.sh
# Step 2: VLM-based clip quality filtering and sub-sampling
SPLIT=train \
bash scripts/drivewam_physicalai_vlm_data_sample.sh
Training
DriveWAM model checkpoints are available on Hugging Face. DriveWAM is trained on top of LingBot-VA Base, a pretrained autoregressive diffusion transformer. Download the base model weights before training.
Key training hyperparameters (see configs for full details):
| Hyperparameter | NavSim / PhysicalAI |
|---|---|
| Training steps | 50 000 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
| Warmup steps | 10 |
| Batch size (per GPU) | 1 |
| Precision | bfloat16 |
| Input resolution | 256×448 |
| SNR shift (video / action) | 5.0 / 1.0 |
All experiments are conducted on 48 × NVIDIA H20 GPUs.
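To give some intuition for the SNR shift values in the table, the sketch below uses the shift formula common in rectified-flow models such as Stable Diffusion 3, where a noise level σ in [0, 1] (1 = pure noise) is remapped to shift·σ / (1 + (shift − 1)·σ). Whether DriveWAM applies exactly this form is an assumption; check the configs for the actual definition. The function name `shift_noise_level` is ours.

```python
def shift_noise_level(sigma, shift):
    """Remap a noise level in [0, 1]; shift > 1 pushes samples toward more noise."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


# Compare the two shift values from the hyperparameter table.
for sigma in (0.25, 0.5, 0.75):
    video = shift_noise_level(sigma, 5.0)   # video stream
    action = shift_noise_level(sigma, 1.0)  # action stream (identity)
    print(f"sigma={sigma:.2f}  video={video:.3f}  action={action:.3f}")
```

Under this form, a shift of 1.0 leaves the noise schedule unchanged, while 5.0 spends more of the schedule at high noise levels.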
Edit the config (src/configs/navsim_cfg.py or src/configs/physicalai_cfg.py) to set your paths and hyperparameters, then launch with the matching script:
| Benchmark | Config | Launch script |
|---|---|---|
| NavSim | src/configs/navsim_cfg.py | scripts/drivewam_navsim_train.sh |
| PhysicalAI | src/configs/physicalai_cfg.py | scripts/drivewam_physicalai_train.sh |
# NavSim
bash scripts/drivewam_navsim_train.sh
# PhysicalAI
bash scripts/drivewam_physicalai_train.sh
For PhysicalAI, we provide three CSV files listing clip IDs at different training data scales (4k / 20k / 100k clips), available on Hugging Face. Set the clip_csv field in src/configs/physicalai_cfg.py to the desired scale before training.
Evaluation
NavSim (PDM Score)
PDM score evaluation requires a metric cache: a set of per-scenario .pkl files that store the precomputed map and route information used to score each predicted trajectory. The precomputed metric cache for the navtest split is available for download on Hugging Face. To generate it yourself, run:
python navsim/planning/script/run_metric_caching.py \
train_test_split=navtest \
cache.cache_path=./data/navsim/metric_cache
This writes one metric_cache.pkl per scenario token under ./data/navsim/metric_cache/. Pass the resulting directory to the evaluation script via --metric-cache-path.
python -m src.navsim.eval \
--checkpoint-path /path/to/checkpoint \
--config-name navsim_cfg \
--dataset-path ./data/navsim/test \
--metric-cache-path ./data/navsim/metric_cache
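Before launching the evaluation above, a quick check that the metric cache directory is populated can save a failed run. This sketch relies only on the file name metric_cache.pkl stated above and searches recursively, since the exact nesting under the cache directory is not specified here. The helper name `count_metric_caches` is hypothetical.

```python
from pathlib import Path


def count_metric_caches(cache_root):
    """Count metric_cache.pkl files anywhere under `cache_root`."""
    root = Path(cache_root)
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.rglob("metric_cache.pkl"))
```

A result of 0 from `count_metric_caches("./data/navsim/metric_cache")` suggests the cache has not been downloaded or generated yet.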
PhysicalAI
python -m src.physicalai.eval \
--checkpoint-path /path/to/checkpoint \
--config-name physicalai_cfg
Citation
If you find DriveWAM useful, please cite:
@article{shi2026drivewam,
title={DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving},
author={Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
journal={arXiv preprint arXiv:2605.28544},
year={2026}
}
Acknowledgements
We gratefully acknowledge the following open-source projects that DriveWAM builds upon: Wan2.2, LingBot-VA, NavSim, NVIDIA PhysicalAI-Autonomous-Vehicles.
