---
license: cc-by-sa-4.0
base_model:
- Wan-AI/Wan2.2-TI2V-5B
language:
- en
pipeline_tag: image-text-to-video
---

# Spatia: Video Generation with Updatable Spatial Memory

Long-horizon, spatially consistent video generation enabled by persistent 3D scene point clouds and dynamic-static disentanglement.

Jinjing Zhao*1   Fangyun Wei*2   Zhening Liu3   Hongyang Zhang4   Chang Xu1   Yan Lu2

1The University of Sydney   2Microsoft Research   3HKUST   4University of Waterloo
*Equal Contribution


---

## 📖 Abstract

Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose **Spatia**, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This **dynamic-static disentanglement** design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities.

Furthermore, Spatia enables applications such as:

* **Explicit Camera Control**
* **3D-Aware Interactive Editing**
* **Long-horizon Scene Exploration**

Spatia Teaser
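
The generate-then-update loop described in the abstract can be illustrated with a toy sketch. This is not the Spatia implementation or its API: `SpatialMemory`, `generate_long_video`, `toy_generate_clip`, and `toy_estimate_static_points` are hypothetical names introduced here, the clip generator is a stub standing in for the video model, and the point estimator is a stub standing in for visual SLAM. The sketch only shows the control flow, assuming each clip is conditioned on the current point cloud and the point cloud is then extended with static points recovered from that clip.

```python
import math
from dataclasses import dataclass, field

Point = tuple[float, float, float]


@dataclass
class SpatialMemory:
    """Toy stand-in for a persistent 3D scene point cloud (hypothetical)."""
    voxel_size: float = 1.0
    points: dict = field(default_factory=dict)  # voxel index -> one point

    def update(self, new_points):
        """Merge new static points, keeping the first point seen per voxel."""
        for p in new_points:
            key = tuple(math.floor(c / self.voxel_size) for c in p)
            self.points.setdefault(key, p)


def generate_long_video(prompt, trajectory, clip_len,
                        generate_clip, estimate_static_points,
                        voxel_size=1.0):
    """Generate clips one at a time, updating the memory after each clip."""
    memory = SpatialMemory(voxel_size=voxel_size)
    video = []
    for start in range(0, len(trajectory), clip_len):
        poses = trajectory[start:start + clip_len]
        # Assumption: the generator sees the memory built from earlier clips.
        clip = generate_clip(prompt, poses, memory)
        video.extend(clip)
        # Assumption: only static scene points are written back to memory.
        memory.update(estimate_static_points(clip))
    return video, memory


def toy_generate_clip(prompt, poses, memory):
    """Stub generator: one 'frame' per pose, recording the memory size it saw."""
    return [{"pose": pose, "memory_size": len(memory.points)} for pose in poses]


def toy_estimate_static_points(clip):
    """Stub for visual SLAM: pretend each frame yields one point at its pose."""
    return [frame["pose"] for frame in clip]
```

For example, a trajectory that moves through six positions and then revisits the first two could be run as `generate_long_video("a street", trajectory, 4, toy_generate_clip, toy_estimate_static_points)`. In this toy setup the revisited positions fall into voxels that are already occupied, so the memory does not grow on the second pass, which mirrors the idea of reusing stored scene geometry when the camera returns to a place it has already seen.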

---

## Citation

If you find this project useful, please cite the paper.

```bibtex
@inproceedings{zhao2026spatia,
  title={Spatia: Video Generation with Updatable Spatial Memory},
  author={Zhao, Jinjing and Wei, Fangyun and Liu, Zhening and Zhang, Hongyang and Xu, Chang and Lu, Yan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```

---

© 2025 Spatia Project. Licensed under CC BY-SA 4.0.