Physics-Informed Driving World Models

Published: 02 Mar 2026 · Last Modified: 15 Apr 2026 · ICLR 2026 Workshop World Models · CC BY 4.0
Keywords: Video Generation
TL;DR: A physically grounded world model that generates multi-view driving videos with accurate motion, temporal consistency, and spatial relationships.
Abstract: Autonomous driving requires robust perception models trained on high-quality, large-scale multi-view driving videos for tasks such as 3D object detection, segmentation, and trajectory prediction. While world models offer a cost-effective way to generate realistic driving videos, challenges remain in ensuring that these videos adhere to fundamental physical principles: relative and absolute motion, spatial relationships such as occlusion and spatial consistency, and temporal consistency. To address these challenges, we propose **DrivePhysica**, a model designed to generate realistic multi-view driving videos that accurately adhere to essential physical principles through three key advancements: (1) a Coordinate System Aligner module that integrates relative and absolute motion features to enhance motion interpretation, (2) an Instance Flow Guidance module that ensures precise temporal consistency via efficient 3D flow extraction, and (3) a Box Coordinate Guidance module that improves spatial relationship understanding and accurately resolves occlusion hierarchies. Grounded in these physical principles, DrivePhysica achieves state-of-the-art performance in driving video generation quality and on downstream perception tasks.
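The abstract distinguishes relative motion (as seen from the moving ego vehicle) from absolute motion (in the world frame), the two cues the Coordinate System Aligner is said to integrate. The paper's actual module is not specified here; the sketch below is only a minimal illustration of that relative/absolute distinction, using a hypothetical 2D rigid-transform setup (`ego_to_world`, `relative_and_absolute_motion` are illustrative names, not the authors' API):

```python
import numpy as np

def ego_to_world(points_ego, ego_pos, ego_yaw):
    """Map 2D points from the ego (vehicle-relative) frame to the
    world (absolute) frame via a rigid transform."""
    c, s = np.cos(ego_yaw), np.sin(ego_yaw)
    R = np.array([[c, -s], [s, c]])  # yaw rotation matrix
    return points_ego @ R.T + ego_pos

def world_to_ego(p_world, ego_pos, ego_yaw):
    """Inverse transform: express a world-frame point in ego coordinates."""
    c, s = np.cos(ego_yaw), np.sin(ego_yaw)
    R = np.array([[c, -s], [s, c]])
    return (p_world - ego_pos) @ R  # row-vector form of R^T (p - pos)

def relative_and_absolute_motion(obj_t0, obj_t1, ego_pose_t0, ego_pose_t1):
    """Displacement of an object between two frames, measured both in the
    world frame (absolute) and in the ego frame (relative). A stationary
    object has zero absolute motion but nonzero relative motion whenever
    the ego vehicle moves."""
    absolute = obj_t1 - obj_t0
    rel0 = world_to_ego(obj_t0, *ego_pose_t0)
    rel1 = world_to_ego(obj_t1, *ego_pose_t1)
    return absolute, rel1 - rel0
```

For example, a parked car at world position (10, 0) while the ego drives 5 m forward has absolute displacement (0, 0) but relative displacement (-5, 0); a model conditioned only on one cue would misread this as either no motion at all or genuine object motion.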
Submission Number: 101