OmniDrive: Towards Unified Next-Gen Controllable Multi-View Driving Video Generation with LLM-Guided World Model
Keywords: Generative World Model, Autonomous Driving, Cross-View Consistency, Multi-View Video Generation
TL;DR: OmniDrive is a unified diffusion-based world model that achieves view-consistent, controllable multi-view driving video generation through joint compression, conditional modulation, and consistency-aware latent-space denoising.
Abstract: Recent diffusion-based world models can synthesize multi-camera driving videos, yet they still suffer from geometric drift between views, which degrades downstream perception, prediction, and planning. We introduce OmniDrive, the first unified model that jointly compresses, generates, and modulates all camera streams to deliver realistic, controllable, and view-consistent driving videos. A DiT backbone operates in a shared latent manifold obtained by multi-view variational compression; within this space, a consistency-aware denoiser injects correlated noise and aligns view-dependent coordinates at every diffusion step. Heterogeneous control signals—vehicle trajectory, ego pose, and scene semantics—are fused through lightweight latent modulation layers, steering generation without extra inference cost. By reasoning over a single, view-homogeneous token grid, OmniDrive preserves both spatial coherence and temporal fidelity. Experiments on the nuScenes and Waymo datasets show state-of-the-art view consistency and video quality, and the synthesized data significantly improves the performance of downstream perception models.
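The abstract names two mechanisms: correlated noise injected across views during denoising, and lightweight latent modulation of control signals. Below is a minimal sketch of how such components could look, assuming a shared/private noise decomposition and a FiLM-style scale/shift modulation; all function names, shapes, and the hyperparameter `rho` are illustrative assumptions, not the authors' implementation.

```python
import torch


def correlated_multiview_noise(batch, views, channels, h, w, rho=0.5):
    """Sample per-view Gaussian noise whose pairwise correlation across views is rho.

    Decomposition (assumption): eps_v = sqrt(rho) * eps_shared + sqrt(1 - rho) * eps_private_v,
    which keeps unit variance per element while coupling the views.
    """
    shared = torch.randn(batch, 1, channels, h, w)       # one noise field shared by all views
    private = torch.randn(batch, views, channels, h, w)  # independent noise per view
    return rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * private


class LatentModulation(torch.nn.Module):
    """FiLM-style scale/shift of multi-view latents from a fused control embedding
    (trajectory / ego pose / scene semantics) — a hypothetical stand-in for the
    paper's 'lightweight latent modulation layers'."""

    def __init__(self, control_dim, latent_channels):
        super().__init__()
        self.to_scale_shift = torch.nn.Linear(control_dim, 2 * latent_channels)

    def forward(self, latents, control):
        # latents: (B, V, C, H, W); control: (B, control_dim)
        scale, shift = self.to_scale_shift(control).chunk(2, dim=-1)
        scale = scale[:, None, :, None, None]
        shift = shift[:, None, :, None, None]
        return latents * (1.0 + scale) + shift


if __name__ == "__main__":
    B, V, C, H, W = 2, 6, 4, 28, 50        # e.g. 6 surround cameras on a VAE latent grid
    noise = correlated_multiview_noise(B, V, C, H, W, rho=0.5)
    control = torch.randn(B, 32)           # fused control embedding (dimension assumed)
    mod = LatentModulation(control_dim=32, latent_channels=C)
    latents = mod(torch.randn(B, V, C, H, W), control)
    print(noise.shape, latents.shape)      # both torch.Size([2, 6, 4, 28, 50])
```

Because the scale and shift are computed once per denoising step from a single embedding, this kind of conditioning adds only a linear projection per layer, consistent with the abstract's claim of steering generation without extra inference cost.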
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 19607