OmniDrive: Towards Unified Next-Gen Controllable Multi-View Driving Video Generation with LLM-Guided World Model
Keywords: Generative World Model, Autonomous Driving, Cross-View Consistency, Multi-View Video Generation
TL;DR: OmniDrive is a unified diffusion-based world model that achieves view-consistent, controllable multi-view driving video generation through joint compression, conditional modulation, and consistency-aware latent-space denoising.
Abstract: Recent diffusion-based world models can synthesize multi-camera driving videos, yet they still suffer from geometric drift between views, which degrades downstream perception, prediction, and planning. We introduce OmniDrive, the first unified model that jointly compresses, generates, and modulates all camera streams to deliver realistic, controllable, and view-consistent driving videos. A DiT backbone operates in a shared latent manifold obtained by multi-view variational compression; within this space, a consistency-aware denoiser injects correlated noise and aligns view-dependent coordinates at every diffusion step. Heterogeneous control signals—vehicle trajectory, ego pose, and scene semantics—are fused through lightweight latent modulation layers, steering generation without extra inference cost. By reasoning over a single, view-homogeneous token grid, OmniDrive preserves both spatial coherence and temporal fidelity. Experiments on the nuScenes and Waymo datasets show state-of-the-art view consistency and video quality, and the synthesized data significantly improves the performance of downstream perception models.
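The abstract names two mechanisms: correlated noise injected across views during denoising, and lightweight latent modulation of control signals. Below is a minimal sketch of how such components could look, assuming a shared/private noise decomposition and a FiLM-style scale/shift modulation; all function names, shapes, and the hyperparameter `rho` are illustrative assumptions, not the authors' implementation.

```python
import torch


def correlated_multiview_noise(batch, views, channels, h, w, rho=0.5):
    """Sample per-view Gaussian noise whose pairwise correlation across views is rho.

    Decomposition (assumption): eps_v = sqrt(rho) * eps_shared + sqrt(1 - rho) * eps_private_v,
    which keeps unit variance per element while coupling the views.
    """
    shared = torch.randn(batch, 1, channels, h, w)       # one noise field shared by all views
    private = torch.randn(batch, views, channels, h, w)  # independent noise per view
    return rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * private


class LatentModulation(torch.nn.Module):
    """FiLM-style scale/shift of multi-view latents from a fused control embedding
    (trajectory / ego pose / scene semantics) — a hypothetical stand-in for the
    paper's 'lightweight latent modulation layers'."""

    def __init__(self, control_dim, latent_channels):
        super().__init__()
        self.to_scale_shift = torch.nn.Linear(control_dim, 2 * latent_channels)

    def forward(self, latents, control):
        # latents: (B, V, C, H, W); control: (B, control_dim)
        scale, shift = self.to_scale_shift(control).chunk(2, dim=-1)
        scale = scale[:, None, :, None, None]
        shift = shift[:, None, :, None, None]
        return latents * (1.0 + scale) + shift


if __name__ == "__main__":
    B, V, C, H, W = 2, 6, 4, 28, 50        # e.g. 6 surround cameras on a VAE latent grid
    noise = correlated_multiview_noise(B, V, C, H, W, rho=0.5)
    control = torch.randn(B, 32)           # fused control embedding (dimension assumed)
    mod = LatentModulation(control_dim=32, latent_channels=C)
    latents = mod(torch.randn(B, V, C, H, W), control)
    print(noise.shape, latents.shape)      # both torch.Size([2, 6, 4, 28, 50])
```

Because the scale and shift are computed once per denoising step from a single embedding, this kind of conditioning adds only a linear projection per layer, consistent with the abstract's claim of steering generation without extra inference cost.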
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 19607