Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer

Junpeng Jiang; Gangyi Hong; Hengtong Hu; Lijun Zhou; Tianyi Yan; Yida Wang; Kun Zhan; Peng Jia; XianPeng Lang; Miao Zhang

Published: 06 Mar 2025, Last Modified: 14 Apr 2025

Venue: ICLR 2025 DeLTa Workshop Poster

License: CC BY 4.0

Track: long paper (up to 8 pages)

Keywords: Diffusion Model, Video Generation, Efficient Inference

Abstract: Collecting multi-view driving scenario videos to enhance the performance of 3D visual perception tasks is challenging and costly, making generative models for realistic data an appealing alternative. Yet the videos generated by recent works suffer from poor visual quality and weak temporal consistency, which restricts their effectiveness in advancing perception tasks under driving scenarios. This gap highlights the need for a more robust and versatile framework capable of generating high-fidelity and temporally consistent multi-view videos, tailored to the complexities of driving scenarios. We introduce DiVE, a framework based on the Diffusion Transformer (DiT), designed to generate videos that are both temporally and cross-view consistent and that align with bird's-eye view (BEV) layouts and textual descriptions. Specifically, DiVE leverages cross-attention and a SketchFormer to exert precise control over multimodal data, while incorporating a view-inflated attention mechanism that adds no extra parameters to enforce consistency across views. To address the computational cost of high-resolution video generation, we further propose a training-free sampling strategy for acceleration called Resolution Progressively Sampling, achieving a $1.62\times$ speedup without compromising generation quality. In summary, DiVE delivers multi-view videos with high visual quality and demonstrates state-of-the-art performance on the nuScenes dataset. Additionally, the efficient and robust generation capabilities of DiVE offer a promising avenue for helping 3D perception models achieve substantial performance improvements.

Submission Number: 89
