Scalable and Generalizable Autonomous Driving Scene Synthesis

ICLR 2026 Conference Submission 9179 Authors

17 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: autonomous driving, multi-view synthesis, BEV representation
TL;DR: We introduce BEV-VAE, a variational autoencoder that unifies multi-view images into a BEV representation for scalable and generalizable autonomous driving scene synthesis.
Abstract: Generative modeling has shown remarkable success in vision and language, inspiring research on synthesizing autonomous driving scenes. Existing multi-view synthesis approaches commonly operate in image latent spaces with cross-attention to enforce spatial consistency, but they are tightly bound to camera configurations, which limits dataset scalability and model generalization. We propose BEV-VAE, a variational autoencoder that unifies multi-view images into a compact bird’s-eye-view (BEV) representation, enabling encoding from arbitrary camera layouts and decoding to any desired viewpoint. Through multi-view image reconstruction and novel view synthesis, we show that BEV-VAE effectively fuses multi-view information and accurately models spatial structure. This capability allows it to generalize across camera configurations and facilitates scalable training on diverse datasets. Within the latent space of BEV-VAE, a Diffusion Transformer (DiT) generates BEV representations conditioned on 3D object layouts, enabling multi-view image synthesis with enhanced spatial consistency on nuScenes and achieving the first complete seven-view synthesis on AV2. Finally, synthesized imagery significantly improves the perception performance of BEVFormer, highlighting the utility of scalable and generalizable scene synthesis for autonomous driving.
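The abstract describes an encode-to-BEV / decode-to-arbitrary-view pipeline. The following is a minimal, hedged sketch of what such an interface could look like in PyTorch; the class name `BEVVAE`, the method signatures, the naive mean fusion, and all shapes are illustrative assumptions, not the authors' implementation (which would lift image features into BEV using camera geometry and pair the latent with a Diffusion Transformer conditioned on 3D object layouts).

```python
# Hypothetical sketch of a BEV-VAE-style interface (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BEVVAE(nn.Module):
    """Encodes N camera views into one BEV latent; decodes to any camera set."""

    def __init__(self, img_ch=3, bev_ch=32, bev_size=50):
        super().__init__()
        self.bev_size = bev_size
        # Per-view image encoder, shared across cameras.
        self.img_enc = nn.Conv2d(img_ch, bev_ch, kernel_size=8, stride=8)
        # Heads for the Gaussian posterior over the fused BEV latent.
        self.to_mu = nn.Conv2d(bev_ch, bev_ch, 1)
        self.to_logvar = nn.Conv2d(bev_ch, bev_ch, 1)
        # Decoder renders one view from the BEV latent plus a camera-pose embedding.
        self.cam_embed = nn.Linear(12, 16)  # flattened 3x4 extrinsics -> pose code
        self.view_dec = nn.ConvTranspose2d(bev_ch + 16, img_ch, kernel_size=8, stride=8)

    def encode(self, images, cams):
        # images: (B, N, 3, H, W); cams: (B, N, 12).
        # A real model would use `cams` to lift features into the BEV grid;
        # this stand-in ignores geometry and simply pools the per-view features.
        B, N = images.shape[:2]
        feats = self.img_enc(images.flatten(0, 1))               # (B*N, C, h, w)
        feats = feats.view(B, N, *feats.shape[1:]).mean(dim=1)   # naive multi-view fusion
        bev = F.adaptive_avg_pool2d(feats, self.bev_size)        # (B, C, bev, bev)
        mu, logvar = self.to_mu(bev), self.to_logvar(bev)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterization
        return z, mu, logvar

    def decode(self, z, cams):
        # Render one image per requested camera; a new layout is just a new pose set.
        B, N = cams.shape[:2]
        pose = self.cam_embed(cams)[..., None, None]             # (B, N, 16, 1, 1)
        pose = pose.expand(-1, -1, -1, *z.shape[-2:])
        z_rep = z.unsqueeze(1).expand(-1, N, -1, -1, -1)
        x = torch.cat([z_rep, pose], dim=2).flatten(0, 1)
        out = self.view_dec(x)
        return out.view(B, N, *out.shape[1:])


# Usage under the same assumptions: encode six views, decode seven camera poses,
# mirroring the cross-configuration generalization the abstract claims.
vae = BEVVAE()
imgs = torch.randn(2, 6, 3, 256, 256)
cams_in, cams_out = torch.randn(2, 6, 12), torch.randn(2, 7, 12)
z, mu, logvar = vae.encode(imgs, cams_in)
views = vae.decode(z, cams_out)  # (2, 7, 3, H', W')
```

In this reading, the generative stage would train a Diffusion Transformer in the space of `z`, conditioned on 3D object layouts, so that sampling a BEV latent and decoding it yields spatially consistent multi-view imagery.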
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9179