Keywords: autonomous driving, multi-view synthesis, BEV representation
TL;DR: We introduce BEV-VAE, a variational autoencoder that unifies multi-view images into a BEV representation for scalable and generalizable autonomous driving scene synthesis.
Abstract: Generative modeling has shown remarkable success in vision and language, inspiring research on synthesizing autonomous driving scenes.
Existing multi-view synthesis approaches commonly operate in image latent spaces and rely on cross-attention to enforce spatial consistency, but this ties them to a fixed camera configuration, limiting dataset scalability and model generalization.
We propose BEV-VAE, a variational autoencoder that unifies multi-view images into a compact bird’s-eye-view (BEV) representation, enabling encoding from arbitrary camera layouts and decoding to any desired viewpoint.
Through multi-view image reconstruction and novel view synthesis, we show that BEV-VAE effectively fuses multi-view information and accurately models spatial structure.
This capability allows it to generalize across camera configurations and facilitates scalable training on diverse datasets.
Within the latent space of BEV-VAE, a Diffusion Transformer (DiT) generates BEV representations conditioned on 3D object layouts. This enables multi-view image synthesis with enhanced spatial consistency on nuScenes and the first complete seven-view synthesis on AV2.
Finally, synthesized imagery significantly improves the perception performance of BEVFormer, highlighting the utility of scalable and generalizable scene synthesis for autonomous driving.
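The abstract centers on one interface: encode a variable set of posed camera views into a single BEV latent, then decode that latent to any requested viewpoint, with a layout-conditioned DiT generating samples in the same latent space. The sketch below is a minimal, hypothetical PyTorch illustration of that interface only; all module internals are placeholders, and names such as BEVVAESketch, pose_embed, and to_bev are illustrative, not the authors' architecture or code.

```python
import torch
import torch.nn as nn


class BEVVAESketch(nn.Module):
    """Toy sketch: encode posed views into one BEV latent, decode to any viewpoint."""

    def __init__(self, feat_dim=256, bev_channels=32, bev_size=16, out_hw=64):
        super().__init__()
        self.out_hw = out_hw
        bev_dim = bev_channels * bev_size * bev_size
        # Shared per-view image encoder (placeholder CNN, not the paper's backbone).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=8, stride=8),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Camera embedding from flattened 4x4 extrinsics (hypothetical choice).
        self.pose_embed = nn.Linear(16, feat_dim)
        # Fuse pooled view features into BEV posterior parameters (mu, logvar).
        self.to_bev = nn.Linear(feat_dim, 2 * bev_dim)
        # Render one view from the BEV latent plus a target camera pose.
        self.decoder = nn.Linear(bev_dim + feat_dim, 3 * out_hw * out_hw)

    def encode(self, images, extrinsics):
        # images: (B, N_views, 3, H, W); extrinsics: (B, N_views, 4, 4).
        B, N = images.shape[:2]
        feats = self.image_encoder(images.flatten(0, 1)).view(B, N, -1)
        feats = feats + self.pose_embed(extrinsics.reshape(B, N, 16))
        fused = feats.mean(dim=1)  # permutation-invariant, so N_views can vary
        mu, logvar = self.to_bev(fused).chunk(2, dim=-1)
        return mu, logvar

    def decode(self, bev_latent, target_extrinsics):
        # bev_latent: (B, bev_dim); target_extrinsics: (B, 4, 4).
        pose = self.pose_embed(target_extrinsics.reshape(-1, 16))
        rgb = self.decoder(torch.cat([bev_latent, pose], dim=-1))
        return rgb.view(-1, 3, self.out_hw, self.out_hw)

    def forward(self, images, extrinsics, target_extrinsics):
        mu, logvar = self.encode(images, extrinsics)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.decode(z, target_extrinsics), mu, logvar


# Usage: six input views, one novel target view. A layout-conditioned DiT
# (not shown) would generate z directly in this same BEV latent space.
model = BEVVAESketch()
imgs = torch.randn(2, 6, 3, 128, 128)
cams = torch.eye(4).expand(2, 6, 4, 4)
target = torch.eye(4).expand(2, 4, 4)
recon, mu, logvar = model(imgs, cams, target)  # recon: (2, 3, 64, 64)
```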
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9179