Keywords: View Transformation, Bird's-Eye-View Representation, Multi-View Representation, View-Agnostic
Abstract: Bird's-Eye-View (BEV) representations are essential for 3D perception in autonomous driving, providing unified and spatially coherent scene understanding. While attention-based methods achieve strong performance through global cross-view attention, they suffer from computational inefficiency caused by redundant referencing and from spatial ambiguity caused by ego-centric projections. To address these limitations, we introduce Mosaic View Transformation (MosaicVT), a modular framework that independently transforms each camera view before merging the views into a unified BEV space. MosaicVT employs a camera-centric polar coordinate system, which resolves directional ambiguity and reduces cross-view redundancy. A novel view-agnostic positional embedding enables a single transformation module to generalize across heterogeneous camera configurations without retraining. The transformed camera-centric representations are then aligned and fused into a global BEV using a geometry-aware interpolation strategy, significantly reducing computational overhead compared to global attention mechanisms. Experimental results on the nuScenes benchmark demonstrate that MosaicVT achieves state-of-the-art performance in 3D object detection and BEV semantic segmentation while substantially reducing latency and maintaining robust generalization across diverse camera setups.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11743