Efficient Multi-View 3D Representation via Fusion of View-Agnostic Transformations

Jaemyung Yu; Jiwan Hur; Dong-Jae Lee; Jaehoon Cho; Junmo Kim

18 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: View Transformation, Bird's-Eye-View Representation, Multi-View Representation, View-Agnostic

Abstract: Bird's-Eye View representations are essential for 3D perception in autonomous driving, providing unified and spatially coherent scene understanding. While attention-based methods achieve strong performance through global cross-view attention, they suffer from computational inefficiencies due to redundant referencing and spatial ambiguity from ego-centric projections. To address these limitations, we introduce Mosaic View Transformation (MosaicVT), a modular framework that independently transforms multi-camera views into a unified BEV space. MosaicVT employs a camera-centric polar coordinate system, effectively resolving directional ambiguity and reducing cross-view redundancy. A novel view-agnostic positional embedding enables a single transformation module to generalize across heterogeneous camera configurations without retraining. Transformed camera-centric representations are then aligned and fused into a global BEV using a geometry-aware interpolation strategy, significantly reducing computational overhead compared to global attention mechanisms. Experimental results on the nuScenes benchmark demonstrate that MosaicVT achieves state-of-the-art performance in 3D object detection and BEV semantic segmentation while providing substantial reductions in latency and maintaining robust generalization across diverse camera setups.
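The fusion step described in the abstract can be pictured with a small sketch. The abstract does not specify the interpolation scheme, so what follows is only an illustration under stated assumptions: each camera holds a 2D polar grid (range by azimuth) of scalar features, plain bilinear interpolation stands in for the geometry-aware strategy, and overlapping cameras are averaged. All names (`sample_polar`, `fuse_to_bev`, and the `cameras` dictionary keys) are hypothetical and do not come from the paper.

```python
import math


def sample_polar(polar, r_max, fov, cam_xy, cam_yaw, bev_xy):
    """Bilinearly sample one camera-centric polar grid at a global BEV point.

    polar[i][j] holds a scalar feature; i spans range over [0, r_max] and
    j spans azimuth over [-fov/2, fov/2] around the camera's optical axis.
    Returns None when the point lies outside the camera's coverage.
    (Assumed layout for illustration; requires at least a 2x2 grid.)
    """
    n_r, n_a = len(polar), len(polar[0])
    dx, dy = bev_xy[0] - cam_xy[0], bev_xy[1] - cam_xy[1]
    r = math.hypot(dx, dy)
    # Azimuth relative to the optical axis, wrapped to [-pi, pi).
    a = (math.atan2(dy, dx) - cam_yaw + math.pi) % (2 * math.pi) - math.pi
    if r > r_max or abs(a) > fov / 2:
        return None
    # Continuous grid indices, then the four surrounding nodes.
    fi = r / r_max * (n_r - 1)
    fj = (a + fov / 2) / fov * (n_a - 1)
    i0, j0 = min(int(fi), n_r - 2), min(int(fj), n_a - 2)
    ti, tj = fi - i0, fj - j0
    near = polar[i0][j0] * (1 - tj) + polar[i0][j0 + 1] * tj
    far = polar[i0 + 1][j0] * (1 - tj) + polar[i0 + 1][j0 + 1] * tj
    return near * (1 - ti) + far * ti


def fuse_to_bev(cameras, xs, ys):
    """Resample every camera onto a shared Cartesian grid and average overlaps.

    Averaging is an assumed fusion rule; cells seen by no camera are set to 0.0.
    """
    bev = []
    for y in ys:
        row = []
        for x in xs:
            vals = []
            for cam in cameras:
                v = sample_polar(cam["polar"], cam["r_max"], cam["fov"],
                                 cam["xy"], cam["yaw"], (x, y))
                if v is not None:
                    vals.append(v)
            row.append(sum(vals) / len(vals) if vals else 0.0)
        bev.append(row)
    return bev


# Toy usage: two cameras whose feature value simply equals the range in metres.
range_grid = [[float(i)] * 5 for i in range(11)]  # 11 range nodes, 5 azimuth nodes
cameras = [
    {"polar": range_grid, "r_max": 10.0, "fov": math.pi / 2, "xy": (0.0, 0.0), "yaw": 0.0},
    {"polar": range_grid, "r_max": 10.0, "fov": math.pi / 2, "xy": (0.0, 0.0), "yaw": math.pi},
]
bev = fuse_to_bev(cameras, xs=[-2.0, 2.0], ys=[0.0])
```

Because each camera is resampled independently, the cost of this sketch grows with the number of BEV cells times the number of cameras, with no attention across views; that is the kind of saving the abstract attributes to the method, though the actual mechanism may differ.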

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 11743
