Generalizable Multi-Camera 3D Object Detection from a Single Source via Fourier Cross-View Learning

Published: 01 May 2025, Last Modified: 18 Jun 2025. Venue: ICML 2025 poster. License: CC BY 4.0
Abstract: Improving the generalization of multi-camera 3D object detection is essential for safe autonomous driving in the real world. In this paper, we consider a realistic yet more challenging scenario: improving generalization when only single-source data is available for training, since gathering data from diverse domains and collecting annotations is time-consuming and labor-intensive. To this end, we propose the Fourier Cross-View Learning (FCVL) framework, which includes Fourier Hierarchical Augmentation (FHiAug), an augmentation strategy in the frequency domain that boosts domain diversity, and a Fourier Cross-View Semantic Consistency Loss that helps the model learn more domain-invariant features from adjacent perspectives. Furthermore, we provide theoretical guarantees via augmentation graph theory. To the best of our knowledge, this is the first study to explore generalizable multi-camera 3D object detection with a single source. Extensive experiments on various testing domains demonstrate that our approach achieves the best performance among various domain generalization methods.
Lay Summary: Enhancing the generalization ability of 3D object detection (its capability to adapt to unseen scenarios) is of great significance for the safety of autonomous driving systems, which must operate in dynamic environments where training data often fails to cover all potential scenarios (such as adverse weather conditions). In this paper, we consider a realistic yet more challenging scenario: improving generalization when only single-source data is available for training, since gathering data from diverse domains and collecting annotations is time-consuming and labor-intensive. To address this challenge, we propose the Fourier Cross-View Learning (FCVL) framework, which enhances model generalization with single-source data. FCVL consists of two components: Fourier Hierarchical Augmentation (FHiAug), an augmentation strategy in the frequency domain that boosts domain diversity, and a Fourier Cross-View Semantic Consistency Loss that helps the model learn more domain-invariant features from adjacent perspectives. With these two components, FCVL achieves robust multi-camera 3D detection generalization with minimal data. Theoretical analysis and real-world experiments demonstrate its superiority over traditional methods, enabling reliable object detection in unseen environments. FCVL reduces dependence on large amounts of labeled data and enhances safety for autonomous driving in challenging conditions.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Computer Vision
Keywords: Domain generalization, multi-camera 3D object detection, augmentation
Submission Number: 8727
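
Illustrative sketch: The abstract describes FHiAug only as an augmentation strategy in the frequency domain, without giving its details. The snippet below is therefore not FHiAug; it is a minimal, generic example of what frequency-domain augmentation can look like, assuming a common recipe in which the amplitude spectrum of an image is randomly rescaled while the phase spectrum is kept. The function name `fourier_amplitude_jitter` and the `strength` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def fourier_amplitude_jitter(image, strength=0.2, rng=None):
    """Randomly rescale the amplitude spectrum of an image, keeping its phase.

    Generic illustration only; not the FHiAug method from the paper.
    `image` is a float array in [0, 1] with shape (H, W) or (H, W, C).
    """
    rng = np.random.default_rng() if rng is None else rng
    img = np.asarray(image, dtype=np.float64)

    # 2D FFT over the spatial axes (applied per channel if C is present).
    spectrum = np.fft.fft2(img, axes=(0, 1))
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    # Multiplicative noise in [1 - strength, 1 + strength] on the amplitude.
    scale = 1.0 + strength * rng.uniform(-1.0, 1.0, size=amplitude.shape)
    perturbed = (amplitude * scale) * np.exp(1j * phase)

    # The random scale is not conjugate-symmetric, so the inverse transform
    # has a small imaginary part; only the real part is kept here.
    augmented = np.fft.ifft2(perturbed, axes=(0, 1)).real
    return np.clip(augmented, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    camera_view = rng.uniform(0.0, 1.0, size=(32, 32, 3))  # stand-in image
    augmented_view = fourier_amplitude_jitter(camera_view, strength=0.3, rng=rng)
    print(augmented_view.shape)
```

Keeping the phase while perturbing the amplitude is a frequently used design choice because the phase spectrum tends to carry most of the structural content of an image, while the amplitude is more closely tied to appearance. Whether FHiAug follows this choice is not stated in the abstract.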