STViT: Improving Self-supervised Multi-Camera Depth Estimation with Spatial-Temporal Context and Adversarial Geometry Regularization

TMLR Paper1876 Authors

28 Nov 2023 (modified: 31 Mar 2024), Rejected by TMLR
Abstract: Multi-camera depth estimation has recently garnered significant attention due to its practical implications in autonomous driving. While adapting monocular self-supervised methods to the multi-camera context has demonstrated promise, these techniques often overlook unique challenges specific to multi-camera setups, hindering the realization of their full potential. In this paper, we delve into the task of self-supervised multi-camera depth estimation and propose an innovative Transformer-based framework, STViT, featuring several noteworthy enhancements: 1) The Spatial-Temporal Transformer (STTrans) is designed to exploit local spatial connectivity and global context within image features, facilitating the learning of enriched spatial-temporal cross-view correlations and effectively recovering intricate 3D geometries. 2) To alleviate the adverse impact of varying illumination conditions on photometric loss calculation, we employ a spatial-temporal photometric consistency correction strategy (STPCC) to adjust image intensities and maintain brightness consistency across frames. 3) In recognition of the profound impact of adverse conditions such as rainy weather and nighttime driving on depth estimation, we propose an Adversarial Geometry Regularization (AGR) module based on Generative Adversarial Networks. The AGR provides additional spatial positional constraints on depth estimation by leveraging unpaired normal-condition depth maps, effectively preventing improper model training under adverse conditions. Our approach is extensively evaluated on large-scale autonomous driving datasets, including nuScenes and DDAD, demonstrating its superior performance, thus advancing the state-of-the-art in multi-camera self-supervised depth estimation.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
For Reviewer eFWp:
1. Add the required description of related works.
2. Add detailed descriptions of the Spatial-Temporal Transformer in Section 3.1.2.
3. Add more visualization comparisons in Figure 7.
For Reviewer Eejt:
1. Add the required error bars in Table 1 and Table 2 for evaluation results of our method and other state-of-the-art methods.
2. Add an ablation study for overlapping proportions in Section 4.5.3, Figure 8, and Table 9.
3. Add an experiment on model computational efficiency in Section 4.7 and Table 11.
4. Add more qualitative comparisons in Figure 9.
For Reviewer 6Sxt:
1. Add more details for the discriminator of AGR in Section 3.3.
Assigned Action Editor: ~Wei_Liu3
Submission Number: 1876