Distortion-Aware Self-Supervised Indoor 360° Depth Estimation via Hybrid Projection Fusion and Structural Regularities

Xu Wang, Weifeng Kong, Qiudan Zhang, You Yang, Tiesong Zhao, Jianmin Jiang

Published: 2024, Last Modified: 11 Nov 2024. IEEE Trans. Multim. 2024. License: CC BY-SA 4.0

Abstract: Owing to the rapid development of 360° panoramic imaging techniques, indoor 360° depth estimation has attracted extensive attention in the community. Because ground-truth depth data are scarce, there is a pressing need to model indoor 360° depth estimation in a self-supervised manner. However, self-supervised 360° depth estimation suffers from two major limitations. The first is the distortion and network training problems caused by equirectangular projection (ERP); the second is that texture-less regions are difficult to optimize through back-propagation in a self-supervised setting. To address these issues, we introduce spherical view synthesis for learning self-supervised 360° depth estimation. Specifically, to alleviate the ERP-related problems, we first propose a dual-branch distortion-aware network, comprising a distortion-aware module and a hybrid projection fusion module, to produce a coarse depth map. The coarse depth map is then used for spherical view synthesis, in which a spherically weighted loss function for view reconstruction and depth smoothing is investigated to address the projection distribution problem of 360° images. In addition, two structural regularities of indoor 360° scenes, namely the principal-direction normal constraint and the co-planar depth constraint, are devised as additional supervisory signals to efficiently optimize our self-supervised 360° depth estimation model. The principal-direction normal constraint is designed to align the normals of the 360° image with the directions of the vanishing points. Meanwhile, the co-planar depth constraint fits the estimated depth of each pixel to its 3D plane. Finally, a depth map is obtained for the 360° image. Experimental results show that the proposed method outperforms current advanced depth estimation methods on four publicly available datasets.
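
As a rough illustration of what a spherically weighted loss can look like, the sketch below weights each ERP pixel by the cosine of its latitude, which is proportional to the solid angle that pixel covers on the sphere. The abstract does not give the weighting actually used in the paper, so the cosine-latitude choice and the function names `erp_latitude_weights` and `spherically_weighted_l1` are assumptions made for this example only.

```python
import numpy as np

def erp_latitude_weights(height, width):
    """Per-pixel weights proportional to the solid angle of each ERP pixel.

    Assumption for illustration: weight = cos(latitude of the pixel centre).
    """
    # Pixel-centre latitudes, from near +pi/2 (top row) to near -pi/2 (bottom row).
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    row_weights = np.cos(lat)[:, None]            # shape (H, 1)
    return np.repeat(row_weights, width, axis=1)  # shape (H, W)

def spherically_weighted_l1(pred, target):
    """Weighted mean absolute error between two (H, W) or (H, W, C) ERP images."""
    h, w = pred.shape[:2]
    weights = erp_latitude_weights(h, w)
    err = np.abs(pred - target)
    if err.ndim == 3:
        # Average the error over colour channels before weighting.
        err = err.mean(axis=2)
    return float((weights * err).sum() / weights.sum())
```

With this weighting, an error near the poles, where ERP stretches a small area of the sphere over many pixels, contributes less to the loss than the same error near the equator.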