MS360: A Multi-Scale Feature Fusion Framework for 360 Monocular Depth Estimation

Published: 13 May 2024, Last Modified: 28 May 2024 · GI 2024 · CC BY 4.0
Letter Of Changes:
• Referred to references by model names or author names in the related work section
• Discussed the drawbacks of related methods in the subsections "ERP images" and "Cube map and ERP images" of Section 2.2
• Changed the Figure 2 caption to a single line
• Capitalized Equation, Figure, and Table when referring to them directly in the text
• Made the format of text labels in equations uniform
Keywords: 360, Depth, Monocular
Abstract: Panorama images are popular for comprehensive scene understanding due to their integrated field of view. To overcome the spherical distortions of the commonly used Equirectangular Projection (ERP) 360 image format, existing deep-learning-based 360 monocular depth estimation networks use distortion-free tangent patch images projected from the ERP to predict perspective depths, which are then merged into the final ERP depth map. These methods improve on previous approaches; however, they produce depth maps that are inconsistent and uneven, contain merging artifacts, and miss fine structural details because the learned local tangent patch features lack holistic contextual information. To address this problem, we propose a novel multi-scale 360 monocular depth estimation framework, MS360, which guides the local tangent perspective image features with coarse holistic image features. Specifically, our method first extracts coarse holistic features by feeding perspective tangent patches from a downsampled ERP image into a coarse UNet branch. Second, a fine branch network captures local geometric information from perspective tangent images of the high-resolution ERP. Furthermore, we present a Multi-Scale Feature Fusion (MSFF) bottleneck module that fuses the fine local features with the coarse holistic features and guides the former via an attention mechanism. Finally, the coarse decoder predicts a low-resolution depth map from the coarse features, and the fine decoder predicts the final high-resolution depth map from the coarse-guided fine features. Our method greatly reduces discrepancies and local patch-merging artifacts in the depth maps. Experiments on multiple real-world depth estimation benchmark datasets show that our network outperforms existing models both quantitatively and qualitatively while producing smooth, high-quality depth maps.
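To make the fusion idea in the abstract concrete, the sketch below shows one plausible way a MSFF-style bottleneck could guide fine tangent-patch features with coarse holistic features via cross-attention. This is a minimal PyTorch illustration; the module name, layer sizes, and the exact attention layout are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MSFFBottleneck(nn.Module):
    """Illustrative multi-scale feature fusion: fine local tokens attend to
    coarse holistic tokens so that global context guides local features.
    All dimensions and the residual/MLP layout are assumptions."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_fine = nn.LayerNorm(dim)
        self.norm_coarse = nn.LayerNorm(dim)
        # Cross-attention: queries come from the fine branch,
        # keys/values from the coarse branch.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, fine_tokens: torch.Tensor, coarse_tokens: torch.Tensor) -> torch.Tensor:
        # fine_tokens:   (B, N_fine, dim)   features from high-resolution tangent patches
        # coarse_tokens: (B, N_coarse, dim) features from downsampled ERP tangent patches
        q = self.norm_fine(fine_tokens)
        kv = self.norm_coarse(coarse_tokens)
        guided, _ = self.cross_attn(q, kv, kv)   # coarse context injected into fine tokens
        fused = fine_tokens + guided             # residual guidance
        fused = fused + self.mlp(fused)          # feed-forward refinement
        return fused


if __name__ == "__main__":
    msff = MSFFBottleneck(dim=256)
    fine = torch.randn(2, 64, 256)    # hypothetical fine-branch tokens
    coarse = torch.randn(2, 16, 256)  # hypothetical coarse-branch tokens
    print(msff(fine, coarse).shape)   # torch.Size([2, 64, 256])
```

Under this assumed layout, the fused tokens would feed the fine decoder for the high-resolution depth map, while the coarse tokens would go to the coarse decoder for the low-resolution prediction, matching the two-branch structure described in the abstract.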
Submission Number: 34