Abstract: Panorama images are widely used for comprehensive scene understanding due to their complete field of view. To overcome the spherical distortions of the commonly used Equirectangular Projection (ERP) 360 format, existing deep learning-based 360 monocular depth estimation networks project the ERP image onto distortion-free tangent patches, predict a perspective depth map for each patch, and merge these predictions into the final ERP depth map. These methods outperform earlier approaches; however, because the learned local tangent-patch features lack holistic contextual information, the resulting depth maps are inconsistent and uneven, contain merging artifacts, and miss fine structural details. To address this problem, we propose MS360, a novel multi-scale 360 monocular depth estimation framework that guides the local tangent perspective image features with coarse holistic image features. Specifically, our method first extracts coarse, comprehensive features by feeding perspective tangent patches projected from a downsampled ERP image into a coarse UNet branch. Second, a fine branch network captures local geometric information from perspective tangent patches projected from the high-resolution ERP image. Furthermore, we present a Multi-Scale Feature Fusion (MSFF) bottleneck module that fuses the two streams, guiding the fine local features with the coarse holistic features via an attention mechanism. Lastly, the coarse decoder predicts a low-resolution depth map from the coarse features, and the fine decoder predicts the final high-resolution depth map from the coarse-guided fine features. Our method greatly reduces depth discrepancies and local patch-merging artifacts. Experiments on multiple real-world depth estimation benchmark datasets show that our network outperforms existing models both quantitatively and qualitatively while producing smooth, high-quality depth maps.
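To make the coarse-to-fine guidance concrete, below is a minimal, hypothetical PyTorch sketch of an attention-based fusion block in the spirit of the MSFF module: fine tangent-patch tokens act as queries that attend to coarse holistic tokens, injecting global context into local features. The abstract does not specify the implementation, so the module name (`MSFFBlock`), token shapes, dimensions, and the particular cross-attention layout are all illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MSFFBlock(nn.Module):
    """Cross-attention fusion sketch (assumed design, not the paper's code):
    fine tangent-patch tokens (queries) attend to coarse holistic tokens
    (keys/values), so each local feature is guided by global ERP context."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine:   (B, N_fine, C)   tokens from high-resolution tangent patches
        # coarse: (B, N_coarse, C) tokens from the downsampled ERP branch
        q, kv = self.norm_q(fine), self.norm_kv(coarse)
        guided, _ = self.attn(q, kv, kv)   # coarse context guides each fine token
        fine = fine + guided               # residual fusion
        return fine + self.mlp(self.norm_mlp(fine))

# Toy usage: 18 tangent patches of 64 tokens each, one coarse token map.
fine = torch.randn(2, 18 * 64, 256)
coarse = torch.randn(2, 100, 256)
fused = MSFFBlock()(fine, coarse)          # -> torch.Size([2, 1152, 256])
```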