HRDFuse: Monocular 360$^\circ$ Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission
Keywords: 3D Computer Vision, Scene Analysis and Understanding, Depth distribution classification, Feature representation learning
TL;DR: This paper proposes a novel solution for monocular 360$^\circ$ depth estimation that predicts an ERP-format depth map by collaboratively learning holistic-with-regional information from the ERP image and its TP patches.
Abstract: Depth estimation from a monocular 360$^\circ$ image is a burgeoning problem, as a 360$^\circ$ image provides holistic sensing of a scene with a wide field of view. Recently, some methods, e.g., OmniFusion, have applied the tangent projection (TP) to represent a 360$^\circ$ image and predicted depth values via patch-wise regressions, which are merged to obtain a depth map in equirectangular projection (ERP) format. However, these methods suffer from 1) a non-trivial process of merging a large number of patches; and 2) less smooth and accurate depth results, caused by ignoring the holistic contextual information contained only in the ERP image and by directly regressing the depth value of each pixel. In this paper, we propose a novel framework, HRDFuse, that subtly combines the potential of convolutional neural networks (CNNs) and transformers by collaboratively learning the holistic contextual information from the ERP and the regional structural information from the TP. First, we propose a spatial feature alignment (SFA) module that learns feature similarities between the TP and ERP to aggregate the TP features into a complete ERP feature map in a pixel-wise manner. Second, we propose a collaborative depth distribution classification (CDDC) module that learns holistic-with-regional histograms capturing the ERP and TP depth distributions; the final depth values can then be predicted as a linear combination of histogram bin centers. Lastly, we adaptively combine the depth predictions from the two projections to obtain the final depth map. Extensive experiments on three benchmark datasets show that our method achieves smoother and more accurate depth results while surpassing the SOTA methods by a significant margin.
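The abstract's key prediction step, depth as a linear combination of histogram bin centers, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name, tensor shapes, and the softmax-over-bins formulation are assumptions, following the common distribution-classification recipe the paper describes.

```python
import numpy as np

def depth_from_histogram(logits, bin_centers):
    """Predict per-pixel depth as a linear combination of histogram
    bin centers, weighted by a softmax over per-pixel bin logits.

    logits:      (H, W, N) per-pixel scores over N depth bins
    bin_centers: (N,) depth value at the center of each bin
    (shapes and names are illustrative assumptions, not the paper's API)
    """
    # numerically stable softmax over the bin dimension
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # expected depth: probability-weighted sum of bin centers
    return (probs * bin_centers).sum(axis=-1)

# Toy example: a 2x2 "image" with 4 depth bins.
bin_centers = np.array([1.0, 2.0, 4.0, 8.0])
logits = np.zeros((2, 2, 4))
logits[0, 0, 3] = 10.0  # pixel (0, 0) strongly votes for the 8 m bin
depth = depth_from_histogram(logits, bin_centers)
# depth[0, 0] is close to 8.0; pixels with uniform logits
# fall back to the mean of the bin centers, 3.75.
```

Because the prediction is a soft expectation rather than a hard argmax over bins (or a direct per-pixel regression), it stays differentiable and tends to produce the smoother depth maps the abstract claims.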
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
Supplementary Material: zip