Abstract: To estimate depth maps from monocular videos in a self-supervised manner, existing methods simultaneously predict the pose changes between adjacent frames and the depth map of each frame, and then use them to reconstruct the forward or backward frames, thereby casting depth estimation as a frame reconstruction problem. The corresponding reconstruction loss, which serves as the key supervision signal for training the whole network, can adversely affect depth estimation accuracy if it is not properly designed. In this paper, we propose a novel self-supervised monocular depth estimation method for videos via adaptive reconstruction constraints, i.e., we design the loss functions by establishing more accurate reconstruction constraints. Specifically, we first propose a pose-adaptive reconstruction loss that adaptively selects the pose parameterization yielding the minimum reconstruction error, reducing the impact of inaccurate poses on frame reconstruction. We then propose a region-sensitive reconstruction loss that exploits a pretrained image reconstruction model to adaptively identify poorly reconstructed regions and characterize their deviations in feature space. Finally, we construct an additional multi-frame depth estimation network and design a reconstruction-guided bidirectional distillation loss that adaptively adjusts the direction of distillation between the multi-frame and monocular depth estimation networks according to their current reconstruction quality, encouraging them to learn from each other and benefiting the core task of monocular depth estimation. With the proposed losses, our method achieves superior performance compared with state-of-the-art methods on benchmark datasets.
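To make the two adaptive mechanisms named above concrete, the following is a minimal PyTorch sketch, not the paper's exact formulation: it assumes the candidate reconstructions warped under different pose parameterizations are already available, and the per-pixel minimum rule, the per-pixel distillation gating, and all function names are illustrative assumptions introduced here.

```python
import torch


def pose_adaptive_reconstruction_loss(target, reconstructions):
    """Per-pixel minimum photometric error over candidate reconstructions.

    target:          (B, 3, H, W) current frame.
    reconstructions: list of (B, 3, H, W) frames, each warped with a
                     different candidate pose parameterization.
    """
    # L1 photometric error for each candidate, averaged over channels.
    errors = torch.stack(
        [(target - rec).abs().mean(dim=1) for rec in reconstructions], dim=0
    )  # (K, B, H, W)
    # Keep, at every pixel, the candidate pose with the smallest error.
    min_error, _ = errors.min(dim=0)
    return min_error.mean()


def reconstruction_guided_distillation(depth_mono, depth_multi,
                                       err_mono, err_multi):
    """Distill, per pixel, from whichever branch reconstructs better.

    depth_mono / depth_multi: (B, 1, H, W) predicted depth maps.
    err_mono  / err_multi:    (B, 1, H, W) photometric errors of the
                              reconstructions driven by each branch.
    """
    multi_is_better = (err_multi < err_mono).float()
    # Detach the teacher side so gradients only pull the weaker
    # branch toward the stronger one at each pixel.
    to_mono = multi_is_better * (depth_mono - depth_multi.detach()).abs()
    to_multi = (1.0 - multi_is_better) * (depth_multi - depth_mono.detach()).abs()
    return (to_mono + to_multi).mean()
```

In this sketch the distillation direction is decided independently at every pixel from the current reconstruction errors, so the roles of teacher and student can swap within a single image; how the paper aggregates or smooths this decision is not specified in the abstract.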