Self-Supervised Monocular Depth Estimation with Effective Feature Fusion and Self Distillation

Zhenfei Liu, Chengqun Song, Jun Cheng, Jiefu Luo, Xiaoyang Wang

Published: 2024, Last Modified: 13 Feb 2025IROS 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Monocular depth estimation obtaining scene depth information from a single image is an important task in the field of computer vision. Constrained by the limitations of convolutional networks in conducting long-distance modeling and the underutilization of datasets, the generalization of existing models is not satisfactory. In this paper, we propose an adaptive backbone named Internal Fusion Transformer to improve generalization ability compared to convolutional backbone, like HRNet, and a Bilateral Attention module which pays more attention to low-level semantic features compared to previous fuse methods. Meanwhile, we introduce three data augmentation methods, namely cropping-resizing (cr), cropping-shuffling (cs), and mirroring (mi), for self distillation, as well as discuss their contributions to model performance improvement. Our model is trained on the KITTI dataset, and without fine-tuning, tested on NYUv2 and Make3D datasets to confirm the generalization. The experimental results illustrate the effectiveness of our design. Our model also demonstrates better performance compared to other models on the KITTI dataset.