Abstract: Self-supervised monocular depth estimation (MDE) has garnered significant attention in various fields, particularly in real-time applications such as autonomous driving and robotic navigation, where computational resources are limited. In these contexts, MDE must strike a balance between lightweight design and high accuracy. However, existing lightweight MDE methods usually sacrifice performance for efficiency. To address this challenge, we propose MG-Mono, a Multi-Granularity method for lightweight self-supervised MDE. Specifically, we propose a Multi-Granularity Information Fusion (MGIF) module within the encoder to comprehensively capture and integrate image features at pixel, local, and global granularities. In our MGIF, the global dependencies among pixels are modeled via the Fast Fourier Transform (FFT), which avoids quadratic complexity and improves the efficiency of our method. Furthermore, we introduce a Feature-weighted Consistency Loss that distills favorable semantic priors from a pre-trained semantic segmentation model to guide the feature generation of our method. This strategy enhances the feature representation without increasing inference time, leading to improved depth estimation accuracy. Finally, we propose an efficient Neighborhood-Weighted Cooperative (NWC) prediction head in the decoder that refines the depth map by leveraging local contextual depth information. Experiments on the KITTI, KITTI with Improved Ground Truth, and Make3D datasets demonstrate that MG-Mono achieves state-of-the-art performance while maintaining a low parameter count and high inference speed. Code is available at https://github.com/PENGFly2022/MGMono.git.
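The FFT-based global modeling mentioned above can be illustrated with a minimal sketch. This is not the paper's actual MGIF module; it assumes a GFNet-style design in which features are transformed to the frequency domain, multiplied element-wise by a learnable per-frequency filter (here initialized to an identity filter for illustration), and transformed back. Each output position then depends on every input position at O(N log N) cost rather than the O(N^2) cost of pairwise attention.

```python
import numpy as np

def fft_global_mixing(feat, freq_filter):
    """Globally mix spatial features via the FFT.

    feat:        (H, W, C) real-valued feature map.
    freq_filter: (H, W//2 + 1, C) per-frequency filter (learnable in a
                 real model; fixed here for illustration).
    """
    spec = np.fft.rfft2(feat, axes=(0, 1))          # spatial -> frequency domain
    spec = spec * freq_filter                        # element-wise frequency filtering
    return np.fft.irfft2(spec, s=feat.shape[:2], axes=(0, 1))  # back to spatial

# Toy usage: an all-ones (identity) filter reproduces the input exactly.
h, w, c = 8, 8, 4
rng = np.random.default_rng(0)
feat = rng.standard_normal((h, w, c))
identity_filter = np.ones((h, w // 2 + 1, c))
out = fft_global_mixing(feat, identity_filter)
```

Because the element-wise product in the frequency domain corresponds to a circular convolution over the full spatial extent, even this single multiplication couples every pixel to every other pixel, which is the efficiency argument behind FFT-based global mixing.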
External IDs: dblp:journals/pr/WangLLYW26