SPIdepth: Strengthened Pose Information for Self-Supervised Monocular Depth Estimation
Abstract: Self-supervised monocular depth estimation has garnered significant attention for its applications in autonomous driving and robotics. While recent methods have focused on improving depth networks, they often overlook the role of pose estimation, treating it as a secondary component. In this paper, we introduce SPIdepth, a novel approach that strengthens the pose network to improve depth estimation without increasing model complexity or inference cost. Building upon SQLdepth, SPIdepth replaces the small, randomly initialized PoseNet with a larger, pretrained one, leveraging representations learned from large-scale datasets. This stabilizes motion estimation during training and improves depth prediction at no additional inference-time cost. Furthermore, SPIdepth first pretrains the PoseNet for accurate image warping before jointly optimizing it with the depth network. Extensive experiments on KITTI, Cityscapes, and Make3D demonstrate that SPIdepth surpasses prior methods by significant margins. On KITTI, SPIdepth achieves the lowest AbsRel (0.029), SqRel (0.069), and RMSE (1.394), establishing a new state of the art. On Cityscapes, it improves upon SQLdepth by 21.7% in AbsRel, 36.8% in SqRel, and 16.5% in RMSE, even without motion masks. SPIdepth also outperforms all models in zero-shot evaluation on Make3D. Beyond traditional benchmarks, SPIdepth ranks first in the NTIRE 2025 HR Mono Depth Challenge, achieving 97.6% Delta 1.05 validation accuracy on transparent and mirror surfaces, underscoring its robustness to challenging non-Lambertian surfaces and its effectiveness in real-world depth estimation. Remarkably, SPIdepth uses only a single image at inference and still outperforms video-based methods, highlighting its practical efficiency and scalability for real-world deployment. Our findings highlight the importance of strengthened pose information in advancing self-supervised depth estimation.
The code and pre-trained models are available at https://github.com/Lavreniuk/SPIdepth.
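The image warping that the pose network supports rests on a standard geometric reprojection: target-view pixels are back-projected with the predicted depth, moved into the source frame by the predicted relative pose, and projected again with the camera intrinsics. The following is a minimal NumPy sketch of that reprojection under simplified assumptions (pinhole camera, dense depth, no occlusion handling); the function name `reproject` is ours and this is not the authors' implementation.

```python
import numpy as np

def reproject(depth, K, R, t):
    """Map each target-view pixel to its location in the source view.

    depth : (H, W) target-view depth map
    K     : (3, 3) camera intrinsics
    R, t  : rotation (3, 3) and translation (3,) of the source<-target pose
    Returns an (H, W, 2) array of (u, v) coordinates in the source image,
    which a warping step would then sample to synthesize the target view.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Back-project pixels to 3D points in the target camera frame.
    rays = np.linalg.inv(K) @ pix.reshape(-1, 3).T          # (3, H*W)
    cam = rays * depth.reshape(1, -1)

    # Transform into the source frame and project with the intrinsics.
    cam_src = R @ cam + t.reshape(3, 1)
    proj = K @ cam_src
    return (proj[:2] / proj[2:3]).T.reshape(H, W, 2)
```

With an identity pose the mapping is the identity on pixel coordinates; during training, the photometric difference between the warped source image and the target image provides the self-supervised loss that both the depth and pose networks are optimized against.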