Abstract: Traditional geometric methods estimate camera motion trajectories by analyzing image feature points or pixel information and perform robustly in many scenarios, but they struggle in low-texture and highly dynamic environments. In contrast, end-to-end neural networks can predict camera motion trajectories directly from consecutive image frames, eliminating the need for complex feature extraction and matching; however, their accuracy still lags significantly behind that of traditional methods. We present GD-MVO, a monocular visual odometry system that combines the strengths of both approaches. Our system trains a deep neural network for depth prediction on both monocular and stereo image pairs, then fuses the predicted depth with geometric features extracted from sequential images to achieve more accurate visual positioning. In this work, we propose 1) a novel integration of geometric methods and deep learning for robust visual odometry; 2) an effective strategy to mitigate the impact of dynamic feature points on prediction outcomes; and 3) a fly-out mask technique to resolve the fly-out padding issue in monocular depth prediction networks. Experimental results show that GD-MVO not only exhibits excellent stability in highly dynamic scenes but also runs reliably on our custom low-texture dataset. On the KITTI odometry benchmark, GD-MVO outperforms both DF-VO and TSformer-VO across all metrics; most notably, it achieves a translation error of only 1.778%, surpassing DF-VO (6.747%) and RAUM-VO (3.676%). Source code is available at https://github.com/lilili996/GD-MVO
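As a rough illustration of the hybrid strategy the abstract describes (geometric two-view pose estimation, with learned depth used to recover metric scale), the sketch below combines OpenCV feature matching with a depth map `depth1` assumed to come from a depth prediction network. It follows the general DF-VO-style scale-recovery idea and is not the authors' actual GD-MVO implementation; the function name and parameters are illustrative assumptions.

```python
# Minimal sketch: geometric two-view pose + learned depth for metric scale.
# NOT the GD-MVO implementation; `depth1` is a hypothetical network output.
import cv2
import numpy as np

def estimate_relative_pose(img1, img2, depth1, K):
    """Up-to-scale pose from feature geometry, scale from predicted depth."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:500]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # RANSAC on the essential matrix rejects many dynamic/outlier matches.
    E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)

    # Triangulate inliers to obtain up-to-scale depths in frame 1.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    inliers = pose_mask.ravel().astype(bool)
    pts4d = cv2.triangulatePoints(P1, P2, pts1[inliers].T, pts2[inliers].T)
    z_geo = pts4d[2] / pts4d[3]

    # Align the arbitrary geometric scale to the network's metric depth
    # with a robust median ratio, then rescale the translation.
    u, v = pts1[inliers].T.astype(int)
    z_net = depth1[v, u]
    valid = (z_geo > 0) & (z_net > 0)
    scale = np.median(z_net[valid] / z_geo[valid])
    return R, scale * t
```

Note that the essential-matrix RANSAC step already suppresses many moving-object correspondences, which is one common way to reduce the influence of dynamic feature points; the paper's specific mitigation strategy may differ.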