Abstract: In recent years, unsupervised deep learning approaches have received significant attention for estimating depth and visual odometry (VO) from unlabelled monocular image sequences. However, their performance is limited in challenging environments due to perceptual degradation, occlusions, and rapid motions. Moreover, existing unsupervised methods lack scale-consistency constraints across frames, so their VO estimators fail to provide persistent trajectories over long sequences. In this study, we propose an unsupervised monocular deep VO framework that predicts a six-degrees-of-freedom (6-DoF) camera pose and the depth map of the scene from unlabelled RGB image sequences. We provide detailed quantitative and qualitative evaluations of the proposed framework on a) a challenging dataset collected during the DARPA Subterranean Challenge; and b) the benchmark KITTI and Cityscapes datasets. The proposed approach significantly outperforms state-of-the-art unsupervised deep VO and depth prediction methods under perceptually degraded conditions, providing better results for both pose estimation and depth recovery. Furthermore, it achieves state-of-the-art results on most VO and depth metrics on the benchmark datasets. The presented approach is part of the solution used by the COSTAR team participating in the DARPA Subterranean Challenge.
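To make the scale-consistency constraint mentioned above concrete, the following is a minimal sketch (not the paper's exact formulation) of a geometry-consistency term in the style of SC-SfMLearner: the depth map predicted for frame t is warped into frame t+1 using the predicted 6-DoF relative pose, and its disagreement with the depth predicted directly for frame t+1 penalizes scale drift across frames. The function name and tensor shapes here are hypothetical.

```python
import torch

def scale_consistency_loss(d_warp: torch.Tensor, d_pred: torch.Tensor) -> torch.Tensor:
    """Penalize per-pixel depth disagreement between consecutive frames.

    d_warp: depth of frame t projected into frame t+1 via the predicted
            6-DoF relative pose (shape: B x 1 x H x W).
    d_pred: depth predicted directly for frame t+1 (same shape).

    The normalized difference keeps the loss in [0, 1) and invariant to
    the global depth scale, discouraging scale drift over long sequences.
    """
    diff = (d_warp - d_pred).abs() / (d_warp + d_pred).clamp(min=1e-7)
    return diff.mean()
```

In practice such a term is added, with a weight, to the usual photometric and smoothness losses of an unsupervised depth/VO objective.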