Abstract: We present D2VO, a novel monocular visual odometry system that combines deep learning with the direct method. The system reconstructs a dense depth map for each keyframe and tracks camera poses relative to these keyframes. By combining the direct method with deep learning, both tracking and mapping benefit from geometric measurements and semantic information. For each input frame, a feature pyramid is built and shared by the tracking and mapping processes. The depth map of each keyframe is estimated efficiently, from coarse to fine, by a subsequent multi-view hierarchical depth estimation network. We optimize the camera pose with bundle adjustment, minimizing the photometric error between the re-projected features of each frame and those of its reference keyframe. Experimental results on the TUM dataset demonstrate that our approach outperforms state-of-the-art methods in both tracking and mapping.
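
To make the photometric error term concrete, the following is a minimal sketch of how such a residual is commonly evaluated in direct methods. It is not the D2VO implementation. It assumes a pinhole camera with intrinsics `K`, a pose `(R, t)` mapping points from the keyframe camera to the current frame, raw grayscale intensities in place of learned feature maps, and nearest-neighbour sampling. The function name `photometric_residuals` and all parameter names are hypothetical.

```python
import numpy as np

def photometric_residuals(ref_img, cur_img, ref_depth, K, R, t):
    """Return residuals cur(warp(p)) - ref(p) and the mask of valid pixels.

    Assumptions (illustrative only): pinhole intrinsics K (3x3), R (3x3) and
    t (3,) map keyframe-camera points into the current camera frame, and
    images are 2D float arrays of identical shape.
    """
    h, w = ref_img.shape
    v, u = np.mgrid[0:h, 0:w]
    # 3xN homogeneous pixel coordinates of the keyframe
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=0)
    # Back-project every keyframe pixel to 3D using its depth
    pts_ref = (np.linalg.inv(K) @ pix) * ref_depth.ravel()
    # Transform the points into the current camera frame
    pts_cur = R @ pts_ref + t.reshape(3, 1)
    z = pts_cur[2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid division by zero
    # Project into the current image (nearest-neighbour sampling)
    proj = K @ pts_cur
    u2 = np.rint(proj[0] / safe_z).astype(int)
    v2 = np.rint(proj[1] / safe_z).astype(int)
    valid = in_front & (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    residuals = cur_img[v2[valid], u2[valid]] - ref_img.ravel()[valid]
    return residuals, valid.reshape(h, w)

def photometric_cost(residuals):
    """Sum-of-squares cost that a pose optimizer would minimize."""
    return 0.5 * float(np.sum(residuals ** 2))
```

A pose optimizer would evaluate this cost repeatedly while updating `R` and `t`, typically at each pyramid level from coarse to fine. The optimizer itself (for example Gauss-Newton over the pose) is omitted here.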