Abstract: In this work, a simple yet effective deep neural network is proposed to generate a dense depth map of the scene by exploiting both the sparse LiDAR point cloud and the monocular camera image. Specifically, a feature pyramid network is first employed to extract feature maps from images across time. The relative pose is then calculated by minimizing the feature distance between aligned pixels in inter-frame feature maps. Finally, the feature maps and the relative pose are used to compute a feature-metric loss for training the depth completion network. The key novelty of this work is a self-supervised mechanism that trains the depth completion network directly using visual-LiDAR odometry between consecutive frames. Comprehensive experiments and ablation studies on the KITTI benchmark demonstrate superior performance over state-of-the-art methods in terms of both pose estimation and depth completion. The detailed performance of the proposed approach (referred to as SelfCompDVLO) can be found on the KITTI depth completion benchmark. The source code, models, and data are available on GitHub.
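To make the self-supervised training signal concrete, below is a minimal sketch, in PyTorch, of how a feature-metric loss of this kind can be computed: pixels of frame t are back-projected with the completed depth, transformed by the estimated relative pose, re-projected into frame t+1, and the feature distance between the aligned pixels is penalized. All tensor names, shapes, and the function itself are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def feature_metric_loss(feat_t, feat_t1, depth, pose, K):
    """Hypothetical sketch of a feature-metric loss.

    feat_t, feat_t1: (B, C, H, W) feature maps of consecutive frames
    depth:           (B, 1, H, W) dense depth predicted for frame t
    pose:            (B, 4, 4) estimated relative pose T_{t->t+1}
    K:               (B, 3, 3) camera intrinsics
    """
    B, C, H, W = feat_t.shape
    device = feat_t.device

    # Pixel grid of frame t in homogeneous coordinates: (B, 3, H*W)
    v, u = torch.meshgrid(torch.arange(H, device=device),
                          torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float()
    pix = pix.reshape(3, -1).unsqueeze(0).expand(B, -1, -1)

    # Back-project to 3D, apply the relative pose, re-project into frame t+1
    cam = (torch.linalg.inv(K) @ pix) * depth.reshape(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam2 = (pose @ cam_h)[:, :3]
    proj = K @ cam2
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

    # Normalize coordinates to [-1, 1] and warp frame t+1 features back
    gx = 2.0 * uv[:, 0] / (W - 1) - 1.0
    gy = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(feat_t1, grid, align_corners=True)

    # Feature-metric residual between aligned pixels
    return (feat_t - warped).abs().mean()
```

Because the residual is measured in learned feature space rather than raw intensity, such a loss is typically more robust to illumination changes and textureless regions than a purely photometric objective, which is consistent with the motivation stated in the abstract.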