Abstract: In this paper, we present a Transformer-based framework for end-to-end monocular visual odometry. By concatenating the original optical flow with an intrinsics layer, we inject spatial scale information into the optical flow. In addition, we propose a shifted-window-based self-attention mechanism motivated by the motion distribution in the optical flow, which focuses the multi-head self-attention on localized regions. The shifted windows enable full use of both local and global motion information. Experiments demonstrate that our model outperforms existing learning-based monocular visual odometry methods on most test sequences of the KITTI and TartanAir datasets, and the results are competitive with geometry-based methods.
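To make the intrinsics-layer idea concrete, the sketch below shows one plausible way to build a per-pixel map from camera intrinsics and concatenate it with a 2-channel optical-flow tensor along the channel dimension. The function name `intrinsics_layer`, the specific normalization `(x - c) / f`, and the example intrinsics values are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def intrinsics_layer(fx, fy, cx, cy, height, width):
    # Assumed form: a 2-channel map of intrinsics-normalized pixel coordinates,
    # so each optical-flow pixel carries information about its spatial position.
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    u = (xs - cx) / fx
    v = (ys - cy) / fy
    return torch.stack([u, v], dim=0)  # shape: (2, H, W)

# Hypothetical usage: append the intrinsics layer to a 2-channel optical-flow map.
flow = torch.randn(2, 480, 640)                            # (dx, dy) per pixel
k_layer = intrinsics_layer(520.0, 520.0, 320.0, 240.0, 480, 640)
flow_with_intrinsics = torch.cat([flow, k_layer], dim=0)   # (4, H, W) network input
```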