Abstract: Correspondence learning aims to identify correct correspondences from an initial correspondence set and to estimate the camera pose between a pair of images. Transformer-based methods have recently made notable progress on this task owing to their powerful non-local information modeling capabilities. However, these methods tend to neglect local structures when aggregating features from all query-key pairs, resulting in computational inefficiency and inaccurate correspondence identification. To address this issue, we propose a novel Context-aware Local and Global interaction Transformer (CLGFormer), a lightweight Transformer-based module with dual branches that capture local and global context in the attention mechanism. CLGFormer exploits the relationship between the neighborhood consistency observed in correspondences and the context-aware weights produced by vanilla attention, and introduces an attention-style convolution operator. In addition, CLGFormer incorporates a cascaded operation that splits the full features into multiple subsets and feeds them to the attention heads, which both reduces computational cost and enhances attention diversity. Finally, we introduce a feature recombination operation with high jointness and a lightweight channel attention module. The culmination of our efforts is the Context-aware Local and Global interaction Network (CLG-Net), which accurately estimates camera pose and identifies inliers. Through rigorous experiments, we demonstrate that CLG-Net outperforms existing state-of-the-art methods while exhibiting robust generalization across various scenarios. Code will be available at https://github.com/guobaoxiao/CLG.
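
The abstract mentions splitting the full features into per-head subsets before attention and a lightweight channel attention module. The sketch below is a minimal, illustrative interpretation of those two ideas, not the authors' released implementation: the module names `SplitHeadAttention` and `ChannelAttention`, the head count, and the SE-style gating are assumptions made for the example.

```python
# Minimal sketch (assumed, not the official CLG-Net code) of two ideas from the
# abstract: per-head channel splitting in attention and lightweight channel attention.
import torch
import torch.nn as nn


class SplitHeadAttention(nn.Module):
    """Splits the feature channels into per-head subsets so each head attends
    over a narrower slice, then recombines the subsets (illustrative only)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) features of N putative correspondences
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (B, H, N, C/H)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = attn.softmax(dim=-1) @ v                 # (B, H, N, C/H)
        out = out.transpose(1, 2).reshape(B, N, C)     # recombine head subsets
        return self.proj(out)


class ChannelAttention(nn.Module):
    """Lightweight channel attention: pool over the point dimension, then gate
    channels with a small bottleneck MLP (SE-style gating assumed here)."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.gate(x.mean(dim=1, keepdim=True))     # (B, 1, C) channel weights
        return x * w


if __name__ == "__main__":
    feats = torch.randn(2, 2000, 128)                  # 2000 putative correspondences
    y = ChannelAttention(128)(SplitHeadAttention(128, num_heads=4)(feats))
    print(y.shape)                                     # torch.Size([2, 2000, 128])
```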