FMRT: Learning Accurate Feature Matching With Reconciliatory Transformer

Published: 01 Jan 2025 · Last Modified: 16 May 2025 · IEEE Trans. Autom. Sci. Eng., 2025 · CC BY-SA 4.0
Abstract: Local feature matching, a pivotal component of numerous computer vision tasks (e.g., structure from motion and visual localization), has been addressed effectively by Transformer-based methods. Nevertheless, these methods incorporate long-range context among keypoints only at a fixed receptive field, which prevents the network from appropriately reconciling the importance of features with diverse receptive fields to achieve complete image perception, thereby limiting feature matching accuracy. In addition, these methods rely on conventional handcrafted encodings to inject the positional information of keypoints into visual descriptors, which limits the network's ability to extract effective positional information. In this study, we propose FMRT, a novel detector-free method that adaptively reconciles local features with diverse receptive fields and uses parallel networks to achieve reliable positional encoding. Specifically, FMRT introduces a dedicated Reconciliatory Transformer (RecFormer) consisting of a global perception attention layer, which extracts visual descriptors with different receptive fields and integrates global context information at various scales; a perception weight layer, which adaptively measures the importance of the various receptive fields; and a local perception feed-forward network, which extracts a deep, aggregated multi-scale local feature representation. Moreover, we introduce a novel axis-wise position encoder (AWPE) that treats positional encoding as two keypoint-encoding tasks along the row and column dimensions: it decouples the x- and y-coordinates of keypoints into two independent 1D vectors and designs two parallel network branches to explicitly encode geometric correlations among keypoints, thereby achieving reliable positional encoding.
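The axis-wise decoupling behind AWPE can be illustrated with a minimal sketch. Note that the paper's AWPE uses two learned parallel network branches; here fixed sinusoidal maps stand in for those branches, and the function names (`axis_encoding`, `awpe`) and the `max_len` frequency base are illustrative assumptions, not the authors' implementation.

```python
import math

def axis_encoding(coord, dim, max_len=10000.0):
    """Encode a single 1D coordinate (one axis) with fixed sinusoids.
    Stand-in for one of AWPE's two learned parallel branches."""
    enc = []
    for i in range(dim // 2):
        freq = max_len ** (-2 * i / dim)  # geometric frequency schedule
        enc.append(math.sin(coord * freq))
        enc.append(math.cos(coord * freq))
    return enc

def awpe(keypoints, dim):
    """Axis-wise positional encoding sketch: split each keypoint (x, y)
    into two independent 1D coordinates, encode each axis separately,
    and concatenate the two half-dimensional vectors."""
    half = dim // 2
    return [axis_encoding(x, half) + axis_encoding(y, half)
            for (x, y) in keypoints]

# Example: encode two keypoints into 8-dimensional positional vectors.
codes = awpe([(3.0, 7.0), (0.0, 0.0)], dim=8)
```

The key design point this sketch preserves is that the x- and y-branches never mix: each half of the output vector depends on exactly one coordinate axis.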
Extensive experiments indicate that FMRT yields impressive performance on multiple tasks, including relative pose estimation, visual localization, homography estimation, and image matching. In addition, we integrate FMRT into a localization framework and conduct a visual localization experiment in a real scene, which further demonstrates the superiority of FMRT.
Note to Practitioners—This paper presents a novel approach to enhancing the performance of local feature matching in computer vision tasks. Traditional methods often rely on a fixed receptive field when integrating context among keypoints, which can limit perception of the complete image and, consequently, the precision of feature matching. Our work introduces a Reconciliatory Transformer that addresses this limitation by reconciling the importance of features across varying receptive fields, and that also improves the integration of positional information into visual descriptors. The techniques developed here can be adapted to a wide range of systems, e.g., image matching for computer vision and visual localization for autonomous driving, offering practitioners a tool to significantly improve the fidelity of feature matching, which is foundational for accurate interaction with the surrounding environment.
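The reconciliation idea, weighing features from several receptive fields adaptively rather than committing to one fixed field, can be sketched as a softmax-weighted fusion. This is a hypothetical simplification: in RecFormer the perception weight layer is a learned network that predicts the importance scores from the features themselves, whereas here the scores are passed in as given logits.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reconcile(scale_features, scale_scores):
    """Fuse per-scale feature vectors with adaptive weights.

    scale_features: list of S feature vectors, one per receptive field.
    scale_scores:   list of S importance logits (in the paper these would
                    come from a learned perception weight layer).
    Returns a single fused vector of the same dimension."""
    weights = softmax(scale_scores)
    dim = len(scale_features[0])
    return [sum(weights[s] * scale_features[s][d]
                for s in range(len(weights)))
            for d in range(dim)]

# Example: two receptive fields with equal importance logits
# contribute equally to the fused representation.
fused = reconcile([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With equal logits the fusion reduces to a plain average; unequal logits let the network emphasize whichever receptive field is most informative for the current image, which is the behavior the perception weight layer is designed to learn.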