D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-Based Transformers

Abstract: Establishing pixel-level matches between image pairs is vital for a variety of computer vision applications. However, achieving robust image matching remains challenging because CNN-extracted descriptors usually lack discriminative ability in texture-less regions, and keypoint detectors are only good at identifying keypoints with a specific level of structure. To deal with these issues, a novel image matching method is proposed by Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-based Transformers (D2Former), including a contextual feature descriptor learning (CFDL) module and a hierarchical keypoint detector learning (HKDL) module. The proposed D2Former enjoys several merits. First, the proposed CFDL module can model long-range contexts efficiently and effectively with the aid of designed descriptor agents. Second, the HKDL module can generate keypoint detectors in a hierarchical way, which is helpful for detecting keypoints with diverse levels of structures. Extensive experimental results on four challenging benchmarks show that our proposed method significantly outperforms state-of-the-art image matching methods.
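The abstract does not spell out how descriptor agents make long-range context modeling efficient. The following is a minimal sketch of one common reading, assuming a small set of M agent tokens that first gather context from all N pixel features and then broadcast it back, so the cost scales with N·M rather than N². The function `agent_attention`, the array shapes, and the omission of learned projections are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(features, agents):
    """Hypothetical two-step attention routed through agent tokens.

    features: (N, C) flattened pixel descriptors
    agents:   (M, C) agent tokens, with M much smaller than N
    Learned query/key/value projections are omitted for brevity (assumption).
    """
    scale = 1.0 / np.sqrt(features.shape[1])
    # Step 1: each agent aggregates context from all N features (M x N weights).
    agent_ctx = softmax(agents @ features.T * scale) @ features      # (M, C)
    # Step 2: each feature reads context back from the M agents (N x M weights).
    update = softmax(features @ agent_ctx.T * scale) @ agent_ctx     # (N, C)
    # Residual connection keeps the original local descriptor.
    return features + update

rng = np.random.default_rng(0)
features = rng.standard_normal((64 * 64, 32))  # 64x64 feature map, 32 channels
agents = rng.standard_normal((8, 32))          # 8 agent tokens
contextual = agent_attention(features, agents)
```

In this sketch the two attention maps have shapes (8, 4096) and (4096, 8), whereas full self-attention over the same feature map would need a (4096, 4096) map; that gap is the efficiency argument the sketch is meant to illustrate.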