Abstract: Two-view correspondence pruning aims to accurately remove incorrect correspondences (outliers) from the initial set. Graph Neural Networks (GNNs) combined with Multilayer Perceptrons (MLPs) are widely regarded as a powerful means of handling such sparse and unevenly distributed data.
However, the expressive capability of the correspondence features obtained by MLPs is limited by their inherent lack of context information.
In addition, previous works directly use the outputs of off-the-shelf GNNs, which conflates sparse correspondence attribute features with their global structural information.
To alleviate these issues, we propose TrGa, a two-view correspondence pruning network. Specifically, we first use complete Transformer structures instead of context-agnostic MLPs to capture correspondence features with global context information and stronger expressive capability. We then introduce the Concatenation Graph Node and Global Structure (CGNS) block to separately capture the interaction patterns among sparse correspondence attribute features and the global structural information among them, preventing their confusion. Finally, the proposed Feature Dimension Transformation and Enhancement (FDTE) block is applied for dimension transformation and feature augmentation. Additionally, we propose an efficient variant, C-TrGa, in which the similarity matrix of the proposed C-Transformer is computed along the channel dimension. Extensive experiments demonstrate that the proposed TrGa and C-TrGa outperform state-of-the-art methods on different computer vision tasks. The code is provided in the supplementary materials.
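As a rough illustration of the channel-dimension attention described for C-Transformer, the following is a minimal PyTorch sketch: it computes the similarity matrix over channels (C x C) instead of over correspondences (N x N). The module name, layer choices, and scaling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the paper's code) of attention whose similarity
# matrix is computed along the channel dimension, as described for C-Transformer.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) -- N sparse correspondences, C feature channels
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Similarity over channels: (B, C, C) instead of the usual (B, N, N),
        # so the cost scales with N * C^2 rather than N^2 * C.
        attn = torch.softmax(q.transpose(1, 2) @ k / x.shape[1] ** 0.5, dim=-1)
        out = v @ attn  # (B, N, C)
        return self.proj(out)

# Example: 2000 putative correspondences with 128-dim features (hypothetical sizes)
x = torch.randn(2, 2000, 128)
y = ChannelAttention(128)(x)  # (2, 2000, 128)
```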
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation, [Content] Vision and Language, [Experience] Multimedia Applications
Relevance To Conference: This work contributes to multimedia and multimodal processing by addressing the challenges of correspondence pruning within these domains. The proposed TrGa first stacks complete Transformer structures as the Correspondence Feature Extractor and then performs Global Graph Construction using the proposed CGNS and FDTE blocks. The introduced CGNS block separately captures the interaction patterns among sparse correspondence attribute features and the global structural information among sparse correspondences, preventing their confusion.
Additionally, a variant of the standard Transformer, termed C-Transformer, is proposed and integrated into C-TrGa, effectively reducing the theoretical time complexity and parameter size and making the model more suitable for real-world multimedia applications with resource constraints. The capability to handle sparse correspondences and global structural information enhances the model's robustness and generalization across diverse multimedia data types and modalities. Therefore, this work advances the state of the art in multimedia/multimodal processing by providing an efficient and effective solution for correspondence pruning, contributing to improved performance in applications such as content-based retrieval, image registration, and video analysis.
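As a rough, assumed illustration of the claimed complexity reduction (with N putative correspondences and C feature channels, where typically N >> C; the exact figures depend on the architecture):

```latex
% Standard self-attention over correspondences vs. channel-wise attention:
\mathcal{O}(N^{2}C) \quad \text{vs.} \quad \mathcal{O}(NC^{2}),
\qquad \text{e.g. } N = 2000,\; C = 128:\;
2000^{2}\cdot 128 \approx 5.1\times 10^{8}
\;\;\text{vs.}\;\;
2000\cdot 128^{2} \approx 3.3\times 10^{7}.
```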
Supplementary Material: zip
Submission Number: 2703