Abstract: Detecting and describing keypoints is crucial in visual applications, as it involves estimating stable locations and discriminative representations of local features. However, the coarse hard positive and negative labels generated from one-to-one correspondences between images can include indistinguishable samples, such as false positives and false negatives, which provide inconsistent supervision. This mixture of false and hard samples prevents neural networks from accurately learning descriptions for matching.
To address this challenge, we propose a method that learns transformation-predictive representations through self-supervised contrastive learning. Our approach maximizes the similarity between corresponding views of the same 3D point (landmark) without using negative sample pairs, thereby avoiding collapsed solutions. Additionally, we employ self-supervised generation learning and curriculum learning to convert hard positive labels into soft, continuous targets. Incorporating these aggressively updated soft labels helps the model overcome the training bottleneck caused by label noise from false positives and enables training under a stronger transformation paradigm.
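To make the negative-free objective with soft targets concrete, here is a minimal PyTorch sketch. It is illustrative only, not the paper's implementation: the function name, tensor shapes, the stop-gradient trick (in the spirit of BYOL/SimSiam, a common way to avoid collapse without negatives), and the regression-style loss are all assumptions.

```python
import torch
import torch.nn.functional as F

def soft_similarity_loss(desc_a: torch.Tensor,
                         desc_b: torch.Tensor,
                         soft_labels: torch.Tensor) -> torch.Tensor:
    """Negative-free descriptor loss with soft continuous targets.

    desc_a, desc_b: (N, D) descriptors of corresponding keypoints seen
        under two transformations of the same scene (hypothetical shapes).
    soft_labels: (N,) continuous targets in [0, 1] that replace hard 0/1
        positive labels and can be relaxed over training by a curriculum.
    """
    a = F.normalize(desc_a, dim=-1)
    # Stop-gradient on one branch is one common way to avoid the
    # collapsed (constant) solution without negative pairs.
    b = F.normalize(desc_b, dim=-1).detach()
    # Cosine similarity of each corresponding pair, mapped to [0, 1].
    sim = (a * b).sum(dim=-1).mul(0.5).add(0.5)
    # Regress similarity toward the soft target instead of a hard 1.
    return F.mse_loss(sim, soft_labels)

# Example: 512 keypoint pairs with 128-D descriptors; soft targets
# below 1.0 stand in for curriculum-softened positive labels.
desc_a, desc_b = torch.randn(512, 128), torch.randn(512, 128)
soft_labels = torch.full((512,), 0.9)
loss = soft_similarity_loss(desc_a, desc_b, soft_labels)
```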
Our self-supervised training pipeline significantly reduces computational load and memory usage while outperforming state-of-the-art methods on standard image-matching benchmarks, demonstrating excellent generalization across multiple downstream tasks.