Abstract: Cross-view geo-localization aims to estimate the geographic location of a ground image by retrieving its match from a geo-tagged aerial image database. Existing approaches focus on independent two-branch models that learn fine-grained representations of each perspective, neglecting the more discriminative representations that can be learned through interaction between perspectives. In this paper, we propose GeoSSK, which adapts the model's learning process by learning local semantic similarity information between aerial and ground pairs via a new interaction module, and then transfers the semantic similarity knowledge learned during interaction to a student model through knowledge distillation. Specifically, we design a Cross-fusion Interaction Module (CIM) based on cross-attention, which learns local semantic similarity information between perspectives to adjust the learning of the model. Because complex environments contain visual distractions, we adjust the degree of interaction between perspectives according to the Contribution Factor (CF) of the local representation to the global representation. In addition, we introduce Semantic Similarity Knowledge Distillation (SSKD) between teacher and student for cross-view geo-localization: the interaction learning model serves as the teacher and transfers its semantic similarity knowledge to the student. We also design an Incorrect Knowledge Filter (IKF) to filter out the teacher's incorrect knowledge. Experimental results demonstrate the effectiveness and competitive performance of GeoSSK.
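To make the distillation idea concrete, the following is a minimal sketch of similarity-based distillation with a filter on teacher errors. The abstract does not specify the SSKD loss or the IKF rule, so everything here is an assumption made for illustration: the function name `filtered_similarity_distillation`, the use of a temperature-scaled KL divergence over rows of a ground-to-aerial similarity matrix, and the filter that discards rows where the teacher's top-ranked aerial image is not the true match.

```python
import math


def log_softmax(row, temperature=1.0):
    """Numerically stable log-softmax of one row of similarity scores."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def filtered_similarity_distillation(teacher_sim, student_sim, temperature=1.0):
    """Mean KL(teacher || student) over rows the teacher ranks correctly.

    teacher_sim[i][j] and student_sim[i][j] are similarity scores between
    ground image i and aerial image j. ASSUMPTION: the true match of ground
    image i is aerial image i (the diagonal of the matrix).
    Returns (loss, number_of_rows_kept).
    """
    kept, total = 0, 0.0
    for i, (t_row, s_row) in enumerate(zip(teacher_sim, student_sim)):
        # Hypothetical filter rule (not taken from the paper): skip rows
        # where the teacher's highest-scoring aerial image is the wrong one.
        teacher_top = max(range(len(t_row)), key=t_row.__getitem__)
        if teacher_top != i:
            continue
        log_p = log_softmax(t_row, temperature)
        log_q = log_softmax(s_row, temperature)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
        kept += 1
    return (total / kept if kept else 0.0), kept


# Toy 3x3 example: the teacher ranks ground image 2 incorrectly (top match is 0).
teacher = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.7, 0.2, 0.1]]
student = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
loss, kept = filtered_similarity_distillation(teacher, student, temperature=0.5)
print(f"rows kept: {kept}, distillation loss: {loss:.4f}")
```

In this toy case the third row is dropped, so the student is only pulled toward the teacher's similarity distribution where the teacher is right. A real implementation would operate on batched tensors and might filter at a finer granularity than whole rows.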