Abstract: The goal of street-to-aerial cross-view image geo-localization is to determine the location of a query street-view image by retrieving the corresponding aerial-view image of the same place. The drastic viewpoint and appearance gap between aerial-view and street-view images poses a significant challenge to this task. In this paper, we propose a novel multiscale attention encoder to capture the multiscale contextual information of aerial/street-view images. To bridge the domain gap between the two views, we first apply an inverse polar transform so that street-view images are approximately aligned with aerial-view images. Then, the proposed multiscale attention encoder converts each image into a feature representation guided by the learnt multiscale information. Finally, we propose a novel global mining strategy that enables the network to pay more attention to hard negative exemplars. Experiments on standard benchmark datasets show that our approach obtains an 81.39% top-1 recall rate on the CVUSA dataset and 71.52% on the CVACT dataset, achieving state-of-the-art performance and significantly outperforming most existing methods.
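The inverse polar transform mentioned above can be illustrated with a minimal sketch: each pixel of a square, aerial-like target image is mapped to a panorama column via its polar angle and to a panorama row via its radius. This is an assumed, simplified nearest-neighbor mapping for illustration (function name and exact angle/radius conventions are hypothetical, not the paper's implementation):

```python
import numpy as np

def inverse_polar_transform(pano, out_size):
    """Warp an equirectangular street-view panorama (H_p x W_p [x C])
    into a square, aerial-like layout (out_size x out_size [x C]).

    For every target pixel, the polar angle around the image center
    selects a panorama column, and the distance from the center selects
    a panorama row (horizon at the center, ground near the border).
    Nearest-neighbor sampling only; a hedged sketch, not the paper's code.
    """
    h_p, w_p = pano.shape[:2]
    c = (out_size - 1) / 2.0  # center of the target image
    out = np.zeros((out_size, out_size) + pano.shape[2:], dtype=pano.dtype)
    for i in range(out_size):
        for j in range(out_size):
            dy, dx = i - c, j - c
            r = np.sqrt(dx * dx + dy * dy)
            # angle normalized to [0, 1] -> panorama column
            theta = (np.arctan2(dy, dx) + np.pi) / (2.0 * np.pi)
            src_x = int(theta * (w_p - 1))
            # radius normalized to [0, 1] -> panorama row
            src_y = int(min(r / c, 1.0) * (h_p - 1))
            out[i, j] = pano[src_y, src_x]
    return out
```

With this alignment, a CNN sees roughly corresponding spatial layouts in both views, which is what makes the subsequent attention-based feature matching tractable.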
External IDs: dblp:journals/caaitrit/LiTCY23