Beyond Image Scale: Geo-localization of Objects of Interest in Cross-View Images

Lv Bo, Le Wu, Yuanyuan Li, Yingying Zhu

Published: 20 Oct 2024, Last Modified: 15 Nov 2025 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Cross-View Object geo-Localization (CVOGL) holds significant potential for various applications, but drastic differences in viewpoint and visual appearance between cross-view images make this task extremely challenging. Existing methods often struggle to accurately capture the scene correspondence between images, resulting in suboptimal performance. To enhance the model's understanding of scene correspondence in cross-view images, we propose a novel method called BISGeo. BISGeo consists of three core components: SSP-based positional encoding, a Cross-View Shared-Weight Covariance Fusion (CCF) module, and a Global-Local Loss (GLL) function. Specifically, through an analysis of the imaging principles of street-view images, we develop a positional encoding method, named SSP, for describing the location of query points in street-view images. SSP encoding not only captures the position of the object within the image but also accounts for the positional relationships between pixels. The CCF module strengthens the representation of features common to the two views. Furthermore, we observe that street-view images typically represent a subset of the information in satellite images, so the similarity between regions depicting the same scene in street-view and satellite images should be greater than the similarity between non-corresponding regions. Based on this observation, we design the GLL function, which incorporates both the global correlation between the two views and the local region correlations. Extensive experiments demonstrate that BISGeo achieves state-of-the-art performance in cross-view object geo-localization tasks.
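
The abstract does not give the GLL formulation, so the following is only an illustrative sketch of the stated idea: a global term that correlates the two views as a whole, plus a local term that asks the matching region pair to be more similar than non-matching pairs. The cosine similarity, the hinge with a margin, and the names `global_local_loss`, `margin`, and `local_weight` are all assumptions made for illustration, not the authors' definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def global_local_loss(street_global, sat_global, street_region, sat_regions,
                      match_idx, margin=0.2, local_weight=1.0):
    """Hypothetical global + local objective (not the paper's GLL).

    street_global, sat_global: whole-image descriptors of the two views.
    street_region: descriptor of the street-view region of interest.
    sat_regions: descriptors of candidate satellite regions.
    match_idx: index of the satellite region showing the same scene.
    """
    # Global term: small when the two whole-image descriptors agree.
    global_term = 1.0 - cosine(street_global, sat_global)

    # Local term: the matching region should beat every non-matching
    # region by at least `margin` (assumed hinge formulation).
    pos = cosine(street_region, sat_regions[match_idx])
    negatives = [cosine(street_region, r)
                 for i, r in enumerate(sat_regions) if i != match_idx]
    local_term = sum(max(0.0, margin - pos + neg) for neg in negatives)
    local_term /= max(len(negatives), 1)

    return global_term + local_weight * local_term


# Toy descriptors: region 0 matches the street-view region, region 1 does not.
loss_correct = global_local_loss([1.0, 0.0], [1.0, 0.0],
                                 [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], match_idx=0)
loss_wrong = global_local_loss([1.0, 0.0], [1.0, 0.0],
                               [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], match_idx=1)
```

In this toy case the loss is zero when the labelled match really is the most similar region, and positive when the labelled match is less similar than a non-corresponding region, which is the ordering the abstract describes.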