MFRGN: Multi-scale Feature Representation Generalization Network For Ground-to-Aerial Geo-localization

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Cross-area evaluation, in which the training and testing data are captured from entirely distinct areas, poses a significant challenge for ground-to-aerial geo-localization (G2AGL). Current methods struggle in cross-area evaluation because they focus solely on learning global information from single-scale features. Some efforts alleviate this problem but rely on complex, task-specific techniques such as pre-processing and hard sample mining. To this end, we propose a pure end-to-end solution, free from task-specific techniques, termed the Multi-scale Feature Representation Generalization Network (MFRGN), to improve generalization. Specifically, we introduce multi-scale features and explicitly utilize them for G2AGL. Furthermore, we devise an efficient global-local information module with two flows to bolster feature representations. In the global flow, we present a lightweight Self and Cross Attention Module (SCAM) to efficiently learn global embeddings. In the local flow, we develop a Global-Prompt Attention Block (GPAB) that captures discriminative features using the global embeddings as prompts. As a result, our approach generates robust descriptors that represent multi-scale global and local information, thereby enhancing the model's invariance to scene variations. Extensive experiments on benchmarks show that our MFRGN achieves competitive performance in same-area evaluation and improves cross-area generalization by a significant margin compared to SOTA methods.
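The two-flow design described in the abstract can be illustrated with a minimal PyTorch sketch. This is a hypothetical reconstruction, not the authors' implementation: the module names SCAM and GPAB come from the abstract, but every dimension, layer choice, and wiring detail below is an assumption made for illustration.

```python
# Hypothetical sketch of the global-local module described in the abstract.
# All hyperparameters and the exact attention wiring are assumptions.
import torch
import torch.nn as nn

class SCAM(nn.Module):
    """Self and Cross Attention Module (global flow, assumed design).

    Applies self-attention within each view, then cross-attention between
    the ground and aerial token sequences, and pools to global embeddings.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, g_tokens, a_tokens):
        # Self-attention within each view (shared weights, an assumption).
        g = self.norm(g_tokens + self.self_attn(g_tokens, g_tokens, g_tokens)[0])
        a = self.norm(a_tokens + self.self_attn(a_tokens, a_tokens, a_tokens)[0])
        # Cross-attention between views.
        g = self.norm(g + self.cross_attn(g, a, a)[0])  # ground attends to aerial
        a = self.norm(a + self.cross_attn(a, g, g)[0])  # aerial attends to ground
        return g.mean(dim=1), a.mean(dim=1)             # pooled global embeddings

class GPAB(nn.Module):
    """Global-Prompt Attention Block (local flow, assumed design).

    Uses the global embedding as a one-token query ("prompt") over the
    local tokens to select discriminative local features.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_emb, tokens):
        prompt = global_emb.unsqueeze(1)          # (B, 1, dim) query
        local, _ = self.attn(prompt, tokens, tokens)
        return self.norm(local.squeeze(1))        # (B, dim) local descriptor

# Toy usage: tokens from the two views at one scale -> fused descriptors.
if __name__ == "__main__":
    B, N, D = 2, 196, 256
    g_tokens = torch.randn(B, N, D)   # ground-view tokens at one scale
    a_tokens = torch.randn(B, N, D)   # aerial-view tokens at one scale
    scam, gpab = SCAM(D), GPAB(D)
    g_glob, a_glob = scam(g_tokens, a_tokens)
    g_desc = torch.cat([g_glob, gpab(g_glob, g_tokens)], dim=-1)
    a_desc = torch.cat([a_glob, gpab(a_glob, a_tokens)], dim=-1)
    print(g_desc.shape)               # torch.Size([2, 512])
```

In this reading, SCAM exchanges information between the ground and aerial views before pooling, and GPAB concatenates a prompt-attended local descriptor onto the global embedding; the paper may wire these components differently, and would repeat the module per feature scale to form the multi-scale descriptor.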
Primary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This work estimates the geographic location of ground-view photos collected from social multimedia and vehicular cameras by matching them against GPS-tagged aerial-view images, a kind of multimedia data captured from the sky. It therefore belongs to multimedia applications and also concerns innovative computing methods for multimedia data processing. Our method facilitates accurate and fast retrieval in multimedia databases, even under challenging scenarios, enabling users to search images not only by textual or visual content but also by geographical context. Our work involves jointly learning and integrating representative information from images acquired from distinct perspectives and acquisition devices (some researchers regard such images as multimodal data), which helps to efficiently process and fuse various types of multimedia/multimodal data. This work is also highly relevant to drone-to-satellite geo-localization, where multimedia data from unmanned aerial vehicles (UAVs) plays a crucial role in many multimedia applications. Moreover, our approach can readily extend to other multimedia tasks built on cross-view image geo-localization, such as aerial-view imaging and action recognition, which have been thoroughly explored in the ACM Multimedia Workshop on UAVs in Multimedia (UAVM 2023).
Submission Number: 3943