Keywords: Contrastive Learning, Geo-Localization
Abstract: Global image geo-localization aims to predict the precise geographic location of a photo taken anywhere on Earth from a single image. This task is highly challenging yet widely applicable, especially in GPS-denied scenarios such as robotic navigation, post-disaster rescue, and open-world understanding. Existing methods often overlook the geographic information embedded in the language modality, making it difficult to resolve visual ambiguity and handle the heterogeneous global image distribution. To address these issues, we propose a unified image–text–GPS tri-modal contrastive learning framework to enhance the robustness and accuracy of global geo-localization.
We first construct a high-quality tri-modal annotation pipeline that integrates semantic segmentation, vision-language generation, and a referee mechanism to automatically produce image-level and region-level descriptions. Geographic labels such as city and country names are also introduced as textual supplements. We then design a unified projection space in which images, text, and GPS coordinates are embedded as comparable representations. A dual-level contrastive learning strategy at both global and regional scales strengthens semantic–spatial alignment across modalities. In addition, we introduce a hierarchical consistency loss and a dynamic hard negative mining strategy to further enhance representational discrimination and structural stability.
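To make the tri-modal alignment concrete, the following is a minimal sketch (not the authors' released code) of the kind of objective the abstract describes: image, text, and GPS embeddings projected into one shared space and aligned with pairwise InfoNCE losses. All names and hyperparameters here (e.g., `info_nce`, `temperature=0.07`, the encoder choices in the comments) are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matched pairs sit on the diagonal; average both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def tri_modal_loss(z_img: torch.Tensor, z_txt: torch.Tensor, z_gps: torch.Tensor) -> torch.Tensor:
    """Sum of pairwise alignment losses over the three modalities."""
    return (info_nce(z_img, z_txt) +
            info_nce(z_img, z_gps) +
            info_nce(z_txt, z_gps))

# Usage: each modality is projected to a shared dimension, then aligned.
B, d = 32, 256
z_img = torch.randn(B, d)   # image encoder output (hypothetical, e.g., a ViT head)
z_txt = torch.randn(B, d)   # text encoder output (captions plus geo labels)
z_gps = torch.randn(B, d)   # GPS encoder output (e.g., features of lat/lon)
loss = tri_modal_loss(z_img, z_txt, z_gps)
```

The paper's dual-level variant would apply such a loss at both global (whole-image) and regional (segment-level) scales; this sketch shows only the global term.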
Experimental results demonstrate that our method surpasses existing state-of-the-art approaches on multiple public geo-localization benchmarks, including Im2GPS3k, GWS15k, and YFCC26k, validating the effectiveness and generality of tri-modal alignment for global image geo-localization.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3598