Keywords: Remote Sensing, Cross-View Geo-Localization, Multimodal Large Language Model
TL;DR: This work presents GLEAM-C and GLEAM-X, a unified pipeline that advances cross-view geo-localization by integrating multi-view alignment with explainable reasoning.
Abstract: Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: these models only determine whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities—including UAV imagery, street maps, panoramic views, and ground photographs—by aligning them exclusively with satellite imagery. Our framework improves training efficiency through an optimized implementation while achieving accuracy comparable to prior modality-specific CVGL models via a two-phase training strategy. Moreover, to address the lack of interpretability in traditional CVGL methods, we leverage the reasoning capabilities of multimodal large language models (MLLMs) to propose a new task, GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning. To support this task, we construct a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro to generate training and testing data. The test set is further refined through detailed human revision, enabling systematic evaluation of explainable cross-view reasoning and advancing transparency and scalability in geo-localization. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing **G**eo-**L**ocalization by enabling models to better **E**xplain **A**nd **M**atch. The code and datasets used in this work will be made publicly available.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 39