Towards Interactive Global Geolocation Assistant

10 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY 4.0
Keywords: Interactive geolocation, multimodal large language model, chain of thought
Abstract: Global geolocation, the task of predicting the precise location of street-view images, is crucial for applications such as security surveillance. Existing retrieval- and classification-based methods, as well as current Multimodal Large Language Models (MLLMs), suffer from limitations such as database dependence, lack of interpretability, and a significant gap in geographic knowledge due to insufficient datasets. To address these issues, we introduce MG-Geo, the first comprehensive, high-quality multimodal Global Geolocation dataset. Comprising five million instances of geographic dialogue data spanning 210 countries, MG-Geo provides detailed geographic element cues (e.g., road markings, vegetation, language), significantly surpassing existing datasets such as OSV-5M and Google Landmarks V2 in richness and granularity. Leveraging MG-Geo, we develop GaGA (Global Geolocation Assistant), a novel MLLM designed specifically for geolocation. Experimental results demonstrate that GaGA not only significantly outperforms existing MLLMs but also surpasses the state-of-the-art OSV-5M baseline in administrative boundary prediction (improvements of 4.57% at the country level and 2.92% at the city level). Furthermore, GaGA exhibits remarkable interactive refinement capabilities, improving localization accuracy under effective user guidance. This work highlights the critical role of the MG-Geo dataset in fostering improved geographic understanding in MLLMs. Our dataset is accessible at: https://huggingface.co/datasets/kendouvg/MG-Geo.
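The administrative-boundary metric reported in the abstract (accuracy at the country and city levels) can be sketched as a simple label-matching score. The field names `"country"` and `"city"` below are illustrative assumptions, not MG-Geo's actual schema:

```python
def boundary_accuracy(predictions, ground_truth, level="country"):
    """Fraction of images whose predicted administrative label at the
    given level (e.g. "country" or "city") matches the ground truth.

    A minimal sketch of the metric; the dict keys are assumed, not
    taken from the MG-Geo dataset specification.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("prediction/ground-truth length mismatch")
    hits = sum(
        pred.get(level) == truth.get(level)
        for pred, truth in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)


# Hypothetical usage with two street-view samples:
preds = [{"country": "FR", "city": "Paris"}, {"country": "FR", "city": "Lyon"}]
truth = [{"country": "FR", "city": "Paris"}, {"country": "DE", "city": "Berlin"}]
print(boundary_accuracy(preds, truth, "country"))  # 0.5
print(boundary_accuracy(preds, truth, "city"))     # 0.5
```

Reported improvements (e.g., +4.57% at the country level) are differences in this score between models on the same evaluation split.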
Croissant File: json
Dataset URL: https://huggingface.co/datasets/kendouvg/MG-Geo
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1153