Keywords: Image Localization; Fine-grained Geolocation; Multimodal Reasoning; Large Vision-Language Models; Multi-stage Fine-tuning
TL;DR: We propose Clue2Geo, a cue-driven framework for global image geolocation that leverages LVLMs, structured cluemaps, and multi-stage fine-tuning to achieve high-precision, street-level localization, clearly outperforming existing methods on fine-grained metrics.
Abstract: Global image geolocation aims to identify the location where an image was captured, but achieving precise and robust localization remains challenging. To improve geolocation precision, we present Clue2Geo, a cue-driven framework for global image geolocation powered by a Large Vision-Language Model (LVLM) for coordinate reasoning. First, an LVLM is employed to extract diverse geographic cues from images, after which the reliability and contribution of these cues are assessed by computing their local consistency and semantic coherence. Based on these assessments, a cue graph named a "cluemap" is constructed and used as an auxiliary input during both model fine-tuning and inference. Subsequently, we build a large-scale street-view dataset annotated with coordinates and cluemaps to support a three-stage progressive fine-tuning strategy, which strengthens the downstream model's reasoning capabilities on fine-grained localization tasks. Finally, a post-processing refinement based on Retrieval-Augmented Generation (RAG) over a GPS database is applied after reasoning to reduce the offset of the predicted coordinates, improving both accuracy and stability. Extensive experiments demonstrate that Clue2Geo achieves state-of-the-art performance on fine-grained metrics, particularly at the street level.
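The abstract's final refinement step retrieves entries from a GPS database to correct the model's predicted coordinates. The paper's exact RAG procedure is not specified here; a minimal sketch of one plausible variant, snapping a prediction to the nearest database coordinate within a (hypothetical) distance threshold, might look like:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def refine_prediction(pred, gps_db, max_km=1.0):
    """Snap a predicted coordinate to the nearest GPS database entry.

    If the nearest known point lies within `max_km`, return it; otherwise
    keep the raw prediction. (`max_km` is an illustrative parameter, not
    taken from the paper.)
    """
    nearest = min(gps_db, key=lambda p: haversine_km(pred, p))
    return nearest if haversine_km(pred, nearest) <= max_km else pred

# Toy database of known landmark coordinates.
db = [(48.8584, 2.2945), (48.8606, 2.3376), (40.7484, -73.9857)]
print(refine_prediction((48.8580, 2.2950), db))  # snaps to the nearby entry
print(refine_prediction((0.0, 0.0), db))         # no entry nearby: unchanged
```

A real pipeline would index the database with a spatial structure (e.g. a k-d tree) and retrieve candidates conditioned on the cluemap rather than coordinates alone; the sketch only illustrates the offset-reduction idea.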
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10961