Where in the World? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales

ICLR 2026 Conference Submission 13205 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Benchmark, Image Geolocation, VLM
Abstract: Vision–language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a challenging task with real-world demand, has not been comprehensively evaluated. We present WhereBench, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use. WhereBench comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (500 multiple-choice questions over panoramas with country-level answers) and WhereStreet (310 fine-grained street-level identification tasks requiring multi-step reasoning with optional web search). For evaluation, we adopt final-prediction metrics: location accuracy within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human-verified key visual clues and a Shapley-reweighted thinking score that attributes credit according to each clue's marginal contribution. We benchmark 12 state-of-the-art VLMs, equipped with web-search tools, on WhereBench and report several types of final-answer accuracy as well as calibrated thinking scores. We find that web search and reasoning do not guarantee improved performance when visual clues are limited, and that models exhibit regional biases, scoring up to 42.7% higher in some regions than in others. These findings highlight not only the promise but also the persistent challenges models face in mitigating bias and achieving robust, fine-grained localization.
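For concreteness, the sketch below illustrates the two kinds of scoring the abstract mentions: Acc@k over predicted coordinates (assuming great-circle haversine distance) and an exact Shapley attribution over a small set of visual clues (with a toy coalition value function). This is a minimal, hypothetical illustration; the function names, thresholds, and value function are assumptions and not the paper's actual implementation.

```python
import math
from itertools import combinations

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def acc_at_k(preds, targets, k_km):
    """Fraction of predicted coordinates within k_km of the ground truth."""
    hits = sum(
        haversine_km(pla, plo, tla, tlo) <= k_km
        for (pla, plo), (tla, tlo) in zip(preds, targets)
    )
    return hits / len(preds)

def shapley_values(clues, value_fn):
    """Exact Shapley value of each clue; value_fn maps a set of clues to a score."""
    clues = list(clues)
    n = len(clues)
    phi = {c: 0.0 for c in clues}
    for c in clues:
        others = [x for x in clues if x != c]
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                s = frozenset(subset)
                weight = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[c] += weight * (value_fn(s | {c}) - value_fn(s))
    return phi

# Example Acc@k: one prediction ~0.5 km off, one several hundred km off.
preds = [(48.858, 2.295), (40.0, -3.7)]
targets = [(48.8584, 2.2945), (41.39, 2.17)]
print(acc_at_k(preds, targets, k_km=25))   # 0.5
print(acc_at_k(preds, targets, k_km=750))  # 1.0

# Example Shapley attribution with a toy value function (hypothetical clues):
# coalition value = fraction of key clues the reasoning chain mentions.
key_clues = {"license plate", "road signage", "driving side"}
toy_value = lambda coalition: len(coalition & key_clues) / len(key_clues)
print(shapley_values(key_clues, toy_value))  # each clue receives 1/3 of the credit
```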
Primary Area: datasets and benchmarks
Submission Number: 13205