Keywords: geo-temporal reasoning, vision–language models (VLMs), image geo-localization, spatial reasoning, temporal understanding, benchmark datasets
Abstract: Geo-temporal understanding, the ability to identify the location, time, and contextual features of an image from visual cues alone, is a fundamental aspect of human intelligence with wide-ranging applications, from disaster response to navigation and geography education. While recent vision–language models (VLMs) have shown progress in image geo-localization using conspicuous cues such as landmarks or road signs, their ability to interpret temporal signals and related spatial reasoning cues remains underexplored. To address this gap, we introduce **TimeSpot**, a comprehensive benchmark for evaluating real-world geo-temporal reasoning in VLMs. **TimeSpot** contains 1,455 images from 80 countries, on which models must infer temporal attributes (season, month, time of day, daylight phase) and geolocation attributes (continent, country, climate zone, environment type, latitude–longitude coordinates) directly from the image. It also includes spatial reasoning tasks that require integrating geographical, spatial, and temporal cues to solve complex understanding problems. Unlike prior benchmarks that emphasize obvious cues or iconic imagery, **TimeSpot** prioritizes diverse and subtle settings, reflecting the difficulty of reasoning under real-world uncertainty. Our evaluation of state-of-the-art VLMs, spanning both open- and closed-source models, shows consistently low performance across tasks. These results underscore the substantial challenges that remain for robust temporal and geographic reasoning, and the need for improved methods that achieve reliable and trustworthy geo-temporal understanding in VLMs.
Primary Area: datasets and benchmarks
Submission Number: 11153