TL;DR: We propose a framework that endows VLMs with the ability to predict the next trajectory location, and use it to benchmark VLMs.
Abstract: Predicting the next location is a hallmark of spatial intelligence.
In real-world scenarios, humans often rely on visual estimation to perform next-location prediction, such as anticipating movement to avoid collisions with others.
With the emergence of large models demonstrating general visual capabilities, we explore whether vision-language models (VLMs) can perform next-location prediction in a similar way to humans.
We present \textbf{VLMLocPredictor}, a benchmark for evaluating VLMs on next-location prediction tasks, contributing: (1) the Visual Guided Location Search (VGLS) module, a recursive refinement strategy that leverages visual guidance to iteratively narrow the search space for predictions; (2) a comprehensive vision-based dataset integrating open-source maps with taxi trajectories; (3) a human benchmark established via a large-scale social experiment.
Across over 1000 queries to 14 VLMs, our findings indicate that, with our framework, VLMs exhibit promising potential for next-location prediction; however, their performance does not yet reach human-level accuracy. While some VLMs show the potential to outperform humans in 24\% of scenarios, we believe that VLMs will surpass average human performance on next-location prediction tasks in the near future.
The benchmark and resources are available at \url{https://ihhh.cn}.
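The abstract gives only a high-level description of VGLS. As a rough illustration, and not the paper's actual implementation, the following Python sketch shows one way a recursive, visually guided narrowing of the search space could be structured; all names (`Region`, `subdivide`, `choose_subregion`, the grid size, and the recursion depth) are hypothetical assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Region:
    """Axis-aligned map region: (min_lon, min_lat, max_lon, max_lat)."""
    bounds: Tuple[float, float, float, float]

    def subdivide(self, n: int = 2) -> List["Region"]:
        """Split the region into an n x n grid of candidate sub-regions."""
        min_lon, min_lat, max_lon, max_lat = self.bounds
        lon_step = (max_lon - min_lon) / n
        lat_step = (max_lat - min_lat) / n
        return [
            Region((min_lon + i * lon_step, min_lat + j * lat_step,
                    min_lon + (i + 1) * lon_step, min_lat + (j + 1) * lat_step))
            for i in range(n) for j in range(n)
        ]

    def center(self) -> Tuple[float, float]:
        """Return the region's center as the final location estimate."""
        min_lon, min_lat, max_lon, max_lat = self.bounds
        return ((min_lon + max_lon) / 2, (min_lat + max_lat) / 2)


def visual_guided_location_search(
    region: Region,
    choose_subregion: Callable[[Region, List[Region]], int],
    depth: int = 4,
) -> Tuple[float, float]:
    """Recursively narrow the search space: at each step, the caller-supplied
    `choose_subregion` (e.g., a VLM queried with a rendered map of the current
    region and the trajectory so far) picks the candidate cell most likely to
    contain the next location, and the search recurses into that cell."""
    for _ in range(depth):
        candidates = region.subdivide()
        chosen = choose_subregion(region, candidates)  # visual guidance step
        region = candidates[chosen]
    return region.center()
```

In a setting like the paper's, `choose_subregion` would wrap the VLM query: render the current region with the observed trajectory and the candidate grid overlaid, ask the model which cell the next point falls in, and return that cell's index.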
Primary Area: Deep Learning->Large Language Models
Keywords: Next Location Prediction, Vision-Language Model, Machine Learning
Submission Number: 740