Keywords: multimodal LLM, spatial planning
TL;DR: We reveal the deficiencies of current multimodal LLMs in visual spatial planning through a benchmark of carefully designed tasks.
Abstract: With the recent introduction of vision understanding capabilities in large language models, multimodal LLMs (MLLMs) have inherited and advanced a series of intriguing capabilities from classical LLMs. Among these capabilities, visual spatial planning - the ability to comprehend the spatial arrangements of objects and devise action plans to achieve desired outcomes - remains under-explored in MLLMs. In our study, we introduce VSP, a benchmark designed to 1) evaluate the overall spatial planning capability of these models, and 2) decompose visual planning into finer-grained sub-tasks, including perception and reasoning, and measure model capabilities on each. Contrary to the expectation that MLLMs should naturally process scene images and reason effectively, evaluation on the benchmark shows that both open-source and private MLLMs fail to generate effective plans for even simple spatial planning tasks. The fine-grained analysis further reveals that while MLLMs have flaws in both perception and reasoning, the deficiency in the former is significantly worse; these fundamental deficiencies explain the models' poor performance on the general spatial planning tasks. Our work illuminates future directions for improving multimodal LLMs' abilities in spatial planning.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7966