PDDLPUZZLEVQA: Benchmarking Visual Planning Puzzle solving abilities using Large VLMs and Symbolic Planners

ACL ARR 2025 February Submission6600 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Planning is a core aspect of human intelligence. Recent planning benchmarks have proved challenging for a wide range of Large Language Models. Yet planning in the context of vision has not been extensively explored. To fill this void and establish a sufficiently challenging reasoning benchmark for Vision-Language Models (VLMs), we introduce PDDLPuzzleVQA, a collection of $\sim10k$ puzzles encompassing six well-known types (such as Maze-Solving and N-Queens) that explicitly require multi-step planning to solve. We accompany each puzzle with a ground-truth symbolic representation in the Planning Domain Definition Language (PDDL), which in turn can be used to generate an executable plan with a symbolic planner. We therefore benchmark both the end-to-end plan generation ability of VLMs and their ability to translate a planning problem presented as image and text into PDDL. Our experiments show large deficits in state-of-the-art VLMs such as GPT-4o, Gemini-flash and InternVL2.5 across all variations of plan generation. Delving deeper, we analyze the syntactic and semantic errors that the VLMs make when generating PDDL representations. Our dataset is the first vision and reasoning dataset to focus solely on planning puzzles; it is accompanied by ground-truth PDDL representations and is a hard benchmark even for the most efficient VLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: PDDL, logical reasoning, spatial reasoning, reasoning, visual perception, visual understanding, world knowledge, domain knowledge, planning, puzzle, VQA, visual puzzle, multimodal, VLM, LLM, symbolic planners, enhsp, formal logic
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English, PDDL
Submission Number: 6600