Keywords: LLMs, Spatial Understanding, Generative Agents, Benchmarking
Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive
capabilities in passive visual perception and question answering, their ability
to actively manipulate and design spatial environments remains a
critical open challenge. As generative agents transition from text-based tasks
to complex GUI and whiteboard operations, the distinction between “seeing” a
spatial layout and effectively “acting” upon it becomes paramount. To bridge
this gap, we present ASPIRE, a benchmark for Agentic Spatial Performance in
Whiteboard Environments. Unlike previous benchmarks that rely on static
multiple-choice questions, ASPIRE evaluates agents on open-ended state
manipulation, requiring them to create, update, and organize visual elements to
satisfy spatial constraints. Our extensive evaluation of state-of-the-art models
reveals a fundamental dichotomy: while agents excel at discrete, structured
tasks (e.g., maze navigation, graph coloring) where visual data can be mapped to
symbolic logic, they struggle significantly with continuous, intuitive reasoning
(e.g., visual balance, angular rotation). Furthermore, our ablation studies
uncover a “scaffolding paradox": providing visual aids such as grids or polar
plots often degrades performance, suggesting that current MLLMs rely heavily on
semantic metadata rather than robust visual-spatial grounding.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: LLM agents, tool use, multi-modal agents, environment interaction, cross-modal information extraction, benchmarking
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 7612