ASPIRE: Bridging the Gap Between Visual Perception and Spatial Agency

ACL ARR 2026 January Submission 7612 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLMs, Spatial Understanding, Generative Agents, Benchmarking
Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in passive visual perception and question answering, their ability to actively manipulate and design spatial environments remains a critical open challenge. As generative agents transition from text-based tasks to complex GUI and whiteboard operations, the distinction between “seeing" a spatial layout and effectively “acting" upon it becomes paramount. To bridge this gap, we present ASPIRE, a benchmark for Agentic Spatial Performance in Whiteboard Environments. Unlike previous benchmarks that rely on static multiple-choice questions, ASPIRE evaluates agents on open-ended state manipulation, requiring them to create, update, and organize visual elements to satisfy spatial constraints. Our extensive evaluation of state-of-the-art models reveals a fundamental dichotomy: while agents excel at discrete, structured tasks (e.g., maze navigation, graph coloring) where visual data can be mapped to symbolic logic, they struggle significantly with continuous, intuitive reasoning (e.g., visual balance, angular rotation). Furthermore, our ablation studies uncover a “scaffolding paradox": providing visual aids such as grids or polar plots often degrades performance, suggesting that current MLLMs rely heavily on semantic metadata rather than robust visual-spatial grounding.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: LLM agents, tool use, multi-modal agents, environment interaction, cross-modal information extraction, benchmarking
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 7612