Text-Only Grid Spatial Understanding for Embodied Agents

ACL ARR 2026 January Submission 7912 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: spatial understanding, embodied agents, spatial reasoning
Abstract: Spatial reasoning abilities have become more important to recent tasks. However, we do not understand how LLMs reason about space or utilize their multi-modal inputs in such tasks. As a starting point, %setting aside the complexities of additional modalities, we introduce a dataset of text-only spatial reasoning problems on grids\footnote{Our dataset and code will be freely available at URL.} to understand what abilities LLMs have without exposure to visual modalities and compare the performance of a variety of LLMs and VLMs on these tasks. We find that even text-only models have some implicit or mathematical understanding of grids and the 3D space they represent; however, even VLMs and foundation models fall short when asked to reason about space from the perspective of an embodied agent (i.e. with its own frame of spatial reference). We also find that models struggle more when instructing others and when they need to recognize real-world concepts within a grid.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: reasoning, multi-modal agents, embodied agents, LLM agents
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7912