REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories

Jacob Thompson; Emiliano Garcia-Lopez; Yonatan Bisk

Published: 08 Jul 2025, Last Modified: 26 Aug 2025, Venue: COLM 2025, License: CC BY 4.0

Keywords: Multimodal Reasoning, Embodied AI, VLMs, Spatial Reasoning, Object Permanence, Visuospatial Representation, Large Language Models (LLMs), Egocentric Vision, Video Understanding, Evaluation Benchmarks, Synthetic Environments, Long-horizon Reasoning, Numerical Tracking, Temporal Ordering, Scene Understanding, LLM Limitations

TL;DR: We introduce REM, a benchmark revealing that current multimodal language models lack fundamental abilities in spatial reasoning, object permanence, and tracking objects over changing viewpoints.

Abstract: Humans build viewpoint-independent cognitive maps through navigation, enabling intuitive reasoning about object permanence and spatial relations. We argue that multimodal large language models (MLLMs), despite extensive video training, lack this fundamental spatial reasoning capability, a critical limitation for embodied applications. To demonstrate these limitations and drive research, we introduce REM: Reasoning over Embodied Multi-Frame Trajectories, a benchmark using controllable 3D environments for long-horizon embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans. These findings highlight challenges MLLMs face in developing robust spatial representations from sequential visual input. Consequently, REM provides targeted metrics and diagnostics to foster improved spatial understanding in future models.

Submission Number: 1451
