Keywords: Embodied Agents, Vision Language Models, Benchmarking, World Modeling
TL;DR: We introduce ENACT, a benchmark that probes embodied cognition in VLMs via egocentric world modeling. It comprises two permutation tasks (forward and inverse world modeling) built from scalable simulator data with online verification.
Abstract: Embodied cognition argues that intelligence arises from continuous sensorimotor interaction with the world. This raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? To investigate this, we introduce **ENACT**, a benchmark that probes this question through world modeling from egocentric interaction. Grounded in a partially observable Markov decision process (POMDP) framework, **ENACT** comprises two complementary sequence-reordering tasks: forward world modeling (predicting an ordered sequence of future states from actions) and inverse world modeling (inferring an ordered sequence of actions from state changes). Correctly solving these tasks indicates that a model understands how the environment will evolve given an agent's actions. Our scalable dataset contains 8,972 QA pairs derived from diverse, long-horizon household activities in the BEHAVIOR simulator. Experiments reveal a significant performance gap between state-of-the-art VLMs and humans that widens dramatically as the interaction horizon lengthens. We find that models consistently solve the inverse problem better than the forward one and exhibit strong embodied biases: a preference for right-handed actions and performance degradation under camera perspectives that deviate from human vision. Code and supplementary materials are available in our [anonymous repository](https://github.com/iclrsubmission2026/iclr-2026-submission).
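The reordering protocol described in the abstract can be illustrated with a minimal sketch. All names here (`make_forward_qa`, `verify`, the toy state strings) are hypothetical and only show the general shape of a permutation-style QA pair with exact-match verification, not the actual ENACT implementation:

```python
import random

def make_forward_qa(states, seed=0):
    """Illustrative sketch of a forward world-modeling question:
    shuffle the ground-truth future states and record the permutation
    that restores their original order."""
    rng = random.Random(seed)
    order = list(range(len(states)))       # order[j] = true index of shuffled slot j
    rng.shuffle(order)
    shuffled = [states[i] for i in order]
    # answer[k] = which shuffled slot holds the k-th true state
    answer = [order.index(k) for k in range(len(states))]
    return shuffled, answer

def verify(shuffled, prediction, states):
    """Exact-match verification: the predicted permutation must
    reassemble the original state sequence."""
    return [shuffled[j] for j in prediction] == states

# Toy rollout (hypothetical): three future egocentric states
states = ["fridge open", "milk grasped", "fridge closed"]
shuffled, answer = make_forward_qa(states, seed=1)
assert verify(shuffled, answer, states)
```

The inverse task follows the same pattern with actions shuffled instead of states; since the ground-truth permutation is known at construction time, answers can be checked online without human annotation.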
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 445