ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Published: 23 Sept 2025, Last Modified: 22 Nov 2025 · LAW · CC BY 4.0
Keywords: Embodied Agents, Vision Language Models, Benchmarking, World Modeling
TL;DR: We introduce ENACT, which probes embodied cognition in VLMs via egocentric world modeling through two permutation tasks (forward and inverse world modeling) built from scalable simulator data with online verification.
Abstract: We introduce ENACT, a scalable benchmark for studying Embodied Cognition via world modeling through egocentric interaction, probing how spatial perception, physical interaction, and language cohere in modern Vision–Language Models (VLMs). Grounded in a POMDP view of decision making, ENACT comprises two complementary permutation tasks: forward world modeling (reorder future observations to match a given action sequence) and inverse world modeling (reorder actions to explain a given observation sequence). Data are generated by replaying diverse household activities in a reproducible simulator (Behavior) that aligns symbolic scene graphs with egocentric RGB, yielding 8,972 QA items. Predictions are validated by an online verifier that accepts any sequence consistent with the task constraints, and we report Task Accuracy (exact ordering) and Pairwise Accuracy (adjacent consistency). Across evaluated VLMs, performance degrades with longer interaction horizons, and the inverse task is consistently easier than the forward one. Targeted probes of GPT-5 mini and InternVL-3.5 show limited sensitivity to image realism and robot appearance, while GPT-5 mini exhibits marked sensitivity to camera-distribution shifts (elevated viewpoints, extreme apertures, fisheye). Both models display a handedness asymmetry, with fewer right-hand errors. Overall, ENACT offers a scalable proxy for studying embodied cognition and a tool to inform the design of models that better bind perception to action over long horizons.
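A minimal sketch in Python of the two reported metrics, assuming Pairwise Accuracy is the fraction of adjacent pairs in the predicted ordering whose relative order matches the ground truth; the page does not spell out the formulas, and the function names here are illustrative, not from the paper:

from typing import Sequence

def task_accuracy(pred: Sequence[int], gold: Sequence[int]) -> float:
    # Exact-ordering score: 1.0 only if the predicted permutation matches exactly.
    return float(list(pred) == list(gold))

def pairwise_accuracy(pred: Sequence[int], gold: Sequence[int]) -> float:
    # Adjacent-consistency score (assumed reading): fraction of adjacent pairs
    # in the predicted sequence whose relative order agrees with the ground truth.
    rank = {item: i for i, item in enumerate(gold)}
    pairs = list(zip(pred, pred[1:]))
    if not pairs:
        return 1.0
    return sum(rank[a] < rank[b] for a, b in pairs) / len(pairs)

# Example: a 4-step ordering with one adjacent swap.
print(task_accuracy([0, 2, 1, 3], [0, 1, 2, 3]))      # 0.0 (not an exact match)
print(pairwise_accuracy([0, 2, 1, 3], [0, 1, 2, 3]))  # 2/3: only (2, 1) is out of order

Under this reading, a single adjacent swap still earns partial credit on Pairwise Accuracy while scoring zero on Task Accuracy, which is consistent with the abstract's distinction between exact ordering and adjacent consistency.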
Submission Type: Benchmark Paper (4-9 Pages)
NeurIPS Resubmit Attestation: This submission is not a resubmission of a NeurIPS 2025 submission.
Submission Number: 155