Orderbench: A Unified Benchmark for Temporal and Causal Reasoning Across Multimodal, World-Model, and Embodied AI Systems

ACL ARR 2026 May Submission17394 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0
Keywords: Temporal reasoning, Unified benchmark, Embodied AI, World models
Abstract: Intelligent systems that operate in the real world, whether generating future predictions, executing embodied instructions, or interpreting complex visual scenes, share a common prerequisite: the ability to reason about when and why events occur. Despite this shared dependency, multimodal large language models (MLLMs), world models, and vision-language-action systems (VLAs) have historically been evaluated in isolation, with no unified framework capable of exposing their common temporal and causal reasoning limitations. We introduce Orderbench, which identifies temporal ordering as the natural intersection across these domains, enabling the first comparative evaluation of their temporal and causal reasoning capabilities through realistic event reconstruction from shuffled video frames. Orderbench adapts this core challenge to each model family's strengths: frame ranking for MLLMs, task progress estimation for VLAs, and future frame prediction for world models, creating an organically unified evaluation framework spanning 4,000 samples across daily life and robotics domains. Extensive experiments reveal that even the most advanced models struggle significantly, with the best performance under 40% accuracy and performance gaps of 10-20% between daily and robotic scenarios, exposing a critical disconnect in temporal cognition. Building on this observation, we further explore factors that elicit temporal and causal reasoning in current models. We believe this work will provide guidance for research on causally-aware world models and embodied AI systems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Temporal reasoning, Unified benchmark, Embodied AI, World models
Contribution Types: Data analysis
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: no
Submission Number: 17394