Orderbench: A Unified Benchmark for Temporal and Causal Reasoning Across Multimodal, World-Model, and Embodied AI Systems

ACL ARR 2026 May Submission17394 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: Temporal reasoning, Unified benchmark, Embodied AI, World models

Abstract: Intelligent systems that operate in the real world, whether they generate future predictions, execute embodied instructions, or interpret complex visual scenes, share a common prerequisite: the ability to reason about when and why events occur. Despite this shared dependency, multimodal large language models (MLLMs), world models, and vision-language-action systems (VLAs) have historically been evaluated in isolation, with no unified framework capable of exposing their common temporal and causal reasoning limitations. We introduce Orderbench, which identifies temporal ordering as the natural intersection across these domains, enabling the first comparative evaluation of their temporal and causal reasoning capabilities through realistic event reconstruction from shuffled video frames. Orderbench adapts this core challenge to each model family's strengths: frame ranking for MLLMs, task progress estimation for VLAs, and future frame prediction for world models. The result is a unified evaluation framework spanning 4,000 samples across daily-life and robotics domains. Extensive experiments reveal that even the most advanced models struggle significantly, with the best performance below 40% accuracy and 10-20% performance gaps between daily-life and robotic scenarios, exposing a critical disconnect in temporal cognition. Building on this observation, we further explore factors that elicit temporal and causal reasoning in current models. We believe this work will provide guidance for research on causally aware world models and embodied AI systems.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: Temporal reasoning, Unified benchmark, Embodied AI, World models

Contribution Types: Data analysis

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: no

Submission Number: 17394
