Keywords: Video Causal Reasoning, Human-like Causal Reasoning, Actual Causality, Causal Judgment
Abstract: The pursuit of human-like causal reasoning in large multimodal models (LMMs) is a critical yet challenging frontier. Current video causal reasoning benchmarks often lack a systematic design that aligns with the nuances of human causal cognition. To address this gap, we introduce the HVCR benchmark for evaluating human-like video causal reasoning in LMMs, systematically designed across three levels.
At the **definitional** level, since causal relations in videos are inherently *specific* and thus align naturally with the field of actual causality, we adopt definitions from this domain that robustly handle complex scenarios such as preemption, which cannot be captured by the simple "but-for" test. At the **goal-oriented** level, our aim is to simulate human causal judgments rather than to fit formal definitions or frameworks. We therefore establish our gold standard using human "consensus" from rigorous human experiments in cognitive science, leveraging seven well-studied causal scenarios as reliable references. At the **representational** level, we employ explicit *causal graphs* and a variant of *twin networks* to enable the automatic generation of causal questions.
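To make the preemption point concrete, the following is a minimal illustrative sketch (not drawn from the benchmark or the authors' code) of the classic rock-throwing example: the "but-for" test fails to identify the preempting cause, while an actual-causality-style test that holds a contingency fixed recovers the intuitive judgment. The variable names and structural equations are assumptions chosen for illustration only.

```python
# Illustrative sketch (assumed variable names ST, BT, SH, BH, BS):
# Suzy and Billy both throw rocks at a bottle; Suzy's rock hits first,
# preempting Billy's. The but-for test misses Suzy's throw as a cause.

def model(ST, BT, SH=None, BH=None):
    """Structural equations: Suzy throws (ST), Billy throws (BT);
    Suzy's hit (SH) preempts Billy's hit (BH); bottle shatters (BS)."""
    SH = ST if SH is None else SH                 # Suzy hits iff she throws
    BH = (BT and not SH) if BH is None else BH    # Billy hits only if Suzy misses
    BS = SH or BH                                 # bottle shatters if either hits
    return SH, BH, BS

# Actual world: both throw, Suzy's rock hits and shatters the bottle.
SH, BH, BS = model(ST=1, BT=1)                    # SH=1, BH=False, BS=1

# But-for test: remove Suzy's throw; the bottle still shatters via Billy,
# so Suzy's throw is *not* a but-for cause of the shattering.
_, _, BS_butfor = model(ST=0, BT=1)               # BS_butfor == True

# Actual-cause test (contingency-style): hold BH at its actual value
# (Billy did not hit) and intervene on Suzy's throw; now the shattering
# counterfactually depends on Suzy's throw, matching human judgment.
_, _, BS_contingency = model(ST=0, BT=1, BH=BH)   # BS_contingency == False

print(BS, BS_butfor, BS_contingency)              # 1 True False
```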
The HVCR benchmark contains 300 videos (240 synthetic and 60 realistic) and 4,967 causal questions. These questions span three causal rungs (discovery, intervention, and counterfactual) and eight types, focusing on key aspects of human-like causal reasoning such as causal attribution and responsibility.
Human evaluation shows that observers achieve, on average, nearly 80\% accuracy on our synthetic videos, confirming their clarity. However, current LMMs underperform on both synthetic and real-world videos, revealing a significant gap in their human-like causal reasoning capabilities.
To our knowledge, HVCR is the first video causal reasoning benchmark to systematically integrate these three design levels, jointly consider synthetic and real-world settings, and focus exclusively on the pure causal reasoning abilities of LMMs.
Primary Area: datasets and benchmarks
Submission Number: 292