Keywords: Video Causal Reasoning, Human-like Causal Reasoning, Actual Causality, Causal Judgment
Abstract: The pursuit of human-like causal reasoning in large multimodal models (LMMs) is a critical yet challenging frontier. Current video causal reasoning benchmarks often lack a systematic design that aligns with the nuances of human causal cognition. To address this gap, we introduce the HVCR benchmark for evaluating human-like video causal reasoning in LMMs, systematically designed across three levels.
At the **definitional** level, since causal relations in videos are inherently *specific* and thus align naturally with the field of actual causality, we adopt definitions from this domain that robustly handle complex scenarios such as preemption, which cannot be captured by the simple "but-for" test. At the **goal-oriented** level, our aim is to simulate human causal judgments rather than to fit formal definitions or frameworks. We therefore establish our gold standard using human "consensus" from rigorous human experiments in cognitive science, leveraging seven well-studied causal scenarios as reliable references. At the **representational** level, we employ explicit *causal graphs* and a variant of *twin networks* to enable the automatic generation of causal questions.
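To make the preemption point concrete, the following is a minimal illustrative sketch (not drawn from the benchmark or the authors' code) of the classic rock-throwing example: the "but-for" test fails to identify the preempting cause, while an actual-causality-style test that holds a contingency fixed recovers the intuitive judgment. The variable names and structural equations are assumptions chosen for illustration only.

```python
# Illustrative sketch (assumed variable names ST, BT, SH, BH, BS):
# Suzy and Billy both throw rocks at a bottle; Suzy's rock hits first,
# preempting Billy's. The but-for test misses Suzy's throw as a cause.

def model(ST, BT, SH=None, BH=None):
    """Structural equations: Suzy throws (ST), Billy throws (BT);
    Suzy's hit (SH) preempts Billy's hit (BH); bottle shatters (BS)."""
    SH = ST if SH is None else SH                 # Suzy hits iff she throws
    BH = (BT and not SH) if BH is None else BH    # Billy hits only if Suzy misses
    BS = SH or BH                                 # bottle shatters if either hits
    return SH, BH, BS

# Actual world: both throw, Suzy's rock hits and shatters the bottle.
SH, BH, BS = model(ST=1, BT=1)                    # SH=1, BH=False, BS=1

# But-for test: remove Suzy's throw; the bottle still shatters via Billy,
# so Suzy's throw is *not* a but-for cause of the shattering.
_, _, BS_butfor = model(ST=0, BT=1)               # BS_butfor == True

# Actual-cause test (contingency-style): hold BH at its actual value
# (Billy did not hit) and intervene on Suzy's throw; now the shattering
# counterfactually depends on Suzy's throw, matching human judgment.
_, _, BS_contingency = model(ST=0, BT=1, BH=BH)   # BS_contingency == False

print(BS, BS_butfor, BS_contingency)              # 1 True False
```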
The HVCR benchmark contains 300 videos (240 synthetic and 60 realistic) and 4,967 causal questions. These questions span three causal rungs (discovery, intervention, and counterfactual) and eight types, focusing on key aspects of human-like causal reasoning such as causal attribution and responsibility.
Human evaluation shows that observers achieve, on average, nearly 80\% accuracy on our synthetic videos, confirming their clarity. However, current LMMs underperform on both synthetic and real-world videos, revealing a significant gap in their human-like causal reasoning capabilities.
To our knowledge, HVCR is the first video causal reasoning benchmark to systematically integrate these three design levels, jointly consider synthetic and real-world settings, and focus exclusively on the pure causal reasoning abilities of LMMs.
Primary Area: datasets and benchmarks
Submission Number: 292