Video Causal Understanding with Scene-conditioned Counterfactuals

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Attribution analysis, Causal inference, Counterfactual, Video understanding
TL;DR: We propose a framework for attributing specific events in video by modeling confounders with learned representations, enabling robust counterfactual inference.
Abstract: Understanding the causal consequences of actions is critical for developing reliable embodied agents. A key challenge is attributing an observed outcome to a particular action within a video. Existing video analysis methods often rely on correlation and struggle to distinguish true causation from spurious association, as they do not explicitly model confounding factors. To address this, we reframe the task as a retrospective counterfactual inquiry, which allows us to quantify an action's necessity for an outcome. We then propose an efficient and doubly robust estimator that adjusts for confounding variables learned from video frames: it remains consistent when either the outcome model or the treatment (propensity) model is correctly specified, providing resilience against model misspecification. To validate our approach, we conduct experiments in a controlled environment. The results show that our method provides more accurate causal attribution than baselines.
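The paper's estimator itself is not reproduced here, but the doubly robust (AIPW-style) idea the abstract invokes can be sketched on synthetic data. Everything below is an illustrative assumption: a one-dimensional confounder `X` stands in for a learned scene representation, `A` is the binary action, `Y` the outcome, and the true nuisance functions are plugged in directly (with the propensity deliberately misspecified) rather than fitted, to show the double-robustness property.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical confounder X (stand-in for a learned scene representation).
X = rng.normal(size=n)
# True propensity e(X) = P(A=1 | X); the action A is confounded by X.
e = 1.0 / (1.0 + np.exp(-X))
A = rng.binomial(1, e)
# Outcome with true action effect tau = 2.0 plus confounding through X.
tau = 2.0
Y = tau * A + X + rng.normal(size=n)

def aipw(Y, A, m1, m0, ehat):
    """Augmented IPW (doubly robust) estimate of the average treatment effect.

    m1, m0: predicted outcomes under action / no action; ehat: propensity.
    Consistent if EITHER (m1, m0) OR ehat is correctly specified.
    """
    return np.mean(
        m1 - m0
        + A * (Y - m1) / ehat
        - (1 - A) * (Y - m0) / (1 - ehat)
    )

# Correct outcome model, deliberately WRONG propensity (constant 0.5):
est = aipw(Y, A, m1=tau + X, m0=X, ehat=np.full(n, 0.5))
print(round(est, 2))  # close to the true effect tau = 2.0
```

Because the outcome model is correct, the augmentation terms have mean zero and the estimate stays near the true effect even with a badly misspecified propensity; swapping the roles (correct propensity, wrong outcome model) gives the same protection, which is the "resilience against model misspecification" the abstract claims.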
Primary Area: causal reasoning
Submission Number: 24305