Abstract: Spatio-temporal video grounding aims to identify the spatial and temporal regions in a video that correspond to the objects or actions referred to by a given textual description. However, current spatio-temporal video grounding models often rely heavily on spatio-temporal priors to make predictions. As a result, they may suffer from spurious correlations and generalize poorly to new or diverse scenarios. To overcome this limitation, we introduce a deconfounded multimodal learning framework, which uses a structural causal model to treat dataset bias as a confounder and remove its confounding effect. Within this framework, we perform causal intervention on the multimodal input and derive an unbiased estimation formula via do-calculus. To tackle the challenge of diverse and often unobservable confounders, we further propose a novel retrieval-based approach with a causal mask mechanism. The proposed method leverages analogical reasoning to facilitate deconfounded learning and mitigate dataset biases, enabling unbiased spatio-temporal prediction without explicitly modeling the confounding factors. Extensive experiments on two challenging benchmarks verify the effectiveness and rationality of our proposed solution.
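For context, the causal intervention described above follows the general pattern of backdoor adjustment; a minimal sketch in standard causal-inference notation, where X denotes the multimodal input (video and text), Y the spatio-temporal prediction, and z ranges over values of the confounder (the paper's exact formulation and symbols may differ):

    % Standard backdoor adjustment (generic sketch, not the paper's exact derivation):
    % intervening on X cuts the confounder-to-input path, so the deconfounded
    % prediction marginalizes over the confounder z rather than conditioning on it.
    P\big(Y \mid \mathrm{do}(X)\big) = \sum_{z} P\big(Y \mid X, z\big)\, P(z)

Because the confounders here are diverse and often unobservable, the summation over z cannot be computed directly; the retrieval-based approach with a causal mask is introduced to approximate this deconfounded estimate without enumerating confounder values.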