Abstract: Spatio-temporal video grounding aims to identify the spatial and temporal regions in a video that correspond to the objects or actions referred to by a given textual description. However, current spatio-temporal video grounding models often rely heavily on spatio-temporal priors to make predictions. As a result, they may suffer from spurious correlations and generalize poorly to new or diverse scenarios. To overcome this limitation, we introduce a deconfounded multimodal learning framework, which uses a structural causal model to treat dataset bias as a confounder and remove its confounding effect. Within this framework, we perform causal intervention on the multimodal input and derive an unbiased estimation formula via do-calculus. To tackle the challenge of diverse and often unobservable confounders, we further propose a novel retrieval-based approach with a causal mask mechanism. The proposed method leverages analogical reasoning to facilitate deconfounded learning and mitigate dataset biases, enabling unbiased spatio-temporal prediction without explicitly modeling the confounding factors. Extensive experiments on two challenging benchmarks verify the effectiveness and rationality of our proposed solution.
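For context, the causal intervention described above follows the general pattern of backdoor adjustment; a minimal sketch in standard causal-inference notation, where X denotes the multimodal input (video and text), Y the spatio-temporal prediction, and z ranges over values of the confounder (the paper's exact formulation and symbols may differ):

    % Standard backdoor adjustment (generic sketch, not the paper's exact derivation):
    % intervening on X cuts the confounder-to-input path, so the deconfounded
    % prediction marginalizes over the confounder z rather than conditioning on it.
    P\big(Y \mid \mathrm{do}(X)\big) = \sum_{z} P\big(Y \mid X, z\big)\, P(z)

Because the confounders here are diverse and often unobservable, the summation over z cannot be computed directly; the retrieval-based approach with a causal mask is introduced to approximate this deconfounded estimate without enumerating confounder values.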