Interventional Video Relation Detection

Yicong Li, Xun Yang, Xindi Shang, Tat-Seng Chua

2021 (modified: 17 Nov 2022) | ACM Multimedia 2021 | Readers: Everyone

Abstract: Video Visual Relation Detection (VidVRD) aims to semantically describe the dynamic interactions across visual concepts localized in a video in the form of <subject, predicate, object>. It can help to mitigate the semantic gap between vision and language in video understanding, and has therefore received increasing attention in the multimedia community. Existing efforts primarily leverage multimodal/spatio-temporal feature fusion to augment the representation of object trajectories and their interactions, and formulate predicate prediction as a multi-class classification task. Despite their effectiveness, existing models ignore the severe long-tailed bias in VidVRD datasets. As a result, their predictions are easily biased towards the popular head predicates (e.g., next-to and in-front-of), leading to poor generalizability. To fill this research gap, this paper proposes an Interventional Video Relation Detection (IVRD) approach that aims to improve not only the accuracy but also the robustness of the model prediction. Specifically, to better model the high-level visual predicate, IVRD consists of two key components: 1) we first learn a set of predicate prototypes, where each prototype vector describes a set of relation references with the same predicate; and 2) we apply a causality-inspired intervention on the model input <subject, object>, which forces the model to fairly take each possible predicate prototype into consideration. We expect the model to focus more on the visual content of the dynamic interaction between subject and object, rather than on the spurious correlations between the model input and predicate labels. Extensive experiments on two popular benchmark datasets show the effectiveness of IVRD and its advantage in reducing the long-tailed bias.
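
The two components named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how prototypes are learned or how the intervention is computed. The sketch assumes prototypes are per-predicate mean features, that input and prototype are fused by simple addition, and that the intervention is a backdoor-adjustment-style average over all prototypes under a uniform prior. The function names, the linear classifier `weight`, and the toy data are all hypothetical.

```python
import numpy as np


def predicate_prototypes(features, labels, num_predicates):
    """Build one prototype per predicate as the mean feature of its samples.

    Assumption: mean pooling stands in for whatever prototype learning
    the paper actually uses.
    """
    protos = np.zeros((num_predicates, features.shape[1]))
    for p in range(num_predicates):
        mask = labels == p
        if mask.any():
            protos[p] = features[mask].mean(axis=0)
    return protos


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def intervened_scores(pair_feature, protos, weight, prior=None):
    """Average predicate probabilities over every prototype.

    Each prototype is combined with the <subject, object> feature and
    classified; the results are weighted by a fixed prior (uniform by
    default) instead of the skewed training label frequencies.
    """
    k = protos.shape[0]
    if prior is None:
        prior = np.full(k, 1.0 / k)  # every prototype counts equally
    combined = pair_feature[None, :] + protos  # additive fusion (assumption)
    probs = softmax(combined @ weight)  # shape: (k, num_predicates)
    return prior @ probs  # shape: (num_predicates,)


# Toy long-tailed data: predicate 0 is the head class, predicate 2 the tail.
rng = np.random.default_rng(0)
features = rng.normal(size=(12, 4))
labels = np.array([0] * 8 + [1] * 3 + [2] * 1)
protos = predicate_prototypes(features, labels, num_predicates=3)
weight = rng.normal(size=(4, 3))  # hypothetical linear classifier
scores = intervened_scores(features[0], protos, weight)
```

With the uniform prior, the tail predicate's prototype contributes to the final score as much as the head predicate's, which is one simple reading of "fairly" taking each prototype into consideration.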
