Answer Me if You Can: Debiasing Video Question Answering via Answering Unanswerable Questions

Published: 01 Feb 2023, Last Modified: 13 Feb 2023. Submitted to ICLR 2023.
Keywords: Video Question Answering, debiasing, causal inference
TL;DR: We propose a novel framework for VideoQA that can learn confounders present in the dataset even when they are unobserved, and effectively remove the effects of the learned confounders.
Abstract: Video Question Answering (VideoQA) is the task of predicting the correct answer given a question-video pair. Recent studies have shown that most VideoQA models rely on spurious correlations induced by various biases when predicting an answer. For instance, VideoQA models tend to predict 'two' as an answer without considering the video whenever a question starts with "How many", since the majority of answers to such questions are 'two'. In causal inference, such a bias ($\textit{question type}$), which simultaneously affects the input $X$ ($\textit{How many...}$) and the answer $Y$ ($\textit{two}$), is referred to as a confounder $Z$ that hinders a model from learning the true relationship between the input and the answer. The effect of the confounder $Z$ can be removed with a causal intervention $P(Y|do(X))$ when $Z$ is observed. However, there exist many unobserved confounders affecting questions and videos, $\textit{e.g.}$, dataset bias induced by annotators who mainly focus on human activities and salient objects, resulting in a spurious correlation between videos and questions. To address this problem, we propose a novel framework that learns unobserved confounders by capturing the bias with $\textit{unanswerable}$ questions, i.e., artificially constructed VideoQA samples that pair a video and a question taken from two different samples, and that leverages the learned confounders to debias a VideoQA model through causal intervention. We demonstrate that our confounders successfully capture the dataset bias by investigating which parts of a video or question the confounders attend to. Our experiments on multiple VideoQA benchmark datasets show the effectiveness of the proposed debiasing framework, which yields an even larger performance gap over biased models under distribution shift.
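The two mechanisms described in the abstract, constructing unanswerable question-video pairs and applying a causal intervention over learned confounders, can be illustrated with a minimal sketch. The code below is a hypothetical illustration, not the authors' implementation: the names make_unanswerable_batch and backdoor_adjusted_logits, the model(video, question, z) interface, and the uniform prior over K learned confounder embeddings are all assumptions made for the example.

import random
import torch
import torch.nn.functional as F

def make_unanswerable_batch(videos, questions):
    # Pair each question with a video from a *different* sample, so the resulting
    # pair has no ground-truth answer and any confident prediction can only come
    # from dataset bias. Assumes videos and questions are lists of equal length.
    idx = list(range(len(videos)))
    shuffled = idx[:]
    random.shuffle(shuffled)
    while any(i == j for i, j in zip(idx, shuffled)):  # avoid matching a question with its own video
        random.shuffle(shuffled)
    return [(videos[j], questions[i]) for i, j in zip(idx, shuffled)]

def backdoor_adjusted_logits(model, video, question, confounders):
    # Backdoor adjustment P(y | do(x)) ~= sum_k P(y | x, z_k) P(z_k), here with a
    # uniform prior over the learned confounder embeddings z_k. The call
    # model(video, question, z) is assumed to return answer logits.
    probs = [F.softmax(model(video, question, z), dim=-1) for z in confounders]
    return torch.log(torch.stack(probs, dim=0).mean(dim=0))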
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
Supplementary Material: zip