P ≈ NP, at least in Visual Question Answering

Published: ICPR 2020. Last Modified: 09 Oct 2023.
Abstract: In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by public challenges and large datasets. One of the most widely used of these is the VQA 2.0 dataset, consisting of polar (“yes/no”) and non-polar questions. Looking at the distribution of answers over all questions, we find that the answers “yes” and “no” account for 38% of the questions (19% per class), while the remaining 62% are spread over the remaining 3127 answers (0.02% per class). While several sources of bias have been investigated in the field, the effects of such an over-representation of polar questions remain unclear. In this paper, we measure the potential confounding factors when polar and non-polar samples are used jointly to train a baseline VQA classifier, and compare this setting to an upper bound in which the over-representation of polar questions is excluded from training. Further, we perform cross-over experiments to analyze how well the feature spaces of polar and non-polar samples align. Contrary to expectations, we find no evidence of counterproductive effects in the joint training of unbalanced classes. In fact, by exploring the intermediate feature space of visual-text embeddings, we find that the feature space of polar questions already encodes sufficient structure to answer many non-polar questions. Our results indicate that the polar (P) and the non-polar (NP) feature spaces are strongly aligned, hence the expression P ≈ NP.
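The per-class figures quoted in the abstract follow directly from the answer shares; a minimal sketch of that arithmetic (the share values and class counts are taken from the abstract, everything else is illustrative):

```python
# Sketch: per-class shares for the VQA 2.0 answer distribution described in the abstract.
polar_share = 0.38            # "yes" + "no" answers combined
num_polar_classes = 2         # the two polar answers
num_nonpolar_classes = 3127   # remaining answer classes

per_polar = polar_share / num_polar_classes                 # 0.19 -> 19% per class
per_nonpolar = (1 - polar_share) / num_nonpolar_classes     # ~0.0002 -> ~0.02% per class

print(f"polar answers:     {per_polar:.2%} per class")
print(f"non-polar answers: {per_nonpolar:.2%} per class")
```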