Can you even tell left from right? Presenting a new challenge for VQA

Published: 03 Jan 2024, Last Modified: 13 Nov 2024 · OpenReview Archive Direct Upload · CC BY-NC-ND 4.0
Abstract: Visual Question Answering (VQA) needs a means of evaluating the strengths and weaknesses of models. One aspect of such an evaluation is the measurement of compositional generalisation: the ability of a model to answer well on scenes whose compositions differ from those of scenes in the training dataset. In this work, we present several quantitative measures of compositional separation and find that popular VQA datasets are not good compositional evaluators. To address this, we present Uncommon Objects in Unseen Configurations (UOUC), a synthetic dataset for VQA. UOUC is at once fairly complex and compositionally well-separated. The object set of UOUC consists of 380 classes taken from 528 characters of the Dungeons and Dragons game. The training set of UOUC consists of 200,000 scenes, while the test set consists of 30,000 scenes. To study compositional generalisation, simple reasoning and memorisation, each scene of UOUC is annotated with up to 10 novel questions covering spatial relationships, hypothetical changes to scenes, counting, comparison, memorisation and memory-based reasoning. In total, UOUC presents over 2 million questions. Our evaluation of recent state-of-the-art VQA models shows that they exhibit poor compositional generalisation and comparatively lower ability at simple reasoning. These results suggest that UOUC could advance research by serving as a strong benchmark for VQA, especially for the study of compositional generalisation.