Abstract: Visual question answering is a task of predicting the answer to a question about an image. Given that different people can provide different answers to a visual question, we aim to better understand why with answer groundings. We introduce the first dataset that visually grounds each unique answer to each visual question, which we call VQA AnswerTherapy. We then propose two novel problems of predicting whether a visual question has a single answer grounding and localizing all answer groundings. We benchmark modern algorithms for these novel problems to show where they succeed and struggle. The dataset and evaluation server can be found publicly at https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/.