Mitigating Language Biases In Visual Question Answering Through The Forgotten Attention Algorithm

ACL ARR 2024 June Submission354 Authors

10 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: In Visual Question Answering (VQA), a model's ability to comprehend multiple modalities is crucial for accurate answer reasoning. However, recent studies have uncovered prevailing language biases in VQA, where reasoning frequently relies on spurious associations between questions and answers rather than on genuine multi-modal, knowledge-based reasoning. Revealing the true relationship between image and question therefore remains a significant challenge. The key idea of this work is inspired by how humans answer visual questions: people gradually narrow the focus area in an image, guided by the question, until only the relevant region remains. More specifically, we introduce a novel attention algorithm, named the Forgotten Attention Algorithm (FAA), which gradually "forgets" some visual content over several rounds. This deliberate forgetting concentrates the model's attention on the image region most relevant to the question, enhancing the integration of image content and thereby mitigating language biases. We conduct comprehensive experiments on the VQA-CP v2, VQA v2, and VQA-VS datasets to validate the effectiveness and robustness of the algorithm.
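The abstract describes an iterative, question-guided narrowing of visual focus. As a minimal sketch of that idea (not the authors' implementation; the function name, the dot-product relevance scoring, and the `rounds`/`keep_ratio` parameters are all assumptions), one round of "forgetting" could compute attention over the currently retained image regions and discard the least-attended ones:

```python
import numpy as np

def forgotten_attention(region_feats, question_feat, rounds=3, keep_ratio=0.5):
    """Hypothetical sketch: iteratively 'forget' low-attention image regions.

    region_feats: (num_regions, dim) array of image-region features.
    question_feat: (dim,) question embedding.
    Returns the indices of the regions retained after all rounds.
    """
    active = np.arange(region_feats.shape[0])  # indices of regions still in focus
    for _ in range(rounds):
        # Question-conditioned relevance scores for the retained regions
        scores = region_feats[active] @ question_feat
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()  # softmax attention over active regions
        # Keep only the top-attended fraction; the rest are "forgotten"
        k = max(1, int(len(active) * keep_ratio))
        top = np.argsort(attn)[::-1][:k]
        active = active[top]
    return active

rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 8))   # e.g. 36 detected region features
question = rng.standard_normal(8)
kept = forgotten_attention(regions, question)  # 36 -> 18 -> 9 -> 4 regions
```

Because each round discards a fixed fraction of regions, the attention mass is forced onto a shrinking, question-relevant subset of the image, which is the mechanism the abstract credits with reducing reliance on question-only shortcuts.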
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: visual question answering
Contribution Types: Surveys
Languages Studied: English
Submission Number: 354