Eliminating Language Bias in Visual Question Answering with Potential Causality Models

ACL ARR 2024 June Submission 356 Authors

10 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: The main goal of Visual Question Answering (VQA) is to learn useful information from both vision and language in order to perform answer reasoning. However, recent studies have shown that VQA models often exhibit language bias, i.e., spurious correlations between questions and answers, rather than truly deriving answers from multi-modal knowledge. Existing methods mainly focus on modeling the question part to capture the language bias, while ignoring the influence of visual content on the model. To address this issue, in this paper we combine potential causal models with VQA models, using dual attention as the treatment and treating language bias as a confounding factor. By constructing observed and counterfactual outcomes, we strengthen the role of visual information in the VQA model and thereby eliminate the impact of language bias. Experiments on the VQA-CP v2 and VQA v2 datasets demonstrate the effectiveness of the proposed method.
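To make the abstract's idea concrete, below is a minimal sketch of the general counterfactual-debiasing pattern it describes: an observed outcome from the full vision+question model and a counterfactual outcome where the visual input is withheld, with the difference used as the debiased answer score. The class name, branch structure, and dimensions are hypothetical illustrations, not the authors' actual architecture or the paper's exact formulation.

```python
import torch
import torch.nn as nn

class CounterfactualVQADebiaser(nn.Module):
    """Sketch of counterfactual debiasing for VQA (hypothetical names/architecture).

    Observed outcome: answer logits from fused vision+question features.
    Counterfactual outcome: logits when visual content is blocked, so only the
    question (and hence the language bias, the confounder) drives the prediction.
    The debiased score keeps the part of the observed outcome that the
    language-only counterfactual cannot explain.
    """

    def __init__(self, fused_dim: int, question_dim: int, num_answers: int):
        super().__init__()
        # Observed branch: classifier over fused vision-question features.
        self.fused_classifier = nn.Linear(fused_dim, num_answers)
        # Counterfactual branch: classifier over question-only features,
        # intended to absorb the language bias.
        self.question_classifier = nn.Linear(question_dim, num_answers)

    def forward(self, fused_feat: torch.Tensor, question_feat: torch.Tensor):
        observed_logits = self.fused_classifier(fused_feat)              # with vision
        counterfactual_logits = self.question_classifier(question_feat)  # vision blocked
        # Debiased answer scores: remove what the question alone already predicts.
        debiased_logits = observed_logits - counterfactual_logits
        return observed_logits, counterfactual_logits, debiased_logits


# Example usage with random features (batch of 4, 3129 candidate answers).
if __name__ == "__main__":
    model = CounterfactualVQADebiaser(fused_dim=512, question_dim=300, num_answers=3129)
    fused = torch.randn(4, 512)
    question = torch.randn(4, 300)
    _, _, debiased = model(fused, question)
    print(debiased.shape)  # torch.Size([4, 3129])
```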
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: visual question answering
Contribution Types: Surveys
Languages Studied: English
Submission Number: 356