Make "V" and "Q" Inseparable: Deliberately Dual-Channel Adversarial Learning for Robust Visual Question Answering
Abstract: Visual Question Answering (VQA) is challenging because vision-language biases prevent models from sufficiently learning multi-modal knowledge from the visual image and the natural-language question simultaneously. Several recent works attempt to alleviate this problem by weakening the language prior, but they ignore the vision prior, hindering further performance improvement. In this paper, we propose a novel Deliberately Dual-Channel Adversarial Learning (DCAL) method to make "V" and "Q" inseparable, which aims to weaken the priors from both vision and language. Specifically, DCAL introduces in-batch random negative sampling to force the model to fail when given mismatched questions or images: it maximizes the likelihood of the correct answer for the original question-image pairs and minimizes it for the random negative samples. To address the problem of false negatives, DCAL exploits a deliberate training strategy for the sampled question-image pairs. The proposed DCAL is model-agnostic and can be applied to various VQA models. Experiments demonstrate that DCAL improves the performance of existing robust VQA models on the bias-sensitive VQA-CP dataset while remaining robust on the balanced VQA v2 dataset.
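A minimal sketch of the dual-channel objective described in the abstract, assuming a PyTorch-style classifier-head VQA model. All names (vqa_model, lambda_v, lambda_q) are illustrative assumptions, not the authors' implementation; in particular, the paper's deliberate training strategy for filtering false negatives is omitted here.

```python
import torch
import torch.nn.functional as F

def dcal_loss(vqa_model, images, questions, answers,
              lambda_v: float = 1.0, lambda_q: float = 1.0):
    """Maximize answer likelihood on matched (image, question) pairs and
    minimize it on in-batch mismatched pairs (random negatives)."""
    # Positive channel: original image-question pairs.
    logits_pos = vqa_model(images, questions)           # (B, num_answers)
    loss_pos = F.cross_entropy(logits_pos, answers)

    # Negative channels: roll the batch so each question is paired with a
    # wrong image ("V" channel) and each image with a wrong question
    # ("Q" channel). Rolling is one simple in-batch sampling scheme.
    wrong_images = torch.roll(images, shifts=1, dims=0)
    wrong_questions = torch.roll(questions, shifts=1, dims=0)
    logits_neg_v = vqa_model(wrong_images, questions)
    logits_neg_q = vqa_model(images, wrong_questions)

    # Log-probability of the ground-truth answer under each mismatched pair;
    # adding it (with positive sign) to the loss pushes that likelihood down.
    log_p_v = F.log_softmax(logits_neg_v, dim=-1).gather(1, answers.unsqueeze(1))
    log_p_q = F.log_softmax(logits_neg_q, dim=-1).gather(1, answers.unsqueeze(1))
    loss_neg = lambda_v * log_p_v.mean() + lambda_q * log_p_q.mean()

    return loss_pos + loss_neg
```

Because the loss only wraps the model's forward pass, this kind of objective is model-agnostic in the sense the abstract claims: any VQA backbone that maps an (image, question) pair to answer logits can be trained with it unchanged.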