Abstract: Visual Question Answering (VQA) models often answer questions based on superficial correlations between question-answer pairs rather than actual reasoning. This leads to good performance on in-distribution (ID) datasets but a significant decline on out-of-distribution (OOD) datasets. Existing debiasing methods primarily focus on single-modal branches, and few studies address multi-modal bias simultaneously. This paper proposes a bias extraction and penalty (BEP) method. Using a generative adversarial network and a knowledge distillation strategy, bias is extracted directly from the VQA model and incorporated into a bias model. Furthermore, a margin penalty is introduced that encodes the frequency of each answer type and the difficulty of a sample's answer as margin information. The size of the margin reflects the degree of bias, and samples with varying degrees of bias are assigned different penalties. Supervised contrastive learning is employed to retain these penalties, enabling the model to focus more on biased samples during training. Additionally, a classifier based on Cross-Entropy (CE) loss is proposed, which has stronger inference ability on ID datasets; the main classifier and the CE-loss classifier are used jointly for inference. Experiments on the challenging VQA-CPv2 and VQA v2 datasets show that BEP achieves state-of-the-art results among non-augmentation debiasing methods while maintaining competitive performance on ID datasets.
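The component of the abstract most amenable to a concrete illustration is the margin penalty retained through supervised contrastive learning. Below is a minimal PyTorch sketch of a supervised contrastive loss with a per-sample additive margin, assuming margins derived from answer-type frequency so that more biased samples incur a stronger penalty. The function name `margin_supcon_loss`, the margin source, and the temperature default are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def margin_supcon_loss(features, labels, margins, temperature=0.1):
    """Supervised contrastive loss with a per-sample additive margin.

    features: (N, D) embeddings from the VQA model
    labels:   (N,) answer-class labels
    margins:  (N,) per-sample margin, e.g. scaled answer-type frequency
              (a larger margin marks a more biased sample and imposes
              a larger penalty)
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.t()                                    # (N, N) cosine similarities
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)

    # positives: samples sharing the same answer label, excluding self
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye

    # subtract the margin from positive-pair similarities before
    # temperature scaling: biased anchors must pull their positives
    # closer than the margin to reduce the loss
    logits = (sim - margins.unsqueeze(1) * pos_mask.float()) / temperature

    # exclude self-similarity from the softmax denominator
    logits = logits.masked_fill(eye, -1e9)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # average the positive-pair log-probabilities per anchor
    pos_count = pos_mask.sum(1)
    has_pos = pos_count > 0
    loss = -(log_prob * pos_mask.float()).sum(1)[has_pos] / pos_count[has_pos]
    return loss.mean()

# toy usage: 8 samples, 16-dim embeddings, 3 answer classes
feats = torch.randn(8, 16)
labels = torch.randint(0, 3, (8,))
margins = torch.rand(8) * 0.2   # hypothetical frequency-derived margins
loss = margin_supcon_loss(feats, labels, margins)
```

Under these assumptions, setting all margins to zero recovers the standard supervised contrastive loss, which makes the margin term an isolated knob for how harshly biased samples are penalized.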
External IDs: dblp:journals/apin/ZhangHPL25