Combating Visual Question Answering Hallucinations via Robust Multi-Space Co-Debias Learning

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM 2024 Poster, CC BY 4.0
Abstract: The challenge of bias in visual question answering (VQA) has gained considerable attention in contemporary research. Intricate bias dependencies, such as modality and data imbalances, can introduce semantic ambiguities that shift VQA instances in the feature space, a phenomenon referred to as VQA hallucination. These shifts produce distributions that deviate significantly from the true data, leading the model to make factually incorrect predictions. To address this challenge, we propose a robust Multi-Space Co-debias Learning (MSCD) approach for combating VQA hallucinations, which mitigates bias-induced instance and distribution shifts across multiple spaces under a unified paradigm. Specifically, we design bias-aware and prior-aware debiasing constraints that use the angles and angular margins of spherical space to relate biases, priors, and instances, thereby refining the manifold representation for instance debiasing and distribution de-dependence. Moreover, we exploit the inherent tendency of Euclidean-space models to overfit, extracting bias components from biased examples and injecting modal counterexamples to further assist multi-space robust learning. By integrating homeomorphic instances across spaces, MSCD enhances the model's grasp of the structural relationships between semantics and answer classes, yielding robust representations that do not rely solely on training priors. In this way, our co-debias paradigm produces representations that effectively mitigate biases and combat hallucinations. Extensive experiments on multiple benchmark datasets consistently demonstrate that MSCD outperforms state-of-the-art baselines.
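For concreteness, the bias-aware spherical constraint can be pictured as an angular-margin loss on the unit hypersphere. The sketch below is an illustrative reading of the abstract, not the authors' code: the per-instance bias score `bias_score`, the scale `s`, and the margin `m` are hypothetical names, and the exact margin schedule used by MSCD is not specified here.

```python
# Minimal sketch (assumption, not MSCD's actual loss): cross-entropy over
# cosine logits with a per-instance additive angular margin that grows
# with the instance's estimated bias, pushing bias-prone instances away
# from their prior-favored class boundary on the sphere.
import torch
import torch.nn.functional as F

def bias_aware_angular_loss(feats, weights, labels, bias_score, s=30.0, m=0.3):
    f = F.normalize(feats, dim=1)    # project instance features onto the sphere
    w = F.normalize(weights, dim=1)  # class prototypes on the sphere
    cos = f @ w.t()                  # cosine similarity, a proxy for the angle
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    # Enlarge the angular margin of the ground-truth class in proportion to
    # the bias score (hypothetical schedule; the paper's is not given here).
    margin = m * bias_score.unsqueeze(1)
    onehot = F.one_hot(labels, cos.size(1)).float()
    logits = s * torch.cos(theta + margin * onehot)
    return F.cross_entropy(logits, labels)
```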
Relevance To Conference: The proposed MSCD method contributes to multimodal fusion and interpretation by addressing VQA hallucinations that stem from biases. It tackles the problem by integrating spherical- and Euclidean-space co-debias learning into a unified framework. On the one hand, we design bias-aware and prior-aware debiasing constraints for spherical debias learning, explicitly calibrating instance shift and distribution shift to alleviate VQA hallucinations. On the other hand, we propose a multi-space co-debias paradigm that deploys a two-stage strategy in Euclidean space, assisting spherical debias learning in exposing prior correlations and modality-semantics interplay. Extensive experiments on two biased benchmark datasets and one balanced dataset demonstrate that the method combats VQA hallucination and achieves state-of-the-art performance.
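The two-stage Euclidean strategy can likewise be sketched, again only as a hedged illustration under assumed design choices: a question-only branch is deliberately allowed to overfit the answer prior, its confidence down-weights the main loss, and shuffled image-question pairs act as modal counterexamples. All names (`euclidean_co_debias_step`, `bias_branch`, etc.) are hypothetical and do not come from the paper.

```python
# Minimal sketch (assumption, not the authors' implementation) of a
# two-stage Euclidean assist: expose the prior with a bias-only branch,
# then reweight the main loss and inject modal counterexamples.
import torch
import torch.nn.functional as F

def euclidean_co_debias_step(model, bias_branch, img, ques, labels):
    # Stage 1: the bias branch sees only the question, so it can only fit
    # the question-answer prior, i.e. the bias we want to expose.
    bias_logits = bias_branch(ques)
    bias_loss = F.cross_entropy(bias_logits, labels)

    # Stage 2: down-weight instances the bias branch already answers
    # confidently, so the main model cannot lean on the same prior.
    with torch.no_grad():
        w = 1.0 - F.softmax(bias_logits, dim=1).gather(1, labels[:, None]).squeeze(1)
    main_logits = model(img, ques)
    main_loss = (w * F.cross_entropy(main_logits, labels, reduction="none")).mean()

    # Modal counterexample injection: pair each question with a shuffled
    # image and push the model toward uncertainty on mismatched pairs.
    perm = torch.randperm(img.size(0), device=img.device)
    neg_logits = model(img[perm], ques)
    uniform = torch.full_like(neg_logits, 1.0 / neg_logits.size(1))
    counter_loss = F.kl_div(F.log_softmax(neg_logits, dim=1), uniform,
                            reduction="batchmean")

    return main_loss + bias_loss + counter_loss
```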
Supplementary Material: zip
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Media Interpretation
Submission Number: 5417