Modality Re-Balance for Visual Question Answering: A Causal Framework

Published: 01 Jan 2024 · Last Modified: 06 Feb 2025 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Visual Question Answering (VQA) models often prioritize language cues over visual knowledge, a phenomenon known as the "language prior". To address it, researchers have proposed methods that balance language and image information during training and inference. However, these approaches often fail to capture important linguistic components because they exclude too much language information. Inspired by causal inference, we introduce the SyMmetrically Balanced Causal framework (SMBC), which rebalances visual and textual information in VQA so that knowledge from both modalities contributes equally to the inference result. Experiments show that SMBC 1) applies to prevalent VQA models, including those trained with data augmentation, and 2) consistently improves performance on established benchmarks.
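The abstract does not spell out SMBC's architecture, but the general idea of causal modality re-balancing can be illustrated with a minimal sketch: fuse both modalities for the full prediction, then symmetrically subtract the single-modality "shortcut" predictions so that neither the language-only nor the vision-only branch dominates. All names below (`fusion_model`, `q_only_head`, `v_only_head`, `alpha`) are hypothetical and assume standard counterfactual-style debiasing; this is not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SymmetricDebiasVQA(nn.Module):
    """Illustrative sketch of symmetric modality de-biasing at inference time."""

    def __init__(self, fusion_model, q_only_head, v_only_head, alpha=1.0):
        super().__init__()
        self.fusion_model = fusion_model  # full VQA model: (v_feat, q_feat) -> answer logits
        self.q_only_head = q_only_head    # language-only branch: q_feat -> answer logits
        self.v_only_head = v_only_head    # vision-only branch: v_feat -> answer logits
        self.alpha = alpha                # strength of the shortcut subtraction (assumed hyperparameter)

    def forward(self, v_feat, q_feat):
        fused_logits = self.fusion_model(v_feat, q_feat)  # joint prediction from both modalities
        q_logits = self.q_only_head(q_feat)               # language-only shortcut prediction
        v_logits = self.v_only_head(v_feat)               # vision-only shortcut prediction
        # Subtract both single-modality effects with equal weight so that
        # neither modality alone can dominate the final answer distribution.
        bias = (torch.sigmoid(q_logits) + torch.sigmoid(v_logits)) / 2
        return fused_logits - self.alpha * bias
```

The symmetric subtraction is the key contrast with language-only debiasing schemes, which remove only the question-side shortcut and can therefore discard useful linguistic information.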