Abstract: In Visual Question Answering (VQA), both the image and its accompanying question serve as the primary sources of information for the model. Conventional approaches typically rely heavily on dense visual representations for reasoning and answer prediction. However, when the visual and textual modalities are imbalanced or semantically misaligned, such disparities hinder effective multimodal learning and inference. To address this issue, we propose a multimodal information adjustment method, the Visual Text Information Adjuster (ViTA). ViTA investigates the impact of embedding textual cues within images on the VQA process and promotes cross-modal balance to improve accuracy. Specifically, since image content often dominates over question content, ViTA adjusts the balance by either masking visual information or augmenting it with object-word visual cues directly embedded in the image. Experimental results validate our hypothesis and further demonstrate that ViTA can serve as an effective data augmentation strategy, yielding measurable improvements across multiple VQA models. The code will be released at https://github.com/xqx23/ViTA.
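The two adjustments named in the abstract, masking visual information and embedding an object word in the image, can be illustrated with a minimal sketch. This is not the released ViTA code. The function names `mask_region` and `embed_object_word`, the box coordinates, the colors, and the use of Pillow are all assumptions made for illustration. How ViTA actually selects regions or words is not specified in the abstract.

```python
from PIL import Image, ImageDraw


def mask_region(image, box, fill=(0, 0, 0)):
    """Return a copy of `image` with `box` (left, top, right, bottom) filled.

    Hypothetical helper: reduces visual information by hiding a region.
    """
    out = image.copy()  # leave the caller's image untouched
    ImageDraw.Draw(out).rectangle(box, fill=fill)
    return out


def embed_object_word(image, word, position, fill=(255, 0, 0)):
    """Return a copy of `image` with `word` drawn at `position` (x, y).

    Hypothetical helper: adds a textual cue directly into the pixels,
    using Pillow's default font.
    """
    out = image.copy()
    ImageDraw.Draw(out).text(position, word, fill=fill)
    return out


if __name__ == "__main__":
    # A blank white canvas stands in for a real VQA image (assumption).
    img = Image.new("RGB", (64, 64), (255, 255, 255))
    masked = mask_region(img, (10, 10, 30, 30))
    cued = embed_object_word(img, "cat", (2, 2))
    print(masked.getpixel((20, 20)), cued.size)
```

Both helpers return modified copies, so the same source image could in principle yield several augmented variants for training.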
External IDs: doi:10.1145/3805048