Alignment and Multimodal Reasoning for Remote Sensing Visual Question Answering

Published: 01 Jan 2024 · Last Modified: 26 Jun 2025 · IGARSS 2024 · CC BY-SA 4.0
Abstract: Visual question answering for remote sensing data (RSVQA) has recently emerged as a prominent research area in remote sensing. Transformer-based approaches have demonstrated impressive results, owing to their strength in jointly modeling visual and textual modalities. However, existing RSVQA methods often overlook the modality biases present in vision-language interactions, leading to inaccurate answers. To address this issue, we propose a novel Transformer-based approach aimed at mitigating modality biases in RSVQA. Specifically, we introduce a contrastive learning loss that aligns image and text representations before cross-modal fusion, providing a stronger foundation for learning visual and language representations. We then design a cross-modal decoder to comprehensively capture the correlations between images and text. Notably, in addition to predicting the answer to a question, we incorporate an additional head that predicts the question type via regression. Experimental results show that our approach achieves higher answer-prediction accuracy than state-of-the-art (SoTA) methods, establishing a new state of the art.
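The abstract does not specify implementation details, so the following is only a minimal PyTorch sketch of the general idea it describes: a CLIP-style contrastive (InfoNCE) loss aligning pooled image and text features before fusion, a generic cross-attention block standing in for the cross-modal decoder, and an answer head plus an auxiliary question-type head. All module names, feature dimensions, pooling choices, and the regression formulation of the question-type head are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RSVQASketch(nn.Module):
    """Illustrative sketch (not the paper's architecture): projected image/text
    features, a contrastive alignment loss before fusion, a cross-modal decoder,
    and two heads (answer classification + question-type regression)."""

    def __init__(self, dim=256, num_answers=100):
        super().__init__()
        # Placeholder projections; the paper presumably uses Transformer backbones.
        self.image_proj = nn.Linear(768, dim)   # assumed visual feature size
        self.text_proj = nn.Linear(768, dim)    # assumed textual feature size
        self.temperature = nn.Parameter(torch.tensor(0.07))
        # Generic cross-attention layer standing in for the cross-modal decoder.
        self.fusion = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.answer_head = nn.Linear(dim, num_answers)
        # The abstract describes question-type prediction as a regression task,
        # so this head outputs a single scalar per sample (an assumption).
        self.qtype_head = nn.Linear(dim, 1)

    def contrastive_loss(self, img_emb, txt_emb):
        # Symmetric InfoNCE over the batch: matched image-text pairs are positives.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, Nv, 768) patch features; txt_feats: (B, Nt, 768) token features.
        img_tokens = self.image_proj(img_feats)
        txt_tokens = self.text_proj(txt_feats)
        # Align mean-pooled representations before cross-modal fusion.
        align_loss = self.contrastive_loss(img_tokens.mean(1), txt_tokens.mean(1))
        # Fuse modalities: text tokens attend to image tokens (one possible arrangement).
        fused = self.fusion(txt_tokens, img_tokens).mean(1)
        return self.answer_head(fused), self.qtype_head(fused), align_loss


if __name__ == "__main__":
    model = RSVQASketch()
    img = torch.randn(8, 49, 768)   # dummy patch features
    txt = torch.randn(8, 20, 768)   # dummy token features
    answer_logits, qtype_pred, align_loss = model(img, txt)
    print(answer_logits.shape, qtype_pred.shape, align_loss.item())
```

In such a setup the total training objective would typically combine the answer cross-entropy, the auxiliary question-type loss, and the alignment loss with weighting coefficients; the weights used by the paper are not stated in the abstract.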