A Cross-Modal Semantic Mapping Enhancement Model for Remote Sensing Visual Question Answering

Published: 01 Jan 2024, Last Modified: 01 Oct 2024 · IGARSS 2024 · CC BY-SA 4.0
Abstract: Remote sensing visual question answering (RSVQA) aims to answer questions based on the content of remote sensing (RS) images. Because RS images are complex, it is challenging to focus on the regions of an RS image that are relevant to a given question. To this end, we propose a channel-selective multi-scale cross-attention (CSCa) model for RSVQA tasks. Specifically, we design a text-driven multi-scale feature extractor that extracts question-related features from RS images. To obtain the cross-attention map in this extractor, we design a novel channel selection mechanism that captures question-related regions in RS images more accurately, and we develop a channel-wise contrastive learning task to align the semantics of image and text features. We conduct experiments on the RSVQA-LR and RSIVQA datasets, and the results show that CSCa achieves excellent performance.
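The abstract does not include implementation details, so the following is only a minimal PyTorch sketch of the general idea of question-driven channel selection combined with cross-attention: a pooled question embedding gates the image-feature channels before text tokens attend to the gated image tokens. All module and parameter names here (e.g. `ChannelSelectiveCrossAttention`, `gate`) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class ChannelSelectiveCrossAttention(nn.Module):
    """Illustrative sketch (not the authors' implementation):
    question-driven channel gating followed by text-to-image
    cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Channel-selection gate: the pooled question embedding scores each
        # image-feature channel, suppressing question-irrelevant channels.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        # img_feats: (B, N, C) flattened image tokens (e.g. multi-scale)
        # txt_feats: (B, T, C) question token embeddings
        q_global = txt_feats.mean(dim=1)              # (B, C) pooled question
        channel_weights = self.gate(q_global)         # (B, C), values in [0, 1]
        selected = img_feats * channel_weights.unsqueeze(1)  # gate channels
        # Cross-attention: question tokens attend to the selected image tokens,
        # yielding an attention map over image regions per question token.
        out, attn_map = self.attn(txt_feats, selected, selected)
        return out, attn_map


# Minimal usage example with random tensors.
if __name__ == "__main__":
    block = ChannelSelectiveCrossAttention(dim=256)
    img = torch.randn(2, 196, 256)   # e.g. 14x14 grid of image tokens
    txt = torch.randn(2, 12, 256)    # e.g. 12 question tokens
    out, attn = block(img, txt)
    print(out.shape, attn.shape)     # (2, 12, 256), (2, 12, 196)
```

The channel-wise contrastive alignment mentioned in the abstract would be an additional training objective on top of such a block; it is omitted here since the abstract gives no loss formulation.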