Hierarchical Multimodality Graph Reasoning for Remote Sensing Visual Question Answering

Published: 2024 (Last Modified: 25 Feb 2025). IEEE Trans. Geosci. Remote Sens., 2024. License: CC BY-SA 4.0.
Abstract: Remote sensing visual question answering (RSVQA) aims to answer natural-language questions about remote sensing (RS) images. RSVQA in real-world applications is challenging, as it may involve wide-field visual information and complicated queries. Current RSVQA methods overlook the semantic hierarchy of visual and linguistic information and ignore the complex relations among multimodal instances; consequently, they fall short of comprehensively representing and associating vision–language semantics. In this research, we design an innovative end-to-end model, named the Hierarchical Multimodality Graph Reasoning (HMGR) network, which hierarchically learns multigranular vision–language joint representations and interactively parses heterogeneous multimodal relationships. Specifically, we design a hierarchical vision–language encoder (HVLE) that simultaneously represents multiscale vision features and multilevel language features. Based on these representations, vision–language semantic graphs are built, and parallel multimodal graph relation reasoning is performed to explore the complex interaction patterns and implicit semantic relations of both intramodality and intermodality instances. Moreover, we propose a distinctive vision–question (VQ) feature fusion module that combines information at different semantic levels. Extensive experiments on three public large-scale datasets (RSVQA-LR, RSVQA-HRv1, and RSVQA-HRv2) demonstrate that our method outperforms state-of-the-art results across a wide range of image and query types.
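The abstract outlines three stages: hierarchical encoding of multiscale visual and multilevel language features, parallel intra- and inter-modality graph relation reasoning over vision–language semantic graphs, and VQ feature fusion for answer prediction. The sketch below is an illustrative PyTorch rendering of that outline only, not the authors' released implementation; all module names (GraphRelationLayer, HMGRSketch), dimensions, the similarity-based soft adjacency, the mean-pooling fusion, and the number of answer classes are assumptions made for the example, and the hierarchical encoder itself is abstracted to pre-extracted node features.

```python
# Illustrative sketch only: hypothetical module names and shapes;
# not the HMGR paper's actual implementation.
import torch
import torch.nn as nn


class GraphRelationLayer(nn.Module):
    """One round of graph reasoning: build a soft adjacency from node
    similarity, then propagate messages between graph nodes."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, nodes, context=None):
        # nodes: (B, N, dim). context defaults to nodes (intra-modality);
        # passing another modality's nodes gives inter-modality reasoning.
        context = nodes if context is None else context
        q, k, v = self.query(nodes), self.key(context), self.value(context)
        adj = torch.softmax(q @ k.transpose(-1, -2) / k.size(-1) ** 0.5, dim=-1)
        return nodes + self.out(adj @ v)  # residual message passing


class HMGRSketch(nn.Module):
    """Minimal stand-in for the pipeline described in the abstract:
    graph reasoning over visual and language nodes, then VQ fusion."""

    def __init__(self, dim=512, num_answers=95):  # num_answers is hypothetical
        super().__init__()
        self.vis_graph = GraphRelationLayer(dim)    # intra-modality (vision)
        self.lang_graph = GraphRelationLayer(dim)   # intra-modality (language)
        self.cross_graph = GraphRelationLayer(dim)  # inter-modality
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, vis_nodes, lang_nodes):
        # Parallel intra-modality reasoning on each semantic graph.
        v = self.vis_graph(vis_nodes)
        l = self.lang_graph(lang_nodes)
        # Inter-modality reasoning: vision nodes attend to language nodes.
        v = self.cross_graph(v, context=l)
        # Vision-question fusion: pool each graph and combine the summaries.
        joint = self.fuse(torch.cat([v.mean(dim=1), l.mean(dim=1)], dim=-1))
        return self.classifier(joint)  # answer logits


if __name__ == "__main__":
    # Toy shapes: 3 images, 36 multiscale region nodes, 20 language nodes.
    model = HMGRSketch()
    logits = model(torch.randn(3, 36, 512), torch.randn(3, 20, 512))
    print(logits.shape)  # torch.Size([3, 95])
```

In this toy version the "hierarchy" would enter through how the visual and language node sets are built (multiscale regions, word/phrase/sentence levels); the paper's actual HVLE, graph construction, and fusion are more elaborate than this single-layer reasoning pass.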