Scene Graph-Aware Hierarchical Fusion Network for Remote Sensing Image Retrieval With Text Feedback

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · IEEE Trans. Geosci. Remote Sens. 2024 · CC BY-SA 4.0
Abstract: In the realm of image retrieval with text feedback, existing studies have predominantly concentrated on the intrinsic attributes of target objects, neglecting extrinsic information essential for remote sensing (RS) images, such as spatial relationships. This research addresses that gap by incorporating RS image scene graphs as side information, given their capacity to encapsulate internal object attributes, external structural relations between objects, and the relationships among images. To fully leverage the features of the reference RS image, the scene graph, and the modifier sentence, we propose a scene graph-aware hierarchical fusion network (SHF), which integrates the multimodal features in a two-stage fusion process. First, image and scene graph features are fused hierarchically; content information is then transformed with a proposed multimodal global content (MGC) block, and style information is transformed last. To validate the superiority of SHF, we constructed three datasets with images drawn from several popular RS datasets, named Airplane (3461 image + text-image pairs), Tennis (1924 image + text-image pairs), and WHIRT (3344 image + text-image pairs). Extensive experiments conducted on these datasets show that SHF significantly outperforms state-of-the-art methods.
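The abstract's two-stage pipeline (hierarchical image/scene-graph fusion, then content transformation via the MGC block, then style transformation) can be illustrated with a minimal sketch. This is not the authors' implementation; all function names, the element-wise fusion operators, and the toy 4-dimensional features are hypothetical stand-ins for the paper's learned modules.

```python
# Illustrative sketch only: the SHF two-stage fusion described in the
# abstract, with plain Python lists standing in for learned feature
# vectors. Every operator here is a hypothetical placeholder.

def hierarchical_fuse(image_feat, graph_feat):
    """Stage 1: fuse reference-image and scene-graph features
    (element-wise sum as a stand-in for hierarchical fusion)."""
    return [i + g for i, g in zip(image_feat, graph_feat)]

def mgc_transform(fused_feat, text_feat):
    """Stage 2a: transform global content conditioned on the modifier
    sentence (stand-in for the multimodal global content block)."""
    return [f * t for f, t in zip(fused_feat, text_feat)]

def style_transform(content_feat, text_feat):
    """Stage 2b: transform style information conditioned on the text
    (a simple global shift as a placeholder)."""
    mean_t = sum(text_feat) / len(text_feat)
    return [c + mean_t for c in content_feat]

def shf_forward(image_feat, graph_feat, text_feat):
    """Full pipeline: hierarchical fusion -> content -> style."""
    fused = hierarchical_fuse(image_feat, graph_feat)
    content = mgc_transform(fused, text_feat)
    return style_transform(content, text_feat)

# Toy 4-dimensional features for the reference image, scene graph,
# and modifier sentence.
query = shf_forward([1.0, 2.0, 3.0, 4.0],
                    [0.5, 0.5, 0.5, 0.5],
                    [1.0, 1.0, 1.0, 1.0])
print(query)  # -> [2.5, 3.5, 4.5, 5.5]
```

In the actual network each stage would be a learned neural module and the output would serve as the query embedding for retrieving the target image.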