Cross-Modal Progressive Perspective Matching Network for Remote Sensing Image-Text Retrieval

Chengyu Zheng, Xiu Li, Xinyue Liang, Lei Huang, Shan Du, Jie Nie, Junyu Dong

Published: 01 Jan 2025, Last Modified: 18 Jul 2025, Venue: IEEE Trans. Multim. 2025, License: CC BY-SA 4.0

Abstract: Cross-modal remote sensing (RS) image-text retrieval has gained increasing attention in recent years because it combines the rich semantics of images with the understandability of text to provide a more comprehensive description. Existing cross-modal retrieval methods typically apply self-attention or cross-attention mechanisms to identify important information in RS data, but they ignore the multi-view perception characteristic of geographical space in RS images. As a result, these retrieval models fail to locate the correct perspective in an image according to the query text, which ultimately leads to incorrect matching. In this work, a Cross-modal Progressive Perspective Matching Network (CPPMN) is proposed for RS image-text retrieval; it establishes a progressive perspective matching mechanism and semantic alignment to improve retrieval performance. Specifically, the CPPMN framework consists of three core modules: the Compensation Network for Full Perspective Modeling (CN_FPM), the Graph Transformation for Individual Perspective Modeling (GT_IPM), and the Cascaded Transformer for Cross-modal Semantic Alignment (CT_CSA). The CN_FPM module uses all positive text samples as supervision signals to guide feature-extraction training, aiming to capture full-perspective information from images. Subsequently, the GT_IPM module transforms implicit-perspective feature representations into explicit-perspective cross-modal relationship graphs. By analyzing graph density and connectivity, this transformation enables the identification of specific perspective locations within the image according to the query sentence. Finally, the CT_CSA module comprises a cascaded Transformer network that aligns cross-modal features at the semantic level. Quantitative and qualitative experiments on four large-scale RS cross-modal retrieval datasets demonstrate the effectiveness of the progressive perspective matching mechanism and semantic alignment strategy.
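
The abstract does not give the CN_FPM training objective, so the following is only an illustrative sketch of one common way to let all positive captions supervise a single image: an InfoNCE-style contrastive loss averaged over every positive text. The function names (cosine, multi_positive_loss), the temperature value, and the toy vectors are assumptions made for illustration and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multi_positive_loss(image_vec, text_vecs, positive_ids, temperature=0.1):
    """Hypothetical loss: mean negative log-probability of every positive caption.

    image_vec    -- embedding of one image
    text_vecs    -- embeddings of all candidate captions in the batch
    positive_ids -- indices of the captions that describe this image
    """
    logits = [cosine(image_vec, t) / temperature for t in text_vecs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    # Average over all positives so each caption contributes supervision.
    total = sum(logits[i] - log_denominator for i in positive_ids)
    return -total / len(positive_ids)


# Toy batch: captions 0 and 1 describe the image, caption 2 does not.
captions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aligned = multi_positive_loss([1.0, 0.0], captions, [0, 1])
misaligned = multi_positive_loss([0.0, 1.0], captions, [0, 1])
print(aligned < misaligned)
```

In this toy setup the image embedding that points toward both positive captions receives the lower loss, which is the behavior one would want from supervision that covers every caption of an image rather than a single sampled one.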