PMCFNet: Prompt-Guided Multi-scale Cross-Modal Fusion Network for Referring Remote Sensing Image Segmentation

Yuqiu Kong, Wenjie Wu, Zijian Wang, Shenglan Liu

Published: 2025, Last Modified: 28 Feb 2026, Venue: PRCV (5) 2025, License: CC BY-SA 4.0

Abstract: Referring Remote Sensing Image Segmentation (RRSIS) is a challenging task that advances scene understanding in remote sensing image processing. Existing methods predominantly focus on enhancing visual features with auxiliary text inputs, while overlooking mutual cross-modal interactions. In this work, we propose a novel Prompt-Guided Multi-Scale Cross-Modal Fusion Network (PMCFNet) to achieve fine-grained semantic alignment between visual and textual features. PMCFNet primarily consists of a Dual-Parallel Interaction Module (DPIM) and a Cross-layer Multi-scale Fusion Module (CMFM). The DPIM establishes a parallel architecture that facilitates bi-directional information flow, allowing the two modalities to exchange complementary information. The CMFM combines visual and textual features across multiple layers, enhancing the understanding of both global and local features. Additionally, we introduce a prompt-guided learning strategy that further enriches textual representations by embedding location and target-specific knowledge. This strategy is particularly critical for identifying multi-scale objects in high-resolution remote sensing scenes, significantly improving the model’s discriminative capabilities. Experiments on the RefSegRS and RRSIS-D benchmarks demonstrate that our method achieves state-of-the-art performance, with improvements of 1.91% in mIoU and 1.41% in oIoU on the challenging RRSIS-D dataset. The source code will be publicly available at https://github.com/woaiqianzhihe/PMDF
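
The abstract reports results in mIoU and oIoU without defining them. The sketch below follows the definitions commonly used in referring segmentation: mIoU averages the per-sample IoU, while oIoU divides the intersection summed over all samples by the summed union. It is an assumption that the paper uses exactly these definitions; the function names `iou_stats` and `miou_and_oiou`, and the convention for an empty union, are hypothetical and chosen for illustration only.

```python
def iou_stats(pred, target):
    """Return (intersection, union) pixel counts for two binary masks.

    Masks are nested lists of 0/1 values with identical shapes.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def miou_and_oiou(preds, targets):
    """Compute (mIoU, oIoU) over a list of predicted and ground-truth masks."""
    per_sample = []
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = iou_stats(pred, target)
        # Assumed convention: an empty union counts as a perfect match.
        per_sample.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    miou = sum(per_sample) / len(per_sample)
    oiou = total_inter / total_union if total_union else 1.0
    return miou, oiou


# Toy example: one small object (IoU 1/2) and one large object (IoU 7/8).
small_pred, small_gt = [[1, 0], [0, 0]], [[1, 1], [0, 0]]
large_pred = [[1, 1, 1, 1], [1, 1, 1, 1]]
large_gt = [[1, 1, 1, 1], [1, 1, 1, 0]]

miou, oiou = miou_and_oiou([small_pred, large_pred], [small_gt, large_gt])
print(f"mIoU={miou:.4f}, oIoU={oiou:.4f}")
```

In this toy case the small object lowers mIoU more than oIoU, because oIoU weights each sample by its pixel count. Under these definitions, mIoU is the more sensitive of the two to small targets, which is relevant when objects vary widely in scale.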

External IDs: dblp:conf/prcv/KongWWL25