Multimodal Visual-Language Prompt Network for Remote Sensing Few-Shot Segmentation

Zhenhao Yang, Fukun Bi, Jianhong Han, Xianping Ma, Chenglong He, Wenkai Liu

Published: 2025, Last Modified: 05 Nov 2025, Venue: IEEE Trans. Geosci. Remote. Sens. 2025, License: CC BY-SA 4.0

Abstract: Few-shot segmentation (FSS) aims to segment objects of interest in a query image using a limited set of support images. However, most existing FSS methods are designed for natural images. When extended to remote sensing scenes, which are characterized by extreme intraclass variations and complex backgrounds, these methods struggle to provide robust segmentation guidance, leading to severe performance degradation. To address these issues, we propose a multimodal visual-language prompt network (MVLPNet), which employs a collaborative optimization strategy for visual–textual features to tackle the remote sensing FSS task. Specifically, MVLPNet consists of a textual–visual consistency enhancement (TVCE) module and a prototype-guided semantic alignment (PGSA) module. To compensate for the limited support set and better guide query segmentation, the TVCE module leverages the contrastive language-image pretraining (CLIP) model to capture category-specific text embeddings. An optimal transport (OT) plan is then established to tightly align these text embeddings with the visual features of the query image, thereby extracting semantic information from the query image itself to mitigate the extreme intraclass variation in remote sensing images. Furthermore, the PGSA module is proposed to suppress interference caused by complex background regions. By aggregating lost foreground regions, it extracts more comprehensive support features. The query and support features are then precisely matched to activate consistent foreground regions, rather than ambiguously matching the query features via a single prototype or multiple prototypes. Extensive experiments on the iSAID-5i and LoveDA-2i datasets demonstrate that our method achieves state-of-the-art performance. The code is available at https://github.com/Gritiii/MVLPNet
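
To make the OT alignment step more concrete, the following is a minimal sketch of an entropy-regularized optimal transport plan between text embeddings and query pixel features, computed with Sinkhorn iterations. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the cost function, the marginals, or the solver, so the cosine cost, uniform marginals, regularization strength `eps`, and all function and variable names here are hypothetical, and random vectors stand in for CLIP text embeddings and backbone features.

```python
import numpy as np

def cosine_cost(text_emb, pixel_feat):
    """Cost matrix 1 - cosine similarity; shape (n_text, n_pixels)."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    p = pixel_feat / np.linalg.norm(pixel_feat, axis=1, keepdims=True)
    return 1.0 - t @ p.T

def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=500):
    """Entropic OT plan with row marginal a and column marginal b."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # rescale rows toward marginal a
        v = b / (K.T @ u)                # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

# Assumption: random stand-ins for two text embeddings
# (row 0 = background, row 1 = target class) and an 8x8 query feature map.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(2, 16))
pixel_feat = rng.normal(size=(64, 16))

a = np.full(2, 1.0 / 2)                  # uniform mass over text embeddings
b = np.full(64, 1.0 / 64)                # uniform mass over pixel locations
plan = sinkhorn_plan(cosine_cost(text_emb, pixel_feat), a, b)

# Share of each location's mass transported from the target-class embedding,
# reshaped into a coarse foreground prior over the query feature map.
prior = (plan[1] / b).reshape(8, 8)
```

In this sketch, `prior` takes values between 0 and 1 and could act as a text-derived foreground cue for the query image; how MVLPNet actually uses the transport plan is described only at a high level in the abstract.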

External IDs: dblp:journals/tgrs/YangBHMHL25