Boosting remote semantic segmentation using vision-and-language foundation model

Published: 2025 · Last Modified: 15 Jan 2026 · Vis. Comput. 2025 · CC BY-SA 4.0
Abstract: In recent years, the visual analysis and processing of remote sensing images have attracted growing interest. Vision-language foundation models (such as RemoteCLIP) embed rich prior knowledge from large collections of remote sensing images through extensive pre-training. Although these models perform well on image-level tasks, their prior knowledge has not been fully exploited for pixel-level segmentation. To address this issue, we propose a lightweight fusion framework named Remote Foundation Model for Segmentation (RFM-Seg). The framework trains V-branch connectors, VL-branch connectors, and the VL-map module while keeping both the foundation model and the remote sensing segmentation model frozen. These modules effectively integrate multi-scale and multi-modal prior knowledge from remote sensing images into mainstream remote sensing segmentation models, thereby improving performance on pixel-level segmentation tasks. We validated the effectiveness of the framework on four challenging aerial image segmentation benchmarks: ISPRS Vaihingen, ISPRS Potsdam, Aerial, and LoveDA Urban. Experimental results demonstrate that RFM-Seg achieves state-of-the-art performance while maintaining highly efficient training and inference. The source code will be released at https://github.com/NBTAILAB/RFM-Seg.
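The abstract describes a parameter-efficient recipe: both large models stay frozen and only small connector modules are trained. A minimal PyTorch sketch of that general pattern, assuming stand-in modules throughout (the class and variable names here are illustrative, not taken from the RFM-Seg code):

```python
import torch
from torch import nn

class Connector(nn.Module):
    """Hypothetical lightweight adapter that projects frozen foundation-model
    features into the segmentation model's feature space."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# Stand-ins for the frozen models (a real setup would load, e.g., a
# RemoteCLIP encoder and a pretrained segmentation network).
foundation = nn.Linear(512, 512)   # frozen foundation-model encoder
segmenter = nn.Linear(256, 6)      # frozen segmentation head (6 classes)
connector = Connector(512, 256)    # the only trainable component

# Freeze both large models; gradients flow only into the connector.
for p in foundation.parameters():
    p.requires_grad_(False)
for p in segmenter.parameters():
    p.requires_grad_(False)

trainable = [p for p in connector.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

x = torch.randn(2, 512)            # dummy batch of pooled image features
logits = segmenter(connector(foundation(x)))
print(tuple(logits.shape))         # (2, 6)
```

Because only the connector's parameters are passed to the optimizer, training cost and memory scale with the adapter rather than with the frozen backbones, which is what makes this kind of fusion framework lightweight.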