Abstract: Referring Remote Sensing Image Segmentation (RRSIS) aims to segment specific regions from remote sensing images based on textual descriptions. Due to the complexity and diversity of remote sensing images, and the challenges posed by targets at multiple scales and orientations, traditional Referring Image Segmentation (RIS) methods yield suboptimal results. To address these challenges, we propose a novel RRSIS network, namely Vision-Text Interaction with Orientation-Awareness (VOA). Specifically, we design a Variable Scale Rotated Convolution Module (VSRC) to capture multi-scale and rotated targets, thereby obtaining orientation information. Meanwhile, a Bidirectional Interactive Enhancement Module (BIEM) integrates this orientation information into the vision-text interaction to facilitate cross-modal feature alignment and intra-modal feature enhancement. Furthermore, we introduce a Fine-Grained Gated Fusion Module (FGF) to promote fine-grained vision-text alignment, which is crucial for accurate segmentation mask prediction. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art methods on all currently available datasets. The code is available at https://github.com/Wlittles/VOA.
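The abstract only names the VSRC module without detailing its mechanics. As a rough illustration of what a variable-scale rotated convolution could look like, the sketch below combines multi-dilation 3x3 branches (variable scale) with four 90-degree feature-map orientations (rotation). Everything here, the class name, the dilation rates, and the rot90-based orientation handling, is an assumption for illustration, not the authors' implementation; their actual code is in the linked repository.

```python
# A minimal, hypothetical sketch of a variable-scale rotated convolution in
# PyTorch. Names and the rotation/fusion scheme are illustrative assumptions,
# not the paper's VSRC module.
import torch
import torch.nn as nn


class VSRCSketch(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        # One 3x3 branch per dilation rate to cover multiple scales.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        # 1x1 conv fusing the concatenated multi-scale, multi-orientation maps.
        self.fuse = nn.Conv2d(channels * len(dilations) * 4, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = []
        for k in range(4):  # four 90-degree orientations of the feature map
            xr = torch.rot90(x, k, dims=(2, 3))
            for conv in self.branches:
                # Rotate each response back so all maps share one orientation.
                feats.append(torch.rot90(conv(xr), -k, dims=(2, 3)))
        return self.fuse(torch.cat(feats, dim=1))


# Usage: fuse a batch of 64-channel feature maps.
# fmap = torch.randn(2, 64, 32, 32)
# out = VSRCSketch(64)(fmap)  # shape (2, 64, 32, 32)
```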
External IDs: dblp:conf/icann/WuLXZY25