Abstract: Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery, leading to suboptimal segmentation results. To address these challenges, we introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS. RMSIN incorporates an Intra-scale Interaction Module (IIM) to effectively address the fine-grained detail required at multiple scales and a Cross-scale Interaction Module (CIM) for integrating these details coherently across the network. Furthermore, RMSIN employs an Adaptive Rotated Convolution (ARC) to account for the diverse orientations of objects, a novel contribution that significantly enhances segmentation accuracy. To assess the efficacy of RMSIN, we have curated an expansive dataset comprising 17,402 image-caption-mask triplets, which is unparalleled in terms of scale and variety. This dataset not only presents the model with a wide range of spatial and rotational scenarios but also establishes a stringent benchmark for the RRSIS task, ensuring a rigorous evaluation of performance. Experimental evaluations demonstrate the exceptional performance of RMSIN, surpassing existing state-of-the-art models by a significant margin. Datasets and code are available at https://github.com/Lsan2401/RMSIN.