Rethinking the Implicit Optimization Paradigm with Dual Alignments for Referring Remote Sensing Image Segmentation
Abstract: Referring Remote Sensing Image Segmentation (RRSIS) is a challenging task that aims to identify specific regions in aerial images that are relevant to given textual conditions. Existing methods tend to adopt an implicit optimization paradigm: a framework that combines early cross-modal feature fusion with a fixed convolutional kernel-based predictor, which neglects the inherent inter-domain gap and produces class-agnostic predictions. In this paper, we rethink the issues with the implicit optimization paradigm and address the RRSIS task from a dual-alignment perspective. Specifically, we propose the dedicated Dual Alignment Network (DANet), which comprises an explicit alignment strategy and a reliable agent alignment module. The explicit alignment strategy reduces domain discrepancies by narrowing the inter-domain affinity distribution. Meanwhile, the reliable agent alignment module aims to enhance the predictor's multi-modality awareness and alleviate the impact of deceptive noise interference. Extensive experiments on two remote sensing datasets demonstrate that DANet achieves segmentation performance superior to state-of-the-art methods without introducing additional learnable parameters.
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: Referring remote sensing image segmentation (RRSIS) aims to identify specific regions in aerial images relevant to given text conditions. It involves both vision and language modalities and is an extremely challenging task. In this paper, we rethink the issues with the implicit optimization paradigm and address the RRSIS task from a dual-alignment perspective. Our work focuses on multimodal alignment in the important domain of remote sensing images to achieve precise segmentation, which aligns with the topic scope of ACM MM.
Supplementary Material: zip
Submission Number: 3363