Abstract: Referring image segmentation (RIS) aims to segment a particular region based on a specific expression. Existing one-stage methods have explored various fusion strategies, yet they encounter two significant issues. First, most methods rely on manually selected visual features from fixed visual encoder layers, lacking the flexibility to selectively focus on language-preferred visual features. Second, the direct fusion of word-level features into coarsely aligned features disrupts the established vision-language alignment, resulting in suboptimal performance. In this paper, we introduce an innovative framework for RIS that overcomes these challenges through adaptive alignment of vision and language features, termed Adaptive Selection with Dual Alignment (ASDA). ASDA innovates in two respects. First, we design an Adaptive Feature Selection and Fusion (AFSF) module to dynamically select visual features that focus on the different regions referred to by various descriptions. AFSF is equipped with a scale-wise feature aggregator that provides hierarchically coarse features, preserving crucial low-level details and supplying robust features for the subsequent dual alignment. Second, a Word Guided Dual-Branch Aligner (WGDA) integrates the coarse features with linguistic cues via word-guided attention, which effectively addresses the common issue of vision-language misalignment by ensuring that linguistic descriptors interact directly with mask prediction. This guides the model to focus on relevant image regions and to make robust predictions. Extensive experimental results demonstrate that our ASDA framework surpasses state-of-the-art methods on the RefCOCO, RefCOCO+, and G-Ref benchmarks. The improvement underscores not only the superiority of ASDA in capturing fine-grained visual details but also its robustness and adaptability to diverse descriptions.
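To make the adaptive selection idea concrete, the sketch below shows one simple way a pooled expression embedding can weight visual features from different encoder layers instead of relying on a manually fixed layer choice. This is a minimal illustration of the concept, not the authors' AFSF implementation; all dimensions, layer counts, and module names are assumptions.

```python
# Minimal sketch of language-conditioned feature selection over encoder layers.
# Hypothetical dimensions and module names; not the released ASDA/AFSF code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageConditionedSelection(nn.Module):
    def __init__(self, vis_dim=256, lang_dim=768, num_layers=4):
        super().__init__()
        # The sentence embedding predicts one soft weight per visual encoder layer.
        self.gate = nn.Linear(lang_dim, num_layers)
        # Project each layer's features into a shared channel dimension.
        self.proj = nn.ModuleList(nn.Conv2d(vis_dim, vis_dim, 1) for _ in range(num_layers))

    def forward(self, vis_feats, sent_emb):
        # vis_feats: list of num_layers tensors, each (B, vis_dim, H, W) after resizing
        # sent_emb:  (B, lang_dim) pooled expression embedding
        weights = F.softmax(self.gate(sent_emb), dim=-1)   # (B, num_layers)
        fused = 0
        for i, feat in enumerate(vis_feats):
            w = weights[:, i].view(-1, 1, 1, 1)            # per-layer scalar gate
            fused = fused + w * self.proj[i](feat)         # weighted aggregation
        return fused                                       # (B, vis_dim, H, W)
```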
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Referring image segmentation (RIS) predicts pixel-wise masks for objects described in natural language, which is crucial for applications such as human-robot interaction and image editing. The primary challenge is accurately aligning visual content with the referring expression at the pixel level. Our analysis of previous methods highlights two main issues: 1. reliance on static, manually selected visual features; 2. noise introduced by the direct fusion of visual features with language features, which disrupts vision-language alignment. To address these challenges, we introduce the Adaptive Selection with Dual Alignment (ASDA) framework, which includes two key modules: Adaptive Feature Selection and Fusion (AFSF) and the Word Guided Dual-Branch Aligner (WGDA). The AFSF module dynamically selects visual features based on the language input, replacing fixed-layer selection with a responsive mechanism. The WGDA module enhances alignment and interaction between word-level language features and visual features through a dual-branch structure that ensures linguistic descriptors interact directly with mask prediction. We believe our research offers a new perspective on the fusion and alignment of multimodal features and merits further exploration in future work.
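The word-guided interaction described above can be illustrated with a standard cross-attention step in which pixel features attend to word-level features, so that individual words, rather than a single pooled sentence vector, steer the mask prediction. This is a hedged sketch under simplifying assumptions (single-head attention, shared dimension), not the paper's exact WGDA design.

```python
# Sketch of word-guided attention: pixels query word-level language features.
# Hypothetical shapes and names; illustrative only, not the paper's WGDA module.
import torch
import torch.nn as nn

class WordGuidedAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from pixel features
        self.k = nn.Linear(dim, dim)   # keys from word features
        self.v = nn.Linear(dim, dim)   # values from word features
        self.scale = dim ** -0.5
        self.out = nn.Linear(dim, dim)

    def forward(self, pix, words, word_mask=None):
        # pix:       (B, HW, dim) flattened visual features
        # words:     (B, L, dim)  word-level language features
        # word_mask: (B, L) boolean, True at padding positions
        q, k, v = self.q(pix), self.k(words), self.v(words)
        attn = torch.matmul(q, k.transpose(1, 2)) * self.scale     # (B, HW, L)
        if word_mask is not None:
            attn = attn.masked_fill(word_mask[:, None, :], float('-inf'))
        attn = attn.softmax(dim=-1)
        # Each pixel aggregates the words most relevant to it before mask prediction.
        return pix + self.out(torch.matmul(attn, v))               # residual update
```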
Supplementary Material: zip
Submission Number: 619