Abstract: Referring Image Segmentation (RIS) aims to segment the target object (referent) indicated by a given natural language query. Existing works still suffer from mistakenly segmenting non-referent objects, which can be attributed to insufficient comprehension of vision and language. To tackle this problem, we propose a Cross-Modal Interactive Reasoning Network (CMIRNet) to explore the semantic information that consistently exists between vision and language. Specifically, we first devise a novel Text-Guided Multi-Modality Joint Encoder (TGMM-JE), in which the key expression is extracted and the important visual features are encoded under the continuous guidance of the language expression. Then, we design a Cross-Graph Interactive Positioning (CGIP) module to locate the key pixels of the referent in the deepest layer: multi-modality graph data is constructed from the visual and linguistic features, and the important pixels are positioned through cross-graph interaction and intra-graph reasoning. Finally, a novel Cross-Modal Attention Enhanced DEcoder (CMAE-DE) is dedicated to refining the referent mask progressively from coarse to fine, where hybrid cross-modal attentions are explored to enhance the representation of the referent. Extensive ablation studies validate the efficacy of our key modules, and comprehensive experimental results show the superiority of the proposed model over 22 state-of-the-art (SOTA) models.