Dual Context Perception Transformer for Referring Image Segmentation

Yuqiu Kong, Junhua Liu, Cuili Yao

Published: 2024, Last Modified: 28 Feb 2026, Venue: PRCV (5) 2024, License: CC BY-SA 4.0

Abstract: Referring image segmentation segments target objects in an image according to language expressions. Existing methods mainly focus on integrating multi-modal features with attention mechanisms. However, most of them favor the features of a single modality during the fusion stage and fall short in exploring cross-modal contextual information, which is critical for accurately localizing target regions. To this end, we propose a novel architecture named Dual Context Perception Transformer (DCPformer), which considers both visual and linguistic contextual information during the fusion and reasoning stages. Specifically, a Cross-modal Context-aware Perception Module (CCPM) is designed to model cross-modal alignment in a unified visual-linguistic representation space. Furthermore, we propose an Information Feedback Module (IFM) that generates a rectification mask from deep-scale features and uses it to filter out signals unrelated to the target object in the features of shallower scales. Extensive experiments show that the proposed DCPformer achieves state-of-the-art performance compared with existing methods on three challenging benchmarks.
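The feedback idea behind the IFM can be illustrated with a minimal sketch. The abstract does not say how the rectification mask is computed or applied, so everything below is an assumption made for illustration: the mask is taken to be a sigmoid over a per-pixel linear projection of the deep features, it is resized with nearest-neighbour upsampling, and it gates the shallow features by elementwise multiplication. The function names (`rectification_mask`, `upsample_nearest`, `filter_shallow`) and the `weights` parameter are hypothetical, and this is not the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rectification_mask(deep, weights, bias=0.0):
    """Build an H x W soft mask from deep features of shape C x H x W.

    Assumption: the mask is a sigmoid over a per-pixel weighted sum of
    channels (a stand-in for a learned 1x1 projection).
    """
    channels = len(deep)
    height, width = len(deep[0]), len(deep[0][0])
    return [
        [
            sigmoid(sum(weights[c] * deep[c][i][j] for c in range(channels)) + bias)
            for j in range(width)
        ]
        for i in range(height)
    ]


def upsample_nearest(mask, factor):
    """Nearest-neighbour upsampling of an H x W mask by an integer factor."""
    return [
        [mask[i // factor][j // factor] for j in range(len(mask[0]) * factor)]
        for i in range(len(mask) * factor)
    ]


def filter_shallow(shallow, mask):
    """Gate shallow features (C x H x W) with an H x W mask, elementwise."""
    return [
        [[v * m for v, m in zip(row, mask_row)] for row, mask_row in zip(channel, mask)]
        for channel in shallow
    ]


# Toy example: one deep channel at 1 x 2 resolution, one shallow channel at 2 x 4.
deep = [[[10.0, -10.0]]]            # left location looks like target, right does not
shallow = [[[1.0] * 4, [1.0] * 4]]  # shallow features, uniformly active
mask = upsample_nearest(rectification_mask(deep, weights=[1.0]), factor=2)
filtered = filter_shallow(shallow, mask)
```

In this toy case the left half of the shallow map is kept almost unchanged while the right half is suppressed towards zero, which is the kind of filtering of target-unrelated signals the abstract describes. A real model would presumably learn the projection and operate on tensors rather than nested lists.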

External IDs: dblp:conf/prcv/KongLY24