Visual Expression for Referring Expression Segmentation

ACL ARR 2024 June Submission 2066 Authors

15 Jun 2024 (modified: 05 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Referring expression segmentation aims to precisely segment a target object in an image by referring to a given linguistic expression. Since the network makes its prediction based on reference information that guides it toward the regions to attend to, the capacity of this guidance information has a significant impact on the segmentation result. However, most existing methods rely on linguistic context-based tokens as the guidance elements, which are limited in conveying a visual understanding of the fine-grained target regions. To address this issue, we propose MERES, a novel Multi-Expression Guidance framework for Referring Expression Segmentation, which enables the network to refer to visual expression tokens as well as linguistic expression tokens, complementing the linguistic guidance capacity by effectively providing the visual context of the fine-grained target regions. To produce semantic visual expression tokens, we introduce a visual expression extractor that adaptively selects visual information relevant to the target regions from the image context, allowing the visual expression to capture richer visual context. The proposed module strengthens adaptability to diverse image and language inputs and improves visual understanding of the target regions. Our method consistently shows strong performance on three public benchmarks, surpassing existing state-of-the-art methods.
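The abstract does not specify the architecture of the visual expression extractor, so the following is only a minimal illustrative sketch of how such a module could be realized: learnable queries are conditioned on the linguistic expression tokens and then cross-attend over flattened image features to pool target-relevant visual context into visual expression tokens, which could guide segmentation alongside the linguistic tokens. All class, parameter, and tensor names here are hypothetical and not taken from the paper.

```python
# Illustrative sketch only; the actual MERES design may differ.
import torch
import torch.nn as nn


class VisualExpressionExtractor(nn.Module):
    """Pools target-relevant visual context from image features into
    visual expression tokens, conditioned on the linguistic expression."""

    def __init__(self, dim: int = 256, num_tokens: int = 16, num_heads: int = 8):
        super().__init__()
        # Learnable queries that will be specialized into visual expression tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        # Condition the queries on the linguistic expression tokens.
        self.lang_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gather fine-grained visual context of the target regions from the image.
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor, lang_tokens: torch.Tensor) -> torch.Tensor:
        # img_feats:   (B, H*W, dim) flattened image features
        # lang_tokens: (B, L, dim)   linguistic expression tokens
        B = img_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Make the queries aware of which object the expression refers to.
        q = self.norm1(q + self.lang_attn(q, lang_tokens, lang_tokens)[0])
        # Adaptively select visual context relevant to the target regions.
        vis_expr = self.norm2(q + self.vis_attn(q, img_feats, img_feats)[0])
        return vis_expr  # (B, num_tokens, dim) visual expression tokens


if __name__ == "__main__":
    extractor = VisualExpressionExtractor()
    img = torch.randn(2, 32 * 32, 256)   # flattened image features
    lang = torch.randn(2, 20, 256)       # linguistic expression tokens
    vis_expr = extractor(img, lang)
    # Both token sets would then jointly guide the segmentation decoder.
    guidance = torch.cat([lang, vis_expr], dim=1)
    print(vis_expr.shape, guidance.shape)
```

In this sketch, concatenating the linguistic and visual expression tokens is one simple way a decoder could attend to both guidance sources; the paper's actual fusion mechanism is not described on this page.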
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, model architectures, cross-modal information extraction, segmentation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2066