VEGA: Visual Expression Guidance for Referring Expression Segmentation

17 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: referring expression segmentation, visual-linguistic guidance set, visual information selection, vision and language
TL;DR: We propose a novel framework for referring expression segmentation that produces a visual expression by flexibly selecting visual information relevant to the target region, enhancing the robustness of the guidance set.
Abstract: Referring expression segmentation aims to segment the target object described by a given linguistic expression in an image. Unlike unimodal segmentation with predefined categories, this task takes a free-form linguistic expression that contains one or more attributes (e.g., location, color, and action) related to the target object. However, the given linguistic expression conveys only partial information about the target object. In contrast, the image contains additional information about the target, including unique cues that are hard to describe linguistically. Motivated by this, we propose VEGA, a novel Visual Expression GuidAnce framework for referring expression segmentation, which enables the network to refer to a visual expression that complements the linguistic expression and improves the guidance capability. Since the image includes information on both target and non-target regions, the network must meticulously identify and selectively extract the information relevant to the target object. We therefore introduce a novel visual information selection module that flexibly selects the semantic visual information related to the target object to produce the visual expression, enhancing adaptability to diverse linguistic and image contexts for robust segmentation. Furthermore, the proposed module allows each token of the visual expression to consider visual contextual information by exploiting global-local linguistic cues, thereby strengthening its capacity to understand the context of the target region. Our method consistently shows strong performance on three public benchmarks for referring expression segmentation, surpassing existing state-of-the-art methods.
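To make the described visual information selection module concrete, the following is a minimal sketch, not the authors' implementation: it assumes learnable query tokens that cross-attend to flattened image features and are gated by global (sentence-level) and local (word-level) linguistic cues to form the visual expression. All module names, shapes, and design choices here are hypothetical.
```python
# Hedged sketch of a visual-information-selection module (assumed design,
# not the VEGA source code): learnable queries cross-attend to image features
# and are gated by global-local linguistic cues to produce visual-expression tokens.
import torch
import torch.nn as nn


class VisualInformationSelection(nn.Module):
    def __init__(self, dim=256, num_tokens=16, num_heads=8):
        super().__init__()
        # Learnable queries that become the visual-expression tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        # Cross-attention: queries attend to visual features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate driven by the selected visual content and a local linguistic cue,
        # softly deciding which visual information each token keeps.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats, word_feats, sent_feat):
        """
        visual_feats: (B, N, D) flattened image features
        word_feats:   (B, L, D) word-level linguistic features
        sent_feat:    (B, D)    global sentence-level linguistic feature
        returns:      (B, K, D) visual-expression tokens
        """
        B = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)        # (B, K, D)
        # Condition the queries on the global sentence cue.
        q = q + sent_feat.unsqueeze(1)
        # Select target-relevant visual information via cross-attention.
        selected, _ = self.cross_attn(q, visual_feats, visual_feats)
        # Local cue: mean-pooled word features; the gate fuses both cues.
        local = word_feats.mean(dim=1, keepdim=True).expand(-1, q.size(1), -1)
        g = self.gate(torch.cat([selected, local], dim=-1))
        visual_expression = self.norm(q + g * selected)
        return visual_expression


if __name__ == "__main__":
    m = VisualInformationSelection()
    v = torch.randn(2, 400, 256)   # e.g. a 20x20 feature map, flattened
    w = torch.randn(2, 12, 256)    # 12 word tokens
    s = torch.randn(2, 256)
    print(m(v, w, s).shape)        # torch.Size([2, 16, 256])
```
In such a design, the resulting visual-expression tokens would be concatenated with the linguistic expression to form the guidance set for the segmentation decoder; the specific fusion used by VEGA is not described in the abstract.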
Supplementary Material: zip
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 903