Keywords: Multi-expression guidance framework, Target-oriented visual expression, Vision-Language representation, Transformer-based Segmentation
TL;DR: We propose a novel Multi-Expression guidance framework for Transformer-based Referring Image Segmentation that introduces visual expression tokens as guidance elements alongside linguistic expression tokens.
Abstract: Referring image segmentation (RIS) aims to precisely segment a target object described by a linguistic expression. Recent RIS methods employ Transformer-based networks that use vision features as queries and linguistic expression features as keys and values, locating target regions by referring to the given linguistic information. Because a Transformer-based network makes predictions based on guidance information that tells it which regions to attend to, the capacity of this guidance has a significant impact on segmentation results in Transformer-based RIS. However, existing methods rely only on linguistic tokens as guidance elements, which limits their ability to provide visual understanding of fine-grained target regions. To address this issue, we present METRIS, a novel Multi-Expression guidance framework for Transformer-based Referring Image Segmentation, which allows the network to refer to visual expression tokens as guidance information alongside linguistic expression tokens. The introduced visual expression complements the linguistic guidance by providing target-informative visual context. To generate the visual expression from vision features, we introduce a visual expression extractor designed to provide target-informative visual guidance and to acquire rich contextual information. This module strengthens adaptability to diverse image and language inputs and improves visual understanding of fine-grained target regions. Extensive experiments demonstrate the effectiveness of our approach on both commonly used RIS settings and generalizability evaluation settings. Our method consistently shows strong performance on three public RIS benchmarks.
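For intuition, below is a minimal PyTorch sketch of the guidance mechanism the abstract describes: vision features act as queries, and a guidance set formed by concatenating linguistic expression tokens with extracted visual expression tokens acts as key-value. The module names (`VisualExpressionExtractor`, `MultiExpressionGuidance`), dimensions, token counts, and the language-conditioned pooling are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only; the extractor design and all hyperparameters
# here are assumptions, not the architecture proposed in the paper.
import torch
import torch.nn as nn


class VisualExpressionExtractor(nn.Module):
    """Hypothetical extractor: pools vision features into a small set of
    target-informative visual expression tokens, conditioned on language."""

    def __init__(self, dim: int, num_tokens: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, vision_feats, lang_feats):
        # Condition the learnable queries on a sentence-level language cue
        # (mean-pooled word tokens), then attend over spatial vision features.
        b = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1) \
            + lang_feats.mean(dim=1, keepdim=True)
        visual_expr, _ = self.attn(q, vision_feats, vision_feats)
        return visual_expr  # (B, num_tokens, dim)


class MultiExpressionGuidance(nn.Module):
    """One cross-attention step: vision features are queries; the guidance
    set (linguistic tokens + visual expression tokens) is the key-value."""

    def __init__(self, dim: int):
        super().__init__()
        self.extractor = VisualExpressionExtractor(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, vision_feats, lang_feats):
        visual_expr = self.extractor(vision_feats, lang_feats)
        # Multi-expression guidance set: linguistic + visual expression tokens.
        guidance = torch.cat([lang_feats, visual_expr], dim=1)
        out, _ = self.cross_attn(vision_feats, guidance, guidance)
        return out


# Toy shapes: an 8x8 feature map flattened to 64 vision tokens, 12 word tokens.
vision = torch.randn(2, 64, 256)
language = torch.randn(2, 12, 256)
print(MultiExpressionGuidance(256)(vision, language).shape)  # torch.Size([2, 64, 256])
```

The key design point, as described above, is that the guidance set is no longer purely linguistic: the concatenated visual expression tokens let the decoder attend to target-informative visual context that word tokens alone cannot supply.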
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2690