Segment Everything Everywhere All at Once

Xueyan Zou; Jianwei Yang; Hao Zhang; Feng Li; Linjie Li; Jianfeng Wang; Lijuan Wang; Jianfeng Gao; Yong Jae Lee

Segment Everything Everywhere All at Once

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee

Published: 21 Sept 2023, Last Modified: 19 Jan 2024NeurIPS 2023 posterEveryoneRevisionsBibTeX

Keywords: Generaic segmentation, interactive segmentation, referring segmentation, multi-modality prompting.

TL;DR: A universal and interactive model for all kinds of segmentation tasks with multi-modality prompts as inputs.

Abstract: In this work, we present SEEM, a promotable and interactive model for segmenting everything everywhere all at once in an image. In SEEM, we propose a novel and versatile decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries including points, boxes, scribbles, and masks, which can further generalize to a different referring image; ii) Compositionality. We learn a joint visual-semantic space between text and visual prompts, which facilitates the dynamic composition of two prompt types required for various segmentation tasks, as shown in Fig. 1; iii) Interactivity. We further incorporate learnable memory prompts into the decoder to retain segmentation history through mask-guided cross-attention from the decoder to image features; iv) Semantic awareness. We use a text encoder to encode text queries and mask labels into the same semantic space for open-vocabulary segmentation. We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks. The results demonstrate that SEEM exhibits robust generalizing to unseen user intents as it learns to compose prompts of different types in a unified representation space. Our approach achieves competitive performance on interactive segmentation, generic segmentation, referring segmentation, and video object segmentation on 9 datasets with minimum 1/100 supervision in a single set of weights.

Supplementary Material: pdf

Submission Number: 838

Loading