Image Segmentation with Vision-Language Models

Published: 01 Jan 2023 · Last Modified: 26 Aug 2024 · CSAI 2023 · CC BY-SA 4.0
Abstract: Image segmentation traditionally relies on predefined object classes, which makes it hard to accommodate new categories or complex queries and often necessitates model retraining. Segmentation that relies solely on visual information depends heavily on annotated samples, and performance declines sharply as the number of unknown classes grows. To address these challenges, this paper introduces ViLaSeg, an image segmentation model that generates binary segmentation maps for query images from either free-text prompts or support images. The model uses text prompts to establish rich contextual relationships, while visual prompts leverage the GroupViT encoder to capture local features of multiple objects, improving segmentation precision. Through selective attention and cross-modal interaction, the model fuses image and text features, which are then refined by a transformer-based decoder designed for dense prediction. ViLaSeg excels across a spectrum of segmentation tasks, including referring expression, zero-shot, and one-shot segmentation, surpassing prior state-of-the-art approaches.
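The abstract outlines a pipeline with three stages: dual encoders (a GroupViT-style image encoder plus a text encoder), attention-based cross-modal fusion, and a transformer decoder that produces a dense binary mask. As a rough illustration of that data flow only, here is a minimal PyTorch sketch; the class names, dimensions, placeholder linear encoders, and layer counts are all assumptions, since the paper does not publish an implementation.

```python
# Minimal sketch of the pipeline described in the abstract.
# All module names, dimensions, and layer counts are illustrative
# assumptions -- this is not the authors' released code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse image patch tokens with text tokens via cross-attention
    (a stand-in for the abstract's selective attention / cross-modal
    interaction)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # Image tokens attend to text tokens, injecting language context.
        fused, _ = self.attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
        return self.norm(img_tokens + fused)


class ViLaSegSketch(nn.Module):
    """Toy stand-in for the described model: the two linear layers are
    placeholders for a pretrained GroupViT image encoder and a text
    encoder (hypothetical feature dimensions 768 and 512)."""

    def __init__(self, dim: int = 256, patches: int = 196):
        super().__init__()
        self.patches = patches
        self.image_encoder = nn.Linear(768, dim)  # placeholder for GroupViT
        self.text_encoder = nn.Linear(512, dim)   # placeholder text encoder
        self.fusion = CrossModalFusion(dim)
        decoder_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(decoder_layer, num_layers=2)
        self.mask_head = nn.Linear(dim, 1)  # per-patch binary logit

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        img_tokens = self.image_encoder(img_feats)    # (B, P, dim)
        txt_tokens = self.text_encoder(txt_feats)     # (B, T, dim)
        fused = self.fusion(img_tokens, txt_tokens)   # cross-modal fusion
        decoded = self.decoder(fused)                 # dense-prediction decoder
        logits = self.mask_head(decoded).squeeze(-1)  # (B, P)
        side = int(self.patches ** 0.5)
        return logits.view(-1, side, side)            # coarse binary mask logits


if __name__ == "__main__":
    model = ViLaSegSketch()
    img = torch.randn(1, 196, 768)  # e.g. 14x14 grid of patch features
    txt = torch.randn(1, 12, 512)   # token embeddings of a text prompt
    print(model(img, txt).shape)    # torch.Size([1, 14, 14])
```

For a support-image (one-shot) query, the same fusion step would consume tokens from the support image's encoder in place of the text tokens; upsampling the coarse logit grid to full image resolution is likewise left out of this sketch.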