Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation

Published: 01 May 2025 · Last Modified: 23 Jul 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce Segment Anyword, a training-free visual prompt learning framework with test-time inversion-based adaptation for open-set language-grounded segmentation, in which visual prompts are simultaneously regularized by linguistic structural information.
Abstract: Open-set image segmentation poses a significant challenge because existing methods often demand extensive training or fine-tuning and generally struggle to segment the same object consistently across diverse text reference expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language-grounded segmentation. It relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates, or *mask prompts*, which are then refined into targeted object masks. Initial prompts typically lack coherence and consistency as image-text complexity increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts and significant improvements in segmentation accuracy. The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, the most complex open-set grounded segmentation task in the field.
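To make the mask-prompt step concrete, below is a minimal sketch (not the paper's implementation) of how per-token cross-attention maps might be aggregated into binary mask prompts. It assumes the attention maps have already been collected from a frozen diffusion model (e.g., via attention hooks), and the aggregation, normalization, and threshold choices shown here are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def mask_prompts_from_attention(attn_maps: torch.Tensor,
                                out_size: tuple[int, int] = (512, 512),
                                threshold: float = 0.5) -> torch.Tensor:
    """Turn per-token cross-attention maps into binary mask prompts.

    attn_maps: [num_tokens, T, H, W] cross-attention probabilities for each
               text token, gathered over T denoising steps/layers of a frozen
               diffusion model (the collection itself is assumed to happen
               elsewhere, e.g. via attention hooks).
    Returns:   [num_tokens, out_H, out_W] binary mask prompts.
    """
    # Average over timesteps/layers to suppress step-specific noise.
    maps = attn_maps.mean(dim=1)                                  # [N, H, W]
    # Normalize each token map to [0, 1] so a shared threshold is meaningful.
    flat = maps.flatten(1)
    mins = flat.min(dim=1, keepdim=True).values
    maxs = flat.max(dim=1, keepdim=True).values
    maps = ((flat - mins) / (maxs - mins + 1e-8)).view_as(maps)
    # Upsample to image resolution and threshold into mask prompts.
    maps = F.interpolate(maps.unsqueeze(1), size=out_size,
                         mode="bilinear", align_corners=False).squeeze(1)
    return (maps > threshold).float()

# Toy usage with random "attention maps": 7 tokens, 10 steps, 16x16 latents.
dummy = torch.rand(7, 10, 16, 16)
prompts = mask_prompts_from_attention(dummy)
print(prompts.shape)  # torch.Size([7, 512, 512])
```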
Lay Summary: Segmenting objects in images based on textual descriptions is challenging for computers, especially when those descriptions vary. Different people might describe the same object in different ways. For example, a doctor and a patient, or a child and an adult, might use different terms to refer to the same thing. This variation in expression makes it difficult for current systems to consistently identify the correct object in images. We propose a new approach to make these systems more flexible and robust. We adapt a pre-trained model to learn how to recognize specific concepts described in varied language. We found that the way this model connects words and images internally can serve as a powerful guide for identifying objects. We further incorporate basic linguistic rules, such as how adjectives relate to nouns, to help the system handle noisy or ambiguous guidance more effectively. Our method is training-free, as it works entirely at test time without requiring access to a training or tuning dataset. We demonstrate that our approach achieves stable and accurate performance across a range of segmentation tasks, with promising results in segmenting not only nouns (objects), but also predicates in object-object relationships and human-object interactions.
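The lay summary's point about "how adjectives relate to nouns" can be illustrated with a small dependency-parsing sketch. This is only an approximation of the idea behind the paper's linguistic-guided regularization, not the authors' method: it assumes spaCy with its `en_core_web_sm` model and simply groups each noun with its adjectival, compound, and numeric modifiers, so that their attention maps could be merged into one mask prompt.

```python
import spacy

# Assumes `python -m spacy download en_core_web_sm` has been run.
nlp = spacy.load("en_core_web_sm")

def group_tokens_by_dependency(text: str) -> dict[int, list[int]]:
    """Cluster token indices so modifier tokens attach to their head noun.

    Returns {noun_token_index: [indices of the noun and its modifiers]};
    tokens in one group would have their attention maps merged into a
    single mask prompt for that noun.
    """
    doc = nlp(text)
    # Start one group per noun / proper noun.
    groups = {tok.i: [tok.i] for tok in doc if tok.pos_ in ("NOUN", "PROPN")}
    for tok in doc:
        # Bind adjectival, compound, and numeric modifiers to their head noun.
        if tok.dep_ in ("amod", "compound", "nummod") and tok.head.i in groups:
            groups[tok.head.i].append(tok.i)
    return groups

print(group_tokens_by_dependency("the small brown dog next to a red car"))
# e.g. {3: [3, 1, 2], 8: [8, 7]} -> "dog" bound to "small"/"brown", "car" to "red"
```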
Primary Area: Applications->Computer Vision
Keywords: Visual Prompt Learning, Grounded Segmentation, Multimodal Prompt Learning, Diffusion Model
Submission Number: 3702