Prompting to Adapt Foundational Segmentation Models

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Foundational segmentation models, predominantly trained on scenes typical of natural environments, struggle to generalize across varied image domains. Traditional "training-to-adapt" methods rely heavily on extensive retraining and modifications to model architectures, which significantly limits the models' generalization capabilities and efficiency in deployment. In this study, we propose a novel adaptation paradigm, termed "prompting-to-adapt", to tackle the above issue by introducing an innovative image prompter. This prompter generates domain-specific prompts from few-shot image-mask pairs, incorporating diverse image processing techniques to enhance adaptability. To tackle the inherent non-differentiability of image prompts, we further devise an information-estimation-based gradient descent strategy that leverages the information entropy of image processing combinations to optimize the prompter, ensuring effective adaptation. Through extensive experiments across nine datasets spanning seven image domains (\emph{i.e.}, depth, thermal, camouflage, endoscopic, ultrasound, grayscale, and natural) and four scenarios (\emph{i.e.}, common scenes, camouflage objects, medical images, and industrial data), we demonstrate that our approach significantly improves the foundational models' adaptation capabilities. Moreover, the interpretability of the generated prompts provides insight into their image processing mechanisms. Our source code will be publicly available to foster further innovation and exploration in this field.
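To make the "prompting-to-adapt" idea concrete: an image prompt here can be read as a composition of classic image-processing operations applied to the input before the frozen foundation model sees it. The sketch below is a minimal illustration under that reading; the operation set, function names, and `apply_prompt` helper are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch: an "image prompt" as an ordered composition of
# non-differentiable image-processing operations (hypothetical op set).
import numpy as np

def gamma_correct(img, g=0.8):
    # Brighten/darken via a power-law curve on [0, 1] intensities.
    return np.clip(img, 0.0, 1.0) ** g

def contrast_stretch(img):
    # Rescale intensities to span the full [0, 1] range.
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-8)

def box_blur(img, k=3):
    # Simple k-by-k mean filter with edge padding (single-channel image).
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

OPS = {"gamma": gamma_correct, "stretch": contrast_stretch, "blur": box_blur}

def apply_prompt(img, prompt):
    """A prompt is a list of op names applied in order. The ops are
    non-differentiable, which is why gradient-free / entropy-guided
    optimization is needed instead of ordinary backpropagation."""
    for name in prompt:
        img = OPS[name](img)
    return img
```

A prompted image would then be passed to the frozen segmentation model in place of the raw input, so adaptation changes only the preprocessing, not the model weights.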
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work contributes to multimedia/multimodal processing by introducing a novel "prompting-to-adapt" paradigm that enhances the adaptability and generalization of foundational segmentation models across diverse image domains. The innovative image prompter, capable of generating domain-specific prompts from few-shot image-mask pairs, integrates a variety of image processing techniques. This integration allows the models to effectively handle different types of data, such as depth, thermal, camouflage, endoscopic, ultrasound, grayscale, and natural images, thereby improving their performance in various scenarios including common scenes, camouflage objects, medical images, and industrial data. Moreover, the proposed information-estimation-based gradient descent strategy addresses the non-differentiability of image prompts, which is a common challenge in multimodal processing. By leveraging the information entropy of image processing combinations, the strategy optimizes the prompter and ensures effective adaptation without the need for extensive data retraining or model architecture modifications. The enhanced adaptability and generalization capabilities provided by this approach are crucial for multimedia/multimodal processing, as they allow models to perform well on data that differs significantly from the data on which they were originally trained. The interpretability of the generated prompts also offers valuable insights, promoting a better understanding of how models process and adapt to different modalities of data.
Submission Number: 768
