One-shot In-context Part Segmentation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: In this paper, we present the One-shot In-context Part Segmentation (OIParts) framework, designed to tackle the challenges of part segmentation by leveraging visual foundation models (VFMs). Existing training-based one-shot part segmentation methods that utilize VFMs struggle when the one-shot image and the test image differ significantly in appearance and perspective, or when the object in the test image is only partially visible. We argue that training on the one-shot example often leads to overfitting, compromising the model's generalization capability. Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient, requiring only a single in-context example for precise segmentation with superior generalization ability. By thoroughly exploring the complementary strengths of VFMs, specifically DINOv2 and Stable Diffusion, we introduce an adaptive channel selection approach that minimizes intra-class distance to better exploit these two kinds of features, thereby enhancing the discriminatory power of the extracted features for fine-grained parts. We achieve strong segmentation performance across diverse object categories. The OIParts framework not only eliminates the need for extensive labeled data but also demonstrates superior generalization ability. Through comprehensive experiments on three benchmark datasets, we demonstrate the superiority of our proposed method over existing part segmentation approaches in one-shot settings.
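To make the channel selection idea concrete, below is a minimal sketch (not the authors' code) of one way such an approach could look: per-pixel features from the reference image (e.g., concatenated DINOv2 and Stable Diffusion features, extracted elsewhere) are scored by their within-part variance, the channels with the smallest intra-class distance are kept, and test pixels are then labeled by cosine similarity to part prototypes over those channels. Tensor names and the `keep_ratio` parameter are illustrative assumptions, not details from the paper.

```python
import torch


def select_channels(ref_feats: torch.Tensor, part_mask: torch.Tensor,
                    keep_ratio: float = 0.5) -> torch.Tensor:
    """ref_feats: (N, C) per-pixel features of the one-shot reference image.
    part_mask: (N,) integer part labels (-1 = background / ignore).
    Returns indices of channels with the lowest summed intra-class variance."""
    C = ref_feats.shape[1]
    intra = torch.zeros(C, device=ref_feats.device)
    for p in part_mask.unique():
        if p < 0:
            continue
        feats_p = ref_feats[part_mask == p]           # pixels belonging to one part
        intra += feats_p.var(dim=0, unbiased=False)   # per-channel spread within the part
    k = max(1, int(keep_ratio * C))
    return intra.topk(k, largest=False).indices       # smallest intra-class distance


def segment(test_feats: torch.Tensor, ref_feats: torch.Tensor,
            part_mask: torch.Tensor, channels: torch.Tensor) -> torch.Tensor:
    """Assign each test pixel to the part whose prototype is most similar (cosine)."""
    parts = [p for p in part_mask.unique() if p >= 0]
    protos = torch.stack([ref_feats[part_mask == p][:, channels].mean(0) for p in parts])
    sims = torch.nn.functional.cosine_similarity(
        test_feats[:, None, channels], protos[None, :, :], dim=-1)  # (N_test, num_parts)
    return torch.stack(parts)[sims.argmax(dim=1)]
```

This sketch only illustrates the general recipe (training-free matching over a selected channel subset); the actual OIParts selection criterion and feature fusion may differ.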
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: The paper presents a novel approach that taps into the rich, complementary features extracted from visual foundation models (VFMs) for a downstream task without any additional training. To strengthen this paradigm, we introduce several components within the framework that work in concert with the foundation models to unlock their potential for part segmentation. By introducing this paradigm that leverages VFMs for part segmentation, the work contributes to the broader field of multimedia/multimodal processing: it paves the way for future research on applying VFMs to other downstream tasks and opens up new possibilities for multimedia applications that require accurate and efficient part segmentation.
Submission Number: 2102