MMSeg: Multi-Modal and Multi-View Driven Semantic Enrichment for Training-Free Image Prompt Segmentation
Keywords: Image Segmentation, Training-Free, Semantic Enrichment, Multi-Modal
Abstract: The rapid development of vision foundation models has fueled interest in training-free image segmentation with image prompts. Current methods typically use a single reference image and its corresponding mask, relying on high-level feature similarity to generate point prompts for subsequent segmentation. However, these approaches suffer from inaccurate target localization and suboptimal mask quality. To address these limitations, we propose MMSeg, a training-free Multi-Modal and Multi-View image prompt Segmentation framework. MMSeg enriches semantic information by diversifying references through two key components: visual localization augmented by a diffusion prior and multi-view cues, and text-driven localization from generated pseudo-labels. By leveraging segmentation consistency across multi-view images and the complementary strengths of multi-modal cues, these modules facilitate precise target localization. Furthermore, a consensus-oriented mask proposer is devised to filter and refine mask proposals. Experimental results demonstrate the competitive performance of MMSeg, which achieves 95.1\% mIoU on the PerSeg dataset, 87.4\% on the FSS dataset, and 52.8\% on the $\text{COCO}\mbox{-}20^{i}$ dataset.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1497