Abstract: Despite the effectiveness of Segment Anything Model (SAM)-based methods in Few-Shot Segmentation (FSS), a closer examination of their prompt encoding mechanism reveals that these methods rely solely on visual information to generate a single type of prompt. As a result, they suffer from semantic-granularity representation bias and a loss of spatial information. To address these limitations, this paper introduces a multi-modal prompt encoder that enables SAM to leverage both annotated reference images and textual descriptions of class names as segmentation prompts. The encoder generates text prompts, dense visual prompts, and sparse visual prompts, spanning multiple modalities and granularities. These prompts provide richer representations of the target class, capturing both abstract semantics and specific details while maintaining an appropriate level of semantic granularity. When our multi-modal prompt encoder is integrated with SAM's image encoder and mask decoder, we refer to the overall model as MM-Prompt. To validate its effectiveness, we conduct extensive empirical studies on the PASCAL-5^i and COCO-20^i datasets. The experimental results demonstrate that MM-Prompt achieves state-of-the-art performance on FSS tasks, highlighting its substantial potential and value in this domain.
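The abstract does not include an implementation, but the described design maps naturally onto SAM's existing sparse/dense prompt interface (SAM's own prompt encoder emits sparse embeddings for points/boxes and a dense embedding for masks). The following minimal PyTorch sketch illustrates how a multi-modal prompt encoder of this kind could produce the three prompt types; all module names, dimensions, and the masked-pooling strategy are illustrative assumptions, not the authors' actual method.

```python
import torch
import torch.nn as nn

class MultiModalPromptEncoder(nn.Module):
    """Hypothetical sketch: encodes a support (reference) image/mask pair
    and a class-name text embedding into the three prompt types the
    abstract describes (text, dense visual, sparse visual)."""

    def __init__(self, dim: int = 256, text_dim: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)   # text prompt (e.g. from a CLIP text encoder)
        self.dense_proj = nn.Conv2d(dim, dim, 1)    # dense visual prompt
        self.sparse_proj = nn.Linear(dim, dim)      # sparse visual prompt

    def forward(self, support_feats, support_mask, text_emb):
        # support_feats: (B, dim, H, W) features of the annotated reference image
        # support_mask:  (B, 1, H, W) float mask at feature resolution
        # text_emb:      (B, text_dim) embedding of the class-name description

        # Text prompt: abstract, class-level semantics from the class name.
        text_prompt = self.text_proj(text_emb)                          # (B, dim)

        # Dense visual prompt: mask-weighted support features preserve
        # the spatial detail of the reference annotation.
        dense_prompt = self.dense_proj(support_feats * support_mask)    # (B, dim, H, W)

        # Sparse visual prompt: masked average pooling yields a compact
        # token carrying instance-specific appearance cues.
        pooled = (support_feats * support_mask).flatten(2).sum(-1) \
                 / support_mask.flatten(2).sum(-1).clamp(min=1e-6)
        sparse_prompt = self.sparse_proj(pooled)                        # (B, dim)

        # Sparse tokens (text + visual) would feed the mask decoder's
        # token path; the dense prompt would be combined with the query
        # image embedding, mirroring SAM's sparse/dense split.
        sparse_tokens = torch.stack([text_prompt, sparse_prompt], dim=1)  # (B, 2, dim)
        return sparse_tokens, dense_prompt
```

Under this reading, integrating the encoder with a frozen SAM image encoder and mask decoder (the configuration the abstract calls MM-Prompt) would only require routing `sparse_tokens` and `dense_prompt` through the decoder in place of SAM's point/box/mask prompt embeddings.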
DOI: 10.1145/3746027.3754967