Abstract: Few-shot semantic segmentation (FSS) aims to locate pixels of unseen classes using clues from only a few labeled samples. Recently, thanks to their rich prior knowledge, diffusion models have been extended to FSS tasks. However, owing to their probabilistic noising and denoising processes, it is difficult for them to maintain spatial relationships between inputs and outputs, leading to inaccurate segmentation masks. To address this issue, we propose a Diffusion-based Segmentation network (DiffSeg), which decouples the probabilistic denoising process from segmentation. Specifically, DiffSeg leverages attention maps extracted from a pretrained diffusion model as support-query interaction information to guide segmentation, which mitigates the impact of the probabilistic process while still benefiting from the rich prior knowledge of diffusion models. In the segmentation stage, we present a Perceptual Attention Module (PAM), in which two cross-attention mechanisms capture the semantic information of support-query interaction and the spatial information produced by the pretrained diffusion model. A further self-attention mechanism within PAM balances the dependence on these two sources, preventing inconsistencies between the semantic and spatial information. Additionally, considering the uncertainty inherent in the generative process of diffusion models, we equip DiffSeg with a Spatial Control Module (SCM), which models the spatial structure of query images to control the boundaries of attention maps, thereby aligning the knowledge representation with the spatial layout of query images. Experiments on the PASCAL-5$^i$ and COCO datasets show that DiffSeg achieves new state-of-the-art performance with remarkable advantages.
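To make the PAM design concrete, the following is a minimal sketch, assuming flattened token features and standard multi-head attention; the class name, dimensions, and wiring are illustrative assumptions, not the authors' implementation. It shows two cross-attentions (query features attending to support-query semantic features and to diffusion-derived spatial features) followed by a self-attention that balances the two sources.

```python
# Hedged sketch of a Perceptual-Attention-style module (names/dims are assumptions).
import torch
import torch.nn as nn


class PerceptualAttentionModule(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-attention 1: query features attend to support-query semantic features.
        self.semantic_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention 2: query features attend to spatial features derived
        # from the pretrained diffusion model's attention maps.
        self.spatial_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention that balances the dependence on both information sources.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feat, semantic_feat, spatial_feat):
        # query_feat:    (B, N_q, dim) flattened query-image features
        # semantic_feat: (B, N_s, dim) support-query interaction features
        # spatial_feat:  (B, N_d, dim) features from diffusion attention maps
        sem, _ = self.semantic_cross_attn(query_feat, semantic_feat, semantic_feat)
        spa, _ = self.spatial_cross_attn(query_feat, spatial_feat, spatial_feat)
        fused = self.norm(query_feat + sem + spa)
        out, _ = self.self_attn(fused, fused, fused)
        return self.norm(fused + out)


if __name__ == "__main__":
    pam = PerceptualAttentionModule()
    q = torch.randn(2, 1024, 256)   # e.g. 32x32 query tokens
    s = torch.randn(2, 1024, 256)
    d = torch.randn(2, 1024, 256)
    print(pam(q, s, d).shape)       # torch.Size([2, 1024, 256])
```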
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: In multimedia processing, new data often arrives in very small amounts. Focusing on visual data, this work provides a flexible and efficient way to understand and analyze new images: it can quickly adapt to new classes or objects within multimedia content, even when only a few labeled examples are available.
In this work, we extract the prior knowledge of a pretrained diffusion model for perceptual attention, which decouples the probabilistic processes from segmentation and thus avoids uncertain results while exploiting rich prior knowledge. In addition, considering the probabilistic generation of diffusion models, we present a Spatial Control Module that aligns the semantic boundaries between the extracted attention maps and the query images.
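As a rough illustration of the boundary-alignment idea, the sketch below modulates an attention map with edge cues from the query image. The Sobel-based edge extractor and the sigmoid gate are illustrative assumptions under this sketch, not the paper's exact module.

```python
# Hedged sketch of spatial/boundary control (edge extractor and gate are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialControlModule(nn.Module):
    def __init__(self):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        # Fixed Sobel kernels applied to a grayscale version of the query image.
        self.register_buffer("kx", sobel_x.view(1, 1, 3, 3))
        self.register_buffer("ky", sobel_x.t().view(1, 1, 3, 3))
        # Learnable gate turning edge strength into a per-pixel modulation.
        self.gate = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, query_img, attn_map):
        # query_img: (B, 3, H, W) RGB query image; attn_map: (B, 1, H, W) attention map.
        gray = query_img.mean(dim=1, keepdim=True)
        gx = F.conv2d(gray, self.kx, padding=1)
        gy = F.conv2d(gray, self.ky, padding=1)
        edges = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)
        # Sharpen the attention map around query-image boundaries.
        return attn_map * (1.0 + self.gate(edges))
```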
Supplementary Material: zip
Submission Number: 3481