AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation

22 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Few-shot learning, image segmentation, image synthesis, training synthesis
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Text-to-image diffusion models have shown remarkable success in synthesizing photo-realistic images. Apart from creative applications, can we use such models to synthesize samples that aid the few-shot training of discriminative models? In this work, we propose AlignDiff, a general framework for synthesizing training images and associated mask annotations for few-shot segmentation. We identify three levels of misalignment that arise when utilizing pre-trained diffusion models for segmentation tasks. These misalignments must be addressed to create realistic training samples and align the synthetic data distribution with the real training distribution: 1) instance-level misalignment, where generated samples fail to be consistent with the target task (e.g., specific textures or out-of-distribution generation of rare categories); 2) scene-level misalignment, where synthetic samples are object-centric and fail to represent realistic scene layouts with multiple objects; and 3) annotation-level misalignment, where diffusion models are limited to generating images without pixel-level annotations. AlignDiff overcomes these challenges by leveraging a few real samples to guide the generation, improving novel IoU over baseline methods in generalized few-shot semantic segmentation on Pascal-5^i and COCO-20^i by up to 80%. In addition, AlignDiff is capable of augmenting the learning of out-of-distribution categories on FSS-1000, whereas a naive diffusion model generates samples that hurt the training process. The code will be released.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6243