Text-to-image diffusion models have shown remarkable success in synthesizing photo-realistic images. Beyond creative applications, can such models synthesize samples that aid the few-shot training of discriminative models? In this work, we propose AlignDiff, a general framework for synthesizing training images and associated mask annotations for few-shot segmentation. We identify three levels of misalignment that arise when applying pre-trained diffusion models to segmentation tasks. These misalignments must be addressed to create realistic training samples and to align the synthetic data distribution with the real training distribution: 1) instance-level misalignment, where generated samples are inconsistent with the target task (e.g., incorrect textures or out-of-distribution generation of rare categories); 2) scene-level misalignment, where synthetic samples are object-centric and fail to represent realistic scene layouts with multiple objects; and 3) annotation-level misalignment, where diffusion models generate images without pixel-level annotations. AlignDiff overcomes these challenges by leveraging a few real samples to guide the generation, improving novel IoU over baseline methods in generalized few-shot semantic segmentation on Pascal-5i and COCO-20i by up to 80%. In addition, AlignDiff can augment the learning of out-of-distribution categories on FSS-1000, whereas a naive diffusion model generates samples that hurt the training process. The code will be released.