Abstract: Advances in generative models for images have inspired various training techniques for image recognition that use synthetic images. In semantic segmentation, one promising approach is to extract pseudo-masks from attention maps in text-to-image diffusion models, which enables training without real images or annotations. However, the pioneering training methods that use diffusion-synthetic images and pseudo-masks, such as DiffuMask, have limitations in mask quality, scalability, and the range of applicable domains. To address these limitations, we propose a new framework that views diffusion-synthetic semantic segmentation training as a weakly supervised learning problem, in which we use potentially inaccurate attention information within the generative model as supervision. Motivated by this perspective, we first introduce reliability-aware robust training, originally used as a classifier-based weakly supervised semantic segmentation (WSSS) method, modified to handle generative attention. Additionally, we propose techniques to boost weakly supervised synthetic training: we introduce prompt augmentation by synonym-and-hyponym replacement, a data augmentation applied to the prompt text set that scales up and diversifies training images with limited text resources. Finally, LoRA-based adaptation of Stable Diffusion enables transfer to a distant domain, e.g., autonomous-driving images. Experiments on PASCAL VOC, ImageNet-S, and Cityscapes show that our method effectively closes the gap between real and synthetic training in semantic segmentation. Our code will be available at https://github.com/yahoojapan/attn2mask.
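As a rough illustration of prompt augmentation by synonym-and-hyponym replacement, the following minimal sketch expands a single prompt into several variants. The replacement table `REPLACEMENTS` and the helper `augment_prompt` are hypothetical names introduced here for illustration; the word lists are made up and are not taken from the paper or its code.

```python
import itertools

# Hypothetical replacement table: class name -> synonyms and hyponyms.
# These entries are illustrative assumptions, not the paper's word lists.
REPLACEMENTS = {
    "dog": ["puppy", "hound", "poodle"],
    "car": ["automobile", "sedan", "taxi"],
}


def augment_prompt(prompt, replacements=REPLACEMENTS):
    """Return the original prompt plus every variant obtained by swapping
    class words for their listed synonyms or hyponyms."""
    words = prompt.split()
    # Each word may stay as-is or be replaced by one of its alternatives.
    options = [[w] + replacements.get(w, []) for w in words]
    # The Cartesian product enumerates all combinations; the first one
    # is always the unmodified prompt.
    return [" ".join(combo) for combo in itertools.product(*options)]


prompts = augment_prompt("a photo of a dog near a car")
for p in prompts[:4]:
    print(p)
```

In a full pipeline, one would presumably keep track of which original class each replaced word stands for, so that the pseudo-mask extracted for "poodle" is still labeled as the dog class; that bookkeeping is omitted here.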