Keywords: diffusion models, text-to-image synthesis, image editing
TL;DR: CycleDiffusion enables stochastic diffusion models to perform text-guided editing of real images
Abstract: Recent text-to-image diffusion models trained on large-scale data achieve remarkable performance on text-conditioned image synthesis (e.g., GLIDE, DALL∙E 2, Imagen, Stable Diffusion). This paper introduces a simple method that uses stochastic text-to-image diffusion models as zero-shot image editors. Our method, CycleDiffusion, is based on the finding that when all random variables (the "random seed") are fixed, two similar text prompts produce similar images. The core of our idea is to infer the random variables that are likely to generate a source image conditioned on a source text. With the inferred random variables, the text-to-image diffusion model then generates a target image conditioned on a target text. Our experiments show that CycleDiffusion outperforms SDEdit and the ODE-based DDIB method, and it can be further improved by Cross Attention Control. Demo: https://huggingface.co/spaces/ChenWu98/Stable-CycleDiffusion.
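The abstract's core idea can be illustrated with a toy sketch: each DDPM reverse step has the form x_{t-1} = mu(x_t, t, c) + sigma_t * z_t, so given a source trajectory and the source condition one can solve for the noise variables z_t and replay them under a target condition. The snippet below is a minimal NumPy illustration under invented assumptions (a toy linear mean predictor `mu`, a fixed noise scale, and 2-D "prompt embeddings"); it is not the paper's implementation or a real diffusion model.

```python
import numpy as np

# Toy stand-in for a conditional denoiser's mean predictor mu(x_t, t, c).
# Everything here (mu, SIGMA, the 2-D "embeddings") is illustrative only.
def mu(x, t, c):
    # Pull the sample toward the condition vector c at each step.
    return 0.9 * x + 0.1 * c

SIGMA = 0.1  # fixed per-step noise scale for this toy chain

def sample(x_T, c, zs):
    """Run the reverse chain with a fixed list of noise variables zs."""
    x = x_T
    for z in zs:
        x = mu(x, None, c) + SIGMA * z
    return x

def infer_noises(traj, c):
    """Invert the update rule to recover the z_t that produced traj."""
    zs = []
    for x_t, x_prev in zip(traj[:-1], traj[1:]):
        zs.append((x_prev - mu(x_t, None, c)) / SIGMA)
    return zs

rng = np.random.default_rng(0)
c_src = np.array([1.0, 0.0])
c_tgt = np.array([0.9, 0.1])  # a "similar" target prompt embedding

# Simulate a source trajectory (standing in for encoding a real source image).
x = rng.normal(size=2)
traj = [x]
for _ in range(10):
    x = mu(x, None, c_src) + SIGMA * rng.normal(size=2)
    traj.append(x)

zs = infer_noises(traj, c_src)       # infer the "random seed"
x_tgt = sample(traj[0], c_tgt, zs)   # replay it under the target prompt
# With similar prompts and the same noise, the edit stays close to the source.
print(np.linalg.norm(x_tgt - traj[-1]))
```

Replaying the inferred noises with the source condition reconstructs the source trajectory exactly, while a nearby target condition yields a nearby output, mirroring the "fixed random seed, similar prompts, similar images" observation.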
Student Paper: Yes