DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models

Gwanghyun Kim; Jong Chul Ye

29 Sept 2021 (modified: 22 Jun 2025). ICLR 2022 Conference Withdrawn Submission. Readers: Everyone

Keywords: Diffusion models, CLIP, Image manipulation, Image to image translation

Abstract: Diffusion models are recent generative models that have achieved state-of-the-art performance in image generation. However, few studies have explored image manipulation with diffusion models. Here, we present DiffusionCLIP, a novel method that performs text-driven image manipulation with diffusion models using a Contrastive Language–Image Pre-training (CLIP) loss. Our method performs comparably to modern GAN-based image processing methods on in-domain and out-of-domain image processing tasks, with the advantage of almost perfect inversion even without additional encoders or optimization. Furthermore, our method can be easily applied to various novel applications, such as image translation from one unseen domain to another unseen domain and stroke-conditioned image generation in an unseen domain. Finally, we present a novel multiple-attribute control scheme with DiffusionCLIP that combines multiple fine-tuned diffusion models.

One-sentence Summary: We present DiffusionCLIP, which performs text-driven image manipulation with diffusion models using CLIP, with the advantage of almost perfect inversion.

Community Implementations: [1 code implementation (CatalyzeX)](https://www.catalyzex.com/paper/diffusionclip-text-guided-image-manipulation/code)
