Fast Imagic: Solving Overfitting in Text-guided Image Editing via Disentangled UNet with Forgetting Mechanism and Unified Vision-Language Optimization

Shiwen Zhang

Fast Imagic: Solving Overfitting in Text-guided Image Editing via Disentangled UNet with Forgetting Mechanism and Unified Vision-Language Optimization

Shiwen Zhang

Published: 10 Oct 2024, Last Modified: 03 Nov 2024UniRepsEveryoneRevisionsBibTeXCC BY 4.0

Track: Proceedings Track

Keywords: Disentanglement, Text-guided Image Editing, Model Merging, vision-language alignment

TL;DR: Fast Imagic speeds up Imagic by 14 times via unified vision-language optimization and completely solves the overfitting of Imagic by disentangled UNet with forgetting mechanism

Abstract: Text-guided image editing on real or synthetic images, given only the original image itself and the target text prompt as inputs, is a very general and challenging task. It requires an editing model to estimate by itself which part of the image should be edited, and then perform either rigid or non-rigid editing while preserving the characteristics of original image. Imagic, the previous SOTA solution to text-guided image editing, suffers from slow optimization speed, and is prone to overfitting since there is only one image given. In this paper, we design a novel text-guided image editing method, Fast Imagic. First, we propose a vision-language joint optimization framework for fast aligning text embedding and UNet with the given image, which is capable of understanding and reconstructing the original image in 30 seconds, much faster and much less overfitting than previous SOTA Imagic. Then we propose a novel vector projection mechanism in text embedding space of Diffusion Models, capable of decomposing the identity similarity and editing strength thus controlling them separately. Finally, we discovered a general disentanglement property of UNet in Diffusion Models, i.e., UNet encoder learns space and structure, UNet decoder learns appearance and texture. With such a property, we design the forgetting mechanism by merging original checkpoint and optimized checkpoint to successfully tackle the fatal and inevitable overfitting issues when fine-tuning Diffusion Models on one image, thus significantly boosting the editing capability of Diffusion Models. Our method, Fast Imagic, even built on the outdated Stable Diffusion, achieves new state-of-the-art results on the challenging text-guided image editing benchmark: TEdBench, surpassing the previous SOTA methods such as Imagic with Imagen, in terms of both CLIP score and LPIPS score.

Submission Number: 27

Loading