TL;DR: We introduce ENM Inversion, a technique that improves text-guided image editing by enhancing noise map editability.
Abstract: Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advances, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing is to invert the source image into editable noise maps associated with the target image. However, previous inversion methods struggle to adhere closely to the target text prompt. This limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for the desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps that enhance editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks, in both content preservation and edit fidelity to the target prompt. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.
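The refinement idea in the abstract can be sketched as a simple optimization over a noise map. The sketch below is a hypothetical illustration only, not the paper's actual algorithm: the function `refine_noise_map`, the weighting `lam`, and the plain gradient-descent loop are all assumptions, standing in for whatever objective the paper uses to trade off reconstruction fidelity (closeness to the reconstructed noise map) against editability (closeness to the edited noise map).

```python
import numpy as np

def refine_noise_map(eps_rec, eps_edit, lam=0.5, lr=0.1, steps=200):
    """Hypothetical sketch of an editable-noise refinement.

    Minimizes lam * ||z - eps_rec||^2 + (1 - lam) * ||z - eps_edit||^2
    by gradient descent, so the refined noise map z balances content
    preservation (eps_rec) against editability (eps_edit).
    """
    z = eps_rec.copy()
    for _ in range(steps):
        # Gradient of the weighted quadratic objective above.
        grad = 2 * lam * (z - eps_rec) + 2 * (1 - lam) * (z - eps_edit)
        z -= lr * grad
    return z

# Toy usage: two random "noise maps" standing in for inverted latents.
rng = np.random.default_rng(0)
eps_rec = rng.standard_normal((4, 4))
eps_edit = rng.standard_normal((4, 4))
z = refine_noise_map(eps_rec, eps_edit, lam=0.5)
```

For this purely quadratic toy objective the minimizer is the convex combination `lam * eps_rec + (1 - lam) * eps_edit`; in practice the actual method would involve the diffusion model's denoising trajectory rather than a closed-form blend.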
Lay Summary: Recent advances in AI have made it possible to create highly realistic and diverse images from text descriptions. But what if we want to edit an existing image — like changing a cat into a dog, or making a beach scene look like a winter landscape — based on a new text prompt? This is tricky, because current methods that reverse an image into a format the model can work with often stick too closely to the original, limiting how much can be changed.
To address this, we propose a new method called Editable Noise Map Inversion (ENM Inversion). This technique finds the right "noise map" (an internal representation in the model) that keeps the image's key details while allowing for flexible edits guided by text. Our approach leads to better results in a wide range of editing tasks, staying true to both the original image and the new prompt. ENM Inversion also works for video editing, helping create smooth, consistent changes across frames.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: diffusion models, image editing, diffusion inversion
Submission Number: 13199