Step-injection reconstruction guidance for improving single aspect real image editing

ICLR 2026 Conference Submission17442 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Image-to-image editing, Single aspect editing, Latent diffusion model

TL;DR: A step-injection reconstruction function added to Pix2pix-zero to improve single aspect image editing

Abstract: Latent diffusion models have demonstrated the ability to generate realistic images, often from a text prompt. However, in many cases we have a pre-existing real image in which we wish to change just one aspect to obtain the desired outcome, a task often referred to as single aspect image-to-image translation. No existing tool performs this task directly, though people often build a pipeline which: i. generates both an image embedding and a prompt string which together would create an image as close as possible to the original image; ii. manipulates the prompt to change the desired aspect, either by substitution in the prompt string before mapping it to an embedding space or by first mapping it to an embedding space and then manipulating the embedding; and iii. uses the updated prompt embedding and the image embedding with the cross-attention mechanism of a diffusion model in an attempt to generate a new image which changes just one aspect of the original image. However, this type of approach currently often changes multiple aspects of the original image. To overcome this, we propose a new step-injection reconstruction function, applied in the early stages of the denoising process, to provide additional guidance for final image construction. We demonstrate that our approach compares favorably to state-of-the-art results, beating other approaches in terms of the DINO-ViT structure distance metric and arguably producing images which are closer to the original image save for the one aspect change that we desire. We go further to identify shortcomings in two of the most commonly used metrics (CLIP Accuracy and DINO-ViT structure distance) and propose two new metrics which allow for better evaluation and understanding of the results.

Primary Area: generative models

Submission Number: 17442
