Keywords: diffusion models, image editing, scene-centric editing, feedback, Pareto optimization
TL;DR: A feedback-driven diffusion framework that adaptively adjusts conditioning layer by layer to balance structural preservation and semantic alignment.
Abstract: Although text-guided image editing (TIE) has advanced rapidly, most prior works remain object-centric and rely on attention maps or masks to localize and modify specific objects. In this paper, we propose Editing via Dynamic Interactive Tuning (EDIF), a method that adaptively trades off source-image structure and instruction fidelity in challenging scene-centric editing settings. Unlike object editing, scene-centric editing is difficult because the target cannot be clearly localized and edits must preserve global structure. Whereas existing TIE systems typically apply a unified conditioning signal and ignore block-wise variation in the model's internal behavior, we show that, inside the model, the source-image condition and the text-prompt embedding act with layer-dependent directions and strengths. We also demonstrate, both empirically and theoretically, that the editing state can be diagnosed from the source-image signal-to-noise ratio and VLM logits, which indicate whether the edited image faithfully reflects the intended editing prompt. By constructing a Pareto line between these two objectives, EDIF adaptively modulates the source-image and editing-text conditions, guiding each denoising step to stay close to this line for balanced optimization. Extensive experiments on ImgEdit, EmuEdit-Bench, and Places365 show that EDIF achieves state-of-the-art performance across diverse scene-editing scenarios, including indoor and outdoor environments.
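The abstract describes a per-step feedback loop: read out a structure-preservation signal and a semantic-alignment signal, then re-weight the source-image and editing-text conditions before the next denoising step. The sketch below illustrates that control loop only; it is not the authors' implementation, and every argument (the denoise callable, the two score functions, the weights and learning rate) is a hypothetical placeholder for the components named in the abstract.

```python
# Minimal conceptual sketch of a feedback-driven denoising loop that
# re-balances source-image vs. editing-text conditioning at each step.
# All callables passed in are hypothetical placeholders.

def edit_with_feedback(x, timesteps, denoise, structure_score, align_score,
                       w_img=1.0, w_txt=1.0, lr=0.1, target_ratio=1.0):
    """Denoise `x` while nudging the two conditioning weights so the
    structure/alignment trade-off stays near a chosen balance point.

    denoise(x, t, w_img, w_txt) -> next latent under the current weights
    structure_score(x)          -> e.g. an SNR-style preservation proxy
    align_score(x)              -> e.g. a VLM logit for the edit prompt
    """
    for t in timesteps:
        x = denoise(x, t, w_img, w_txt)

        # Feedback: how much source structure survives relative to how
        # well the current estimate matches the editing instruction.
        imbalance = structure_score(x) / (align_score(x) + 1e-8) - target_ratio

        # Too structure-heavy -> relax the image condition and boost the
        # text condition; the reverse if the edit drifts from the source.
        w_img -= lr * imbalance
        w_txt += lr * imbalance
    return x
```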
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 2331