Keywords: instruction-guided image editing, diffusion models
TL;DR: We propose a plug-and-play framework to improve fidelity and textual adherence in instruction-guided image editing models.
Abstract: Instruction-guided diffusion models have demonstrated strong capabilities in generating targeted image edits based on diverse textual prompts. A fundamental challenge in this setting is achieving the right balance between adhering to textual instructions and preserving the original content of the input image. InstructPix2Pix (IP2P) addresses this by applying separate classifier-free guidance (CFG) terms to the text and image conditions, each scaled independently. However, this limited parametrization restricts user control, as increasing one guidance scale often causes the corresponding condition to dominate the output, resulting in imbalanced edits. Independently, Adaptive Projected Guidance (APG) was recently introduced to mitigate inherent limitations of CFG at high guidance scales in text- and class-conditioned diffusion models, reframing CFG as a gradient ascent process with decomposed guidance directions and improved signal control. In this work, we present IP2P-APG, a plug-and-play extension of IP2P that repurposes APG to improve the balance between instruction adherence and content preservation in image editing tasks. IP2P-APG significantly expands the controllable parameter space, giving users more precise control over the editing process. Moreover, by enabling the use of higher guidance scales without introducing artifacts or compromising fidelity to the original content, IP2P-APG achieves a more effective trade-off between textual alignment and content preservation. Extensive experiments across multiple generative backbones and datasets demonstrate that our method consistently produces more realistic and instruction-faithful edits, without additional training and with negligible computational overhead. Code will be released after the review process.
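To make the two mechanisms named in the abstract concrete, the sketch below illustrates (1) IP2P's two-scale CFG, which combines an unconditional prediction, an image-conditioned prediction, and a fully conditioned prediction under independent image and text scales, and (2) an APG-style projection that splits a guidance direction into components parallel and orthogonal to the conditional prediction so the parallel part can be down-weighted. This is a minimal illustrative sketch, not the authors' code: all function and variable names are assumptions, and details of the full APG update (e.g. norm rescaling and momentum) are omitted.

```python
import numpy as np

def ip2p_cfg(eps_uncond, eps_img, eps_full, s_img, s_txt):
    """IP2P-style classifier-free guidance with two independent scales.

    eps_uncond: denoiser prediction with neither condition
    eps_img:    prediction with the image condition only
    eps_full:   prediction with both image and text conditions
    s_img, s_txt: image and text guidance scales
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))

def apg_project(diff, pred_cond, eta=0.0):
    """APG-style decomposition of a guidance direction (sketch).

    Splits `diff` into components parallel and orthogonal to the
    conditional prediction `pred_cond`, then attenuates the parallel
    component by `eta` (eta=0 keeps only the orthogonal part).
    """
    v = pred_cond / np.linalg.norm(pred_cond)   # unit vector along pred_cond
    parallel = np.dot(diff, v) * v              # component along pred_cond
    orthogonal = diff - parallel                # component orthogonal to it
    return orthogonal + eta * parallel
```

With both scales set to 1, `ip2p_cfg` reduces to the fully conditioned prediction, which is one quick sanity check on the combination formula.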
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20466