Efficient Self-Guided Editing for Text-Driven Image-to-Image Translation

18 Sept 2025 (modified: 19 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Diffusion Model, Image-to-Image Translation
Abstract: Diffusion-based generative models achieve impressive text-driven image synthesis, largely owing to classifier-free guidance (CFG), which enhances semantic alignment by blending conditional and unconditional denoising predictions. In text-guided image editing, however, CFG frequently induces structural drift, because the unconditional branch produces predictions that are spatially misaligned with the source image. Prior approaches mitigate this by adding a reconstruction branch to steer the unconditional predictions, but this consumes substantial GPU memory and computation. Our empirical studies reveal an inherent trade-off between semantic accuracy and structural integrity and pinpoint the null-text branch as the key culprit. We introduce a Target-Guided Unconditional Branch that repurposes semantic cues from the target prompt and initial denoising inputs from the source image to ensure spatial consistency. This method delivers superior editing quality without extra computational burden, serving as an efficient substitute for conventional CFG-based editing pipelines. Experiments on PIE-Bench demonstrate that our method outperforms state-of-the-art baselines in structure preservation and background retention while maintaining comparable semantic alignment, all with reduced inference time and GPU memory usage.
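
For context, the classifier-free guidance blend referenced in the abstract is sketched below in its standard form; the notation (noise estimates $\epsilon_\theta$, latent $x_t$, target prompt $c$, null text $\varnothing$, guidance weight $w$) follows common diffusion-model usage and is an illustrative assumption, not the paper's own formulation.

```latex
% Standard classifier-free guidance (CFG): the final noise estimate blends
% the conditional and unconditional (null-text) predictions with weight w.
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```

The abstract identifies the null-text term $\epsilon_\theta(x_t, \varnothing)$ as the source of structural drift in editing; the proposed Target-Guided Unconditional Branch replaces this branch with one informed by the target prompt and source-image denoising inputs, though its exact form is not given on this page.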
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 10680