Editing by Reconstruction: Background Preservation for Instruction-based Autoregressive Image Editing

16 Sept 2025 (modified: 14 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: Instruction-based image editing; autoregressive model
Abstract: Autoregressive (AR) editors have recently emerged as strong competitors to diffusion models for text-based image editing, yet they often introduce unintended changes in non-edited regions due to stochastic token sampling. We present ERec (Editing by Reconstruction), a background-preservation method that synchronizes sampling between the reconstruction and editing paths. Concretely, we run a reconstruction path alongside the standard editing path and inject identical standard-Gumbel noise into the logits of both at every decoding step. This Gumbel-max procedure is equivalent to multinomial sampling, so it preserves diversity while coupling the two chains: when the logits are similar (typically in background regions), token choices align; when they differ (true edit regions), choices diverge and editability is retained. After generation, a lightweight post-refinement step localizes edits by combining distributional discrepancy with background confidence, followed by connectivity filtering and residual compositing to correct encoder quantization residuals. ERec requires no fine-tuning of the baseline, integrates seamlessly with top-$k$ or nucleus sampling, and adds negligible inference overhead. Experimental results show that it substantially improves background preservation while maintaining edit fidelity.
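The coupled-sampling idea in the abstract can be illustrated with a small sketch. The function name and toy logits below are hypothetical (the paper's actual implementation is not shown here); the sketch only demonstrates the Gumbel-max trick the abstract describes: adding one shared standard-Gumbel noise vector to two logit vectors yields an exact multinomial sample from each, while identical (or near-identical) logits produce identical token choices.

```python
import numpy as np

def shared_gumbel_sample(logits_recon, logits_edit, rng):
    """Hypothetical sketch of shared-noise coupled sampling.

    Draws ONE standard-Gumbel noise vector and adds it to both logit
    vectors. By the Gumbel-max trick, argmax(logits + g) is an exact
    sample from softmax(logits), so each chain keeps its own sampling
    distribution while the shared noise couples their choices.
    """
    u = rng.uniform(size=logits_recon.shape)
    g = -np.log(-np.log(u))  # standard Gumbel(0, 1) noise
    token_recon = int(np.argmax(logits_recon + g))
    token_edit = int(np.argmax(logits_edit + g))
    return token_recon, token_edit

rng = np.random.default_rng(0)
# Background region: the two paths see (nearly) the same logits,
# so the shared noise makes their token choices coincide.
bg_logits = np.array([2.0, 0.1, -1.0, 0.5])
tok_r, tok_e = shared_gumbel_sample(bg_logits, bg_logits, rng)
# Edit region: the logits differ substantially, so the choices may
# diverge and the edit instruction can take effect.
edit_logits = np.array([-1.0, 3.0, 0.2, 0.5])
tok_r2, tok_e2 = shared_gumbel_sample(bg_logits, edit_logits, rng)
```

Because the same noise vector is reused across both paths at each decoding step, this coupling is a form of common random numbers: it adds no extra model evaluations beyond the reconstruction pass itself, consistent with the abstract's claim of negligible overhead.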
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6495