Editing by Reconstruction: Background Preservation for Instruction-based Autoregressive Image Editing
Keywords: instruction-based image editing; autoregressive models
Abstract: Autoregressive (AR) editors have recently emerged as strong competitors to diffusion models for text-based image editing, yet they often introduce unintended changes in non-edited regions due to stochastic token sampling.
We present ERec (Editing by Reconstruction), a background-preservation method that synchronizes sampling between reconstruction and editing and requires no additional fine-tuning.
Concretely, we run a reconstruction path alongside the standard editing path and inject identical standard-Gumbel noise into both logits at every decoding step.
This Gumbel-max procedure is multinomial-equivalent, so it keeps diversity while coupling the two chains: when the logits are similar (typically in background regions), token choices align; when they differ (true edit regions), choices diverge and editability is retained.
After generation, a lightweight post-refinement localizes edits by combining distributional discrepancy with background confidence, followed by connectivity filtering and residual compositing to correct encoder quantization residuals.
ERec requires no fine-tuning of the baseline, integrates seamlessly with top-$k$ or nucleus sampling, and adds negligible inference overhead.
Experimental results show that it substantially improves background preservation while maintaining edit fidelity.
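The shared-noise coupling in the abstract can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the vocabulary size, logit values, and perturbation scale are made-up assumptions chosen only to show that identical Gumbel noise makes the argmax agree when two logit vectors are close and lets it diverge when they differ.

```python
import numpy as np

def gumbel_max_sample(logits, gumbel_noise):
    # Gumbel-max trick: argmax(logits + g) with g ~ Gumbel(0, 1)
    # is an exact sample from softmax(logits) (multinomial-equivalent).
    return int(np.argmax(logits + gumbel_noise))

rng = np.random.default_rng(0)
vocab = 8  # toy vocabulary size (assumption, for illustration)

# One shared standard-Gumbel draw injected into BOTH decoding paths.
g = rng.gumbel(size=vocab)

# Background-like case: editing logits barely differ from reconstruction.
recon_logits = rng.normal(size=vocab)
edit_logits_bg = recon_logits + 0.01 * rng.normal(size=vocab)
tok_recon = gumbel_max_sample(recon_logits, g)
tok_edit_bg = gumbel_max_sample(edit_logits_bg, g)

# Edit-region-like case: logits differ substantially, so the coupled
# chains are free to choose different tokens.
edit_logits_fg = rng.normal(size=vocab)
tok_edit_fg = gumbel_max_sample(edit_logits_fg, g)

print(tok_recon, tok_edit_bg, tok_edit_fg)
```

Because both paths consume the same noise, each path in isolation is still an unbiased multinomial sample, yet token choices in near-identical (background) positions align with high probability.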
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6495