TokenDrop: Efficient Image Editing by Source Token Drop with Consistency Regularization

16 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Image Editing, Transformer, Regularized Sampling
Abstract: Text-based image editing has recently been reinterpreted in large multimodal transformers as conditional generation, where source image tokens are concatenated with text and noise tokens as conditioning inputs. While effective, this design incurs substantial computational overhead in the attention layers. To mitigate this drawback, we present TokenDrop, an efficient text-based image editing method that partially drops source tokens, with the tokens to drop selected adaptively based on the difference between the source and the clean estimate. Importantly, by reformulating the flow ODE as a latent optimization problem, we can inject information from the dropped tokens into the solution of the regularized optimization. Because this optimization admits a closed-form solution, it introduces no additional computational cost. Across FluxKontext and Qwen-Image-Edit, our training-free method achieves an average 22.4\% improvement in inference speed on PIEBench while better preserving non-edited regions. When the edited area is relatively small, the method delivers up to a 1.8$\times$ speedup at 1024$^2$ resolution and a 2$\times$ speedup at 2048$^2$ resolution.
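The adaptive selection step described in the abstract can be sketched as follows. This is a minimal hypothetical illustration, not the paper's implementation: it assumes the per-token difference between the source tokens and the current clean estimate is scored with an L2 norm and that a fixed fraction of the most-changed tokens is kept as conditioning; the function name, `keep_ratio` parameter, and scoring rule are assumptions.

```python
import numpy as np

def select_tokens_to_keep(source_tokens, clean_estimate, keep_ratio=0.5):
    """Adaptively choose which source tokens to keep as conditioning.

    Tokens whose clean estimate differs most from the source are assumed to
    lie in the edited region and are kept; the rest are dropped to shrink
    the attention sequence. Hypothetical sketch: the paper's exact
    criterion and drop schedule are not specified here.
    """
    # Per-token L2 difference between source and current clean estimate.
    diff = np.linalg.norm(source_tokens - clean_estimate, axis=-1)  # shape (N,)
    n_keep = max(1, int(keep_ratio * diff.shape[0]))
    # Indices of the n_keep most-changed tokens, returned in ascending order.
    keep_idx = np.sort(np.argpartition(-diff, n_keep - 1)[:n_keep])
    return keep_idx, source_tokens[keep_idx]
```

Under this sketch, a localized edit (small changed region) yields a small kept set, which is consistent with the abstract's observation that speedups are largest when the edited area is small.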
Primary Area: generative models
Submission Number: 7532