EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

02 Sept 2025 (modified: 12 Feb 2026) | ICLR 2026 Conference Desk Rejected Submission | Readers: Everyone | License: CC BY 4.0
Keywords: Image Edit; Generative AI
TL;DR: We introduce EditMGT, the first Masked Generative Transformer for image editing that uses adaptive localized token flipping to achieve precise edits while preserving non-target regions.
Abstract: Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we look beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to this challenge. Because MGTs predict multiple masked tokens rather than refining the whole image at once, they follow a localized decoding paradigm that gives them an inherent capacity to explicitly preserve non-relevant regions during editing. Building on this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that an MGT's cross-attention maps provide informative signals for localizing edit-relevant regions, and we devise a *multi-layer attention consolidation* scheme that refines these maps to achieve fine-grained, precise localization. On top of these adaptive localization results, we introduce *region-hold sampling*, which limits token flipping in low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct Crisp-2M, a high-resolution ($\geq$1024) dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than $1$B parameters, our model achieves state-of-the-art image similarity performance while enabling $6\times$ faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of $3.6\%$ and $17.6\%$ on style change and style transfer tasks, respectively.
More information is available on the anonymous project page: [https://anoy1314.github.io](https://anoy1314.github.io).
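The abstract names two mechanisms, multi-layer attention consolidation and region-hold sampling, without giving their exact form. The sketch below is only an illustration of the general idea and not the authors' implementation. It assumes consolidation is a plain average of per-layer cross-attention maps followed by min-max normalization, and that region-hold sampling keeps the source token wherever the consolidated attention falls below a fixed threshold. The function names, the averaging rule, and the `threshold` parameter are all hypothetical.

```python
import numpy as np


def consolidate_attention(layer_maps):
    """Average per-layer attention maps and min-max normalize to [0, 1].

    Assumption: simple averaging; the paper's consolidation scheme may differ.
    """
    stacked = np.stack(layer_maps, axis=0)  # shape: (layers, H, W)
    mean_map = stacked.mean(axis=0)
    span = mean_map.max() - mean_map.min()
    if span == 0:
        # A flat map carries no localization signal.
        return np.zeros_like(mean_map, dtype=float)
    return (mean_map - mean_map.min()) / span


def region_hold(source_tokens, proposed_tokens, attention, threshold=0.5):
    """Hold source tokens in low-attention areas; accept proposals elsewhere.

    Assumption: a hard threshold decides which tokens may flip.
    """
    editable = attention >= threshold
    held = np.where(editable, proposed_tokens, source_tokens)
    return held, editable


# Toy 2x2 token grid: only the bottom-right token receives high attention.
layer_maps = [
    np.array([[0.0, 0.1], [0.1, 0.9]]),
    np.array([[0.0, 0.1], [0.1, 0.7]]),
]
attention = consolidate_attention(layer_maps)
source = np.array([[1, 2], [3, 4]])
proposed = np.array([[9, 9], [9, 9]])
edited, editable = region_hold(source, proposed, attention)
```

In this toy case only the high-attention token is allowed to change, which mirrors the stated goal of confining edits to the target region while leaving other tokens untouched.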
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 696