MAG-Edit: Localized Image Editing in Complex Scenarios via $\underline{M}$ask-Based $\underline{A}$ttention-Adjusted $\underline{G}$uidance
Abstract: Recent diffusion-based image editing approaches have exhibited impressive editing capabilities on images with a single dominant object in simple compositions. However, localized editing in images containing multiple objects and intricate compositions has not been well studied in the literature, despite its growing real-world demand. Existing mask-based inpainting methods fall short of retaining the underlying structure within the edit region, causing noticeable discordance with the complex surroundings. Meanwhile, attention-based methods such as Prompt-to-Prompt (P2P) often exhibit editing leakage and misalignment in more complex compositions. In this work, we propose MAG-Edit, a plug-and-play, inference-stage optimization method that empowers attention-based editing approaches, such as P2P, to perform localized image editing in intricate scenarios. In particular, MAG-Edit optimizes the noise latent feature by maximizing two mask-based cross-attention ratios of the edit token, which in turn gradually enhances local alignment with the desired prompt. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method in achieving both text alignment and structure preservation for localized editing within complex scenarios.
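The following is a minimal sketch, not the authors' released implementation, of how the two mask-based cross-attention ratios described in the abstract could be computed and used to guide the noise latent at inference time. All names (attn_maps, mask, token_idx, guidance_step) are hypothetical; the sketch assumes the cross-attention maps were collected from a U-Net forward pass on the latent with gradients enabled, and that the update is applied at each denoising step.

import torch

def mask_attention_ratios(attn_maps, mask, token_idx):
    # attn_maps: list of cross-attention tensors of shape (heads, H*W, tokens)
    # mask: binary edit-region mask flattened to shape (H*W,)
    # token_idx: index of the edit token in the target prompt
    spatial_ratios, token_ratios = [], []
    for attn in attn_maps:
        attn = attn.mean(dim=0)                  # average over heads -> (H*W, tokens)
        token_attn = attn[:, token_idx]          # edit token's spatial attention
        # Ratio 1: edit token's attention mass inside the mask vs. over the whole image.
        spatial_ratios.append((token_attn * mask).sum() / (token_attn.sum() + 1e-8))
        # Ratio 2: edit token's attention vs. all tokens, restricted to the mask region.
        inside = attn[mask.bool()]
        token_ratios.append(inside[:, token_idx].sum() / (inside.sum() + 1e-8))
    return torch.stack(spatial_ratios).mean(), torch.stack(token_ratios).mean()

def guidance_step(latent, attn_maps, mask, token_idx, step_size=0.1):
    # One inference-stage update: nudge the noise latent so both ratios grow,
    # strengthening local alignment with the edit token inside the mask.
    r_spatial, r_token = mask_attention_ratios(attn_maps, mask, token_idx)
    loss = -(r_spatial + r_token)                # maximizing the two ratios
    grad = torch.autograd.grad(loss, latent)[0]
    return latent - step_size * grad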
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Our research is strongly relevant to the conference themes "Multimodal Fusion in Vision and Language" and "Multimedia in the Generative AI Era": it leverages a large-scale text-to-image (T2I) diffusion model as a cross-modal framework that integrates vision and language for text-based image editing.
Building on the pre-trained T2I diffusion model, we specifically address the challenges of localized image editing in complex scenarios involving multiple objects and intricate compositions.
Our work offers practical solutions to this growing real-world demand. The proposed method, MAG-Edit, substantially enhances attention-based editing approaches such as Prompt-to-Prompt (P2P) by optimizing the noise latent feature under two mask-based cross-attention constraints, leading to improved text alignment and better structure preservation within the edit region.
The application of MAG-Edit directly addresses the need for advanced text-based image editing in complex scenarios, aligning perfectly with the ACM MM community's focus on multimedia and multimodal processing. By emphasizing localized editing in this context, we make a substantial contribution to the ongoing development and advancement of multimedia and multimodal processing techniques.
Supplementary Material: zip
Submission Number: 554