Attentive Linguistic Tracking in Diffusion Models for Training-free Text-guided Image Editing

Published: 20 Jul 2024 · Last Modified: 01 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Building on recent breakthroughs in diffusion-based text-to-image synthesis (TIS), training-free text-guided image editing (TIE) has become an indispensable part of modern image editing practice. It modifies the features in attention layers to alter objects or their attributes within images during the generation process. Yet current image editing algorithms still struggle to edit multiple objects within a single image. In this paper, we propose VICTORIA, a novel approach that enhances TIE by incorporating linguistic knowledge when manipulating attention maps during image generation. VICTORIA leverages components within self-attention layers to maintain spatial consistency between source and target images. Additionally, a novel loss function is designed to refine cross-attention maps, ensuring their alignment with linguistic constraints and improving the editing of multiple target entities. We also introduce a linguistic mask blending technique to better preserve information in regions exempt from modification. Experimental results across seven diverse datasets demonstrate that VICTORIA achieves substantial improvements over state-of-the-art methods. This work highlights the critical role and effectiveness of linguistic analysis in boosting the performance of TIE.
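The abstract mentions a loss that refines cross-attention maps so they align with linguistic constraints. The paper's actual loss is not given here, so the following is only an illustrative sketch under assumed conventions: a cross-attention map of shape `[spatial_positions, prompt_tokens]`, a hypothetical `token_groups` mapping each target entity to its prompt-token indices, and a penalty that encourages each entity's tokens to attend strongly somewhere in the image.

```python
import numpy as np

def linguistic_alignment_loss(cross_attn, token_groups):
    """Illustrative sketch (not the paper's loss): encourage each edited
    entity's tokens to dominate some spatial region of the cross-attention
    map. cross_attn has shape [H*W, num_tokens]; token_groups maps an
    entity name to the indices of its prompt tokens (both hypothetical)."""
    loss = 0.0
    for tokens in token_groups.values():
        # Aggregate attention over all tokens belonging to this entity.
        entity_attn = cross_attn[:, tokens].sum(axis=1)
        # Penalize entities whose strongest spatial response is weak.
        loss += (1.0 - entity_attn.max()) ** 2
    return loss

# Toy example: 16 spatial positions, 4 prompt tokens, two target entities.
rng = np.random.default_rng(0)
attn = rng.random((16, 4))
attn /= attn.sum(axis=1, keepdims=True)  # row-normalize like a softmax
groups = {"cat": [1], "hat": [2, 3]}
print(linguistic_alignment_loss(attn, groups))
```

In a real diffusion editing pipeline such a loss would typically be backpropagated to the latent at each denoising step; the NumPy version above only illustrates the shape of the objective.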
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This paper focuses on image editing tasks built on text-to-image generation models, achieved through text-guided editing techniques. Text-guided image editing is an important branch of multimedia research: it integrates image processing with natural language processing and drives the field forward. Our study involves understanding user text instructions and enabling automatic image editing and adjustment through text-to-image generation models, such as altering colors and shapes in images or adding and removing specific elements.
Submission Number: 1171