Attentive Linguistic Tracking in Diffusion Models for Training-free Text-guided Image Editing

Published: 01 Jan 2024, Last Modified: 13 Nov 2024 · ACM Multimedia 2024 · CC BY-SA 4.0
Abstract: Building on recent breakthroughs in diffusion-based text-to-image synthesis (TIS), training-free text-guided image editing (TIE) has emerged as an indispensable aspect of modern image editing practices. This technique involves the modification of features within attention layers to alter objects or their attributes within images during the generation process. Despite its utility, current image editing algorithms face challenges, particularly when editing multiple objects in an image. In this paper, we introduce VICTORIA, a novel approach that augments TIE by incorporating linguistic knowledge into the manipulation of attention maps during image generation. VICTORIA capitalizes on mechanisms within self-attention layers to ensure spatial consistency between source and target images. Further, we design a novel loss function that refines cross-attention maps, ensuring their alignment with linguistic constraints, thereby enhancing the editing precision of multiple target objects. We also present a linguistic mask blending technique that aids in the retention of information in regions not subjected to modification. Experimental results across seven diverse datasets show that VICTORIA achieves significant improvements over state-of-the-art methods. Our work underscores the critical role and effectiveness of linguistic analysis in elevating the performance of TIE, with a specific emphasis on multi-object scenarios. The code is available at https://github.com/alibaba/EasyNLP/tree/master/diffusion/VICTORIA.
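The abstract's core mechanism, steering cross-attention maps toward linguistically derived spatial constraints and then blending unedited regions back from the source, can be sketched roughly as below. This is a minimal illustration and not the released implementation (see the GitHub link above): the function names, the (heads, pixels, tokens) tensor layout, the MSE form of the alignment loss, and the binary edit masks are all assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def linguistic_attention_loss(cross_attn, token_indices, target_masks):
    """Align each edited token's cross-attention map with the spatial
    region that linguistic analysis assigns to it (illustrative only).

    cross_attn:    (heads, H*W, n_tokens) attention probabilities
    token_indices: token positions of the edited objects in the prompt
    target_masks:  (n_targets, H, W) binary masks, one per edited object
    """
    heads, hw, _ = cross_attn.shape
    h = w = int(hw ** 0.5)                 # assumes square attention maps
    loss = cross_attn.new_zeros(())
    for k, t in enumerate(token_indices):
        attn = cross_attn[:, :, t].mean(dim=0).view(h, w)  # average over heads
        attn = attn / (attn.max() + 1e-8)                  # normalize to [0, 1]
        loss = loss + F.mse_loss(attn, target_masks[k].to(attn))
    return loss

def linguistic_mask_blend(latent_src, latent_edit, edit_mask):
    """Keep source latents outside the edit region, edited latents inside."""
    return edit_mask * latent_edit + (1.0 - edit_mask) * latent_src
```

In guidance-style editing pipelines, a loss of this kind is typically back-propagated to the noisy latent at each denoising step, nudging the attention of each target token toward its intended object before the mask blending reinstates the unmodified background.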