DAFT-GAN: Dual Affine Transformation Generative Adversarial Network for Text-Guided Image Inpainting

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: In recent years, there has been significant research interest in text-guided image inpainting, which plays a pivotal role in multimedia processing and has led to notable improvements in the quality of generated images. However, the task remains challenging due to several constraints, such as ensuring alignment between the generated images and the accompanying text and maintaining distributional consistency between corrupted and uncorrupted regions, both of which are required for natural and fine-grained image generation. To address these challenges, previous studies have developed novel architectures, inpainting techniques, and objective functions, but their results still lack semantic consistency between the text and the generated images. In this paper, we therefore propose a dual affine transformation generative adversarial network (DAFT-GAN) that maintains semantic consistency in text-guided inpainting. DAFT-GAN integrates two affine transformation networks that progressively combine text and image features in each decoding block. The first affine transformation network leverages global (sentence-level) features of the text to generate coarse results, while the second affine transformation network uses attention mechanisms and spatial (word-level) features of the text to refine the coarse results. By connecting the features generated by these dual paths to the subsequent block through residual connections, the model retains information at each scale while enhancing the quality of the generated image. Moreover, we minimize leakage of uncorrupted features, enabling fine-grained image generation, by encoding the corrupted and uncorrupted regions of the masked image separately. Through extensive experiments, we observe that our proposed model outperforms existing models in both qualitative and quantitative assessments on three benchmark datasets (MS-COCO, CUB, and Oxford) for text-guided image inpainting.
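To make the decoding block described in the abstract more concrete, the following minimal PyTorch-style sketch shows one plausible wiring of the dual affine modulation: a global (sentence-level) affine path that produces coarse features, a spatial (word-level, attention-based) affine path that refines them, and a residual connection that fuses the result into the block output. All module names, tensor dimensions, and the exact fusion scheme are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAffine(nn.Module):
    # Predicts per-channel scale/shift from the global sentence embedding
    # and applies them to the image features (coarse path).
    def __init__(self, text_dim, feat_channels):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, feat_channels)
        self.to_beta = nn.Linear(text_dim, feat_channels)

    def forward(self, feat, sent_emb):
        # feat: (B, C, H, W), sent_emb: (B, text_dim)
        gamma = self.to_gamma(sent_emb).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(sent_emb).unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + gamma) + beta

class SpatialAffine(nn.Module):
    # Attends over word embeddings to build spatially varying
    # scale/shift maps that refine the coarse result (refinement path).
    def __init__(self, text_dim, feat_channels):
        super().__init__()
        self.query = nn.Conv2d(feat_channels, text_dim, 1)
        self.to_gamma = nn.Conv2d(text_dim, feat_channels, 1)
        self.to_beta = nn.Conv2d(text_dim, feat_channels, 1)

    def forward(self, feat, word_embs):
        # feat: (B, C, H, W), word_embs: (B, L, text_dim)
        B, C, H, W = feat.shape
        q = self.query(feat).flatten(2).transpose(1, 2)               # (B, HW, text_dim)
        attn = torch.softmax(q @ word_embs.transpose(1, 2), dim=-1)   # (B, HW, L)
        ctx = (attn @ word_embs).transpose(1, 2).reshape(B, -1, H, W) # (B, text_dim, H, W)
        return feat * (1 + self.to_gamma(ctx)) + self.to_beta(ctx)

class DualAffineBlock(nn.Module):
    # One decoding block: coarse (global) and refined (spatial) paths,
    # fused back into the input through a residual connection.
    def __init__(self, text_dim, feat_channels):
        super().__init__()
        self.global_affine = GlobalAffine(text_dim, feat_channels)
        self.spatial_affine = SpatialAffine(text_dim, feat_channels)
        self.conv = nn.Conv2d(feat_channels, feat_channels, 3, padding=1)

    def forward(self, feat, sent_emb, word_embs):
        coarse = self.global_affine(feat, sent_emb)
        refined = self.spatial_affine(coarse, word_embs)
        return feat + self.conv(F.leaky_relu(refined, 0.2))

# Example with dummy tensors (hypothetical sizes):
# block = DualAffineBlock(text_dim=256, feat_channels=64)
# out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 256), torch.randn(2, 18, 256))

In practice such a block would be stacked per decoder scale, with separate encoders for the corrupted and uncorrupted regions feeding the initial features, as the abstract describes.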
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our research addresses a fundamental challenge in multimedia processing: text-guided image inpainting. This area lies at the intersection of computer vision, natural language processing, and multimedia content generation, making it highly relevant to the diverse audience of the ACM Multimedia conference. Our paper focuses on improving the quality and performance of generated images, which aligns with the conference's emphasis on advancing techniques for multimedia content creation and enhancement. By introducing a dual affine transformation generative adversarial network (DAFT-GAN), our work contributes a novel approach to these challenges and should attract researchers interested in cutting-edge methods for multimedia processing. The proposed method integrates insights from multiple disciplines, including computer vision, machine learning, and image processing, to tackle the complex task of text-guided image inpainting. Such interdisciplinary research is encouraged and celebrated within the ACM Multimedia community, fostering collaboration and knowledge exchange across fields. The observed improvements in both qualitative and quantitative assessments provide empirical evidence of the effectiveness of DAFT-GAN, strengthening its applicability in real-world multimedia applications. Overall, the presented work offers valuable contributions to the field of multimedia processing, making it highly relevant for presentation and discussion at the ACM Multimedia conference.
Supplementary Material: zip
Submission Number: 5413