DiffHarmony++: Enhancing Image Harmonization with Harmony-VAE and Inverse Harmonization Model

Pengfei Zhou; Fangxiang Feng; Guang Liu; Ruifan Li; Xiaojie Wang

Published: 20 Jul 2024, Last Modified: 21 Jul 2024. Venue: MM2024 Poster. License: CC BY 4.0

Abstract: Latent diffusion models have demonstrated impressive efficacy in image generation and editing tasks, and have recently advanced image harmonization as well. However, all methods built on latent diffusion models face a common challenge: the VAE component introduces severe image distortion, whereas image harmonization is a low-level image processing task that is evaluated with pixel-level metrics. In this paper, we propose Harmony-VAE, which leverages the input of the harmonization task itself to improve the quality of decoded images. The input composite image contains precise pixel-level information, which complements the correct foreground appearance and color information contained in the denoised latents. Meanwhile, the inherently generative nature of diffusion models makes them naturally suited to inverse image harmonization, i.e. generating synthetic composite images from real images and foreground masks. We train an inverse harmonization diffusion model to perform data augmentation on two subsets of iHarmony4 and to construct a new human harmonization dataset with prominent foreground objects. Extensive experiments demonstrate the effectiveness of the proposed Harmony-VAE and inverse harmonization model. The code, pretrained models, and the new dataset will be made publicly available.
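To make the phrase "pixel-level evaluation metrics" concrete, the following is a minimal sketch of three metrics commonly reported for harmonization: MSE, MSE restricted to the foreground mask, and PSNR. The abstract does not name the metrics used in the paper, so this choice is an assumption, and the function names, the flat-list image representation, and the toy pixel values are all hypothetical illustrations rather than the authors' evaluation code.

```python
import math

def mse(pred, target):
    """Mean squared error over all pixel values (flat sequences, 0-255 range)."""
    if len(pred) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def foreground_mse(pred, target, mask):
    """MSE restricted to pixels where the binary foreground mask is nonzero."""
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("images and mask must have the same number of pixels")
    pairs = [(p, t) for p, t, m in zip(pred, target, mask) if m]
    if not pairs:
        raise ValueError("mask selects no foreground pixels")
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)

def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(pred, target)
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)

# Toy 2x2 grayscale image, flattened row by row (hypothetical values).
real = [100.0, 100.0, 50.0, 50.0]
harmonized = [100.0, 100.0, 54.0, 50.0]  # one foreground pixel is off by 4
mask = [0, 0, 1, 1]                      # bottom row is the foreground

print(mse(harmonized, real))                   # 4.0
print(foreground_mse(harmonized, real, mask))  # 8.0
print(psnr(harmonized, real))
```

Because every pixel contributes to these scores, even small reconstruction errors from encoding and decoding an image through a VAE lower them, which is the sensitivity the abstract points to.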

Primary Subject Area: [Generation] Generative Multimedia

Secondary Subject Area: [Experience] Multimedia Applications

Relevance To Conference: Image harmonization is a task that mainly involves the image modality, yet the foundation model used in this paper is a typical multimodal model, namely Stable Diffusion. This paper demonstrates how to adapt Stable Diffusion to image harmonization, offering a new perspective on applying multimodal pretrained generative models to image-to-image translation tasks and expanding the application scope of pretrained text-to-image models. Moreover, the ACM MM conference has published several papers on image harmonization in the past few years.

Supplementary Material: zip

Submission Number: 4112
