PROMOTE: Prior-Guided Diffusion Model with Global-Local Contrastive Learning for Exemplar-Based Image Translation

Guojin Zhong; Yihu Guo; Jin Yuan; Qianjun Zhang; Weili Guan; Long Chen

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, License: CC BY 4.0

Abstract: Exemplar-based image translation has attracted significant research interest because of its broad applications in multimedia and multimodal processing. Existing methods primarily use Euclidean-based losses to implicitly establish cross-domain correspondences between exemplar and conditional images, with the aim of producing high-fidelity images. However, these methods often face two challenges: 1) insufficient extraction of domain-invariant features leads to low-quality cross-domain correspondences, and 2) without reliable prior guidance, inaccurate correspondences cause errors that propagate through the translation process. To tackle these issues, we propose a novel prior-guided diffusion model with global-local contrastive learning (PROMOTE), which is trained in a self-supervised manner. Technically, global-local contrastive learning aligns the two cross-domain images in hyperbolic space and reduces the gap between their semantic correlation distributions using the Fisher-Rao metric, allowing the visual encoders to extract domain-invariant features more effectively. Moreover, we develop a prior-guided diffusion model that propagates the structural prior to all timesteps of the diffusion process. It is optimized in a self-supervised manner by a novel prior denoising loss, mathematically derived from the transitions modified by the prior information, which alleviates the impact of inaccurate correspondences on image translation. Extensive experiments across seven datasets demonstrate that PROMOTE significantly outperforms state-of-the-art methods on diverse exemplar-based image translation tasks.
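The abstract names two geometric tools without giving formulas: distances in hyperbolic space and the Fisher-Rao metric between distributions. As a rough illustration only, the sketch below uses the standard textbook definitions of the Poincaré-ball distance and of the Fisher-Rao geodesic distance between discrete distributions. The function names, the choice of the Poincaré-ball model, and the example inputs are assumptions made for this sketch; the listing does not state how PROMOTE actually defines or combines these terms.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumption: the Poincare-ball model with curvature -1 is used here only
    as a common example of a hyperbolic space.
    """
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    if norm_u >= 1.0 or norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - norm_u) * (1.0 - norm_v)))


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two discrete distributions.

    Uses the closed form 2 * arccos(sum_i sqrt(p_i * q_i)), which holds for
    categorical distributions over the same support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Bhattacharyya coefficient, clamped to [0, 1] against rounding error.
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    bc = min(1.0, max(0.0, bc))
    return 2.0 * math.acos(bc)


if __name__ == "__main__":
    # Hypothetical example inputs, not taken from the paper.
    print(poincare_distance([0.0, 0.0], [0.5, 0.0]))
    print(fisher_rao_distance([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```

Both functions return zero for identical inputs and grow as the inputs diverge, which is the property a training objective of this kind would minimize. How the two terms are weighted or applied at global and local scales is not described in this listing.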

Primary Subject Area: [Generation] Generative Multimedia

Secondary Subject Area: [Generation] Social Aspects of Generative AI

Relevance To Conference: First, the exemplar-based image translation task converts images from one modality to another under the guidance of exemplars, which facilitates cross-modal understanding and interaction in multimedia applications. Second, the proposed global-local contrastive learning aligns two cross-domain images in hyperbolic space and reduces the gap between their semantic correlation distributions using the Fisher-Rao metric. Third, the proposed prior-guided diffusion model, trained with a novel prior denoising loss mathematically derived from the transitions modified by the structural prior, alleviates the impact of inaccurate correspondences on image translation. This work offers both theoretical value and practical experience for multimedia and multimodal processing.

Supplementary Material: zip

Submission Number: 944
