PROMOTE: Prior-Guided Diffusion Model with Global-Local Contrastive Learning for Exemplar-Based Image Translation
Abstract: Exemplar-based image translation has garnered significant interest from researchers due to its broad applications in multimedia/multimodal processing. Existing methods primarily employ Euclidean-based losses to implicitly establish cross-domain correspondences between exemplar and conditional images, aiming to produce high-fidelity images. However, these methods often suffer from two challenges: 1) insufficient extraction of domain-invariant features leads to low-quality cross-domain correspondences, and 2) inaccurate correspondences cause errors that propagate through the translation process because reliable prior guidance is lacking. To tackle these issues, we propose a novel prior-guided diffusion model with global-local contrastive learning (PROMOTE), which is trained in a self-supervised manner. Technically, global-local contrastive learning is designed to align two cross-domain images within hyperbolic space and reduce the gap between their semantic correlation distributions using the Fisher-Rao metric, allowing the visual encoders to extract domain-invariant features more effectively. Moreover, a prior-guided diffusion model is developed that propagates the structural prior to all timesteps in the diffusion process. It is optimized by a novel prior denoising loss, mathematically derived in a self-supervised manner from the transitions modified by prior information, which alleviates the impact of inaccurate correspondences on image translation. Extensive experiments conducted across seven datasets demonstrate that our proposed PROMOTE significantly exceeds state-of-the-art performance in diverse exemplar-based image translation tasks.
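The abstract names two geometric tools, hyperbolic space and the Fisher-Rao metric, without giving formulas. The sketch below illustrates the standard textbook definitions only and is not the authors' implementation. It assumes the Poincaré ball model with curvature -1 for hyperbolic space and categorical (discrete) distributions for the Fisher-Rao distance; the function names `poincare_distance` and `fisher_rao_distance` are hypothetical and chosen for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points of the Poincare ball (curvature -1).

    Assumption: the abstract only says "hyperbolic space"; the Poincare ball
    is one common model and is used here purely for illustration.
    """
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions.

    Uses the closed form 2 * arccos(sum_i sqrt(p_i * q_i)).
    Assumption: p and q are non-negative, sum to 1, and have equal length.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Bhattacharyya coefficient, clamped to [0, 1] against rounding error.
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    bc = min(1.0, max(0.0, bc))
    return 2.0 * math.acos(bc)
```

Under these assumptions, identical inputs give a distance of zero in both functions, and two categorical distributions with disjoint support sit at the maximum Fisher-Rao distance of pi.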
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Social Aspects of Generative AI
Relevance To Conference: First, exemplar-based image translation converts images from one modality to another under the guidance of exemplars, which facilitates cross-modal understanding and interaction in multimedia applications. Second, the proposed global-local contrastive learning aligns two cross-domain images in hyperbolic space and reduces the gap between their semantic correlation distributions using the Fisher-Rao metric. In addition, the proposed prior-guided diffusion model, trained with a novel prior denoising loss mathematically derived from the transitions modified by the structural prior, alleviates the impact of inaccurate correspondences on image translation. This work offers both theoretical value and practical experience for multimedia/multimodal processing.
Supplementary Material: zip
Submission Number: 944