DINO-ViT Enhanced Diffusion for Multi-exemplar-based Image Translation

Published: 01 Jan 2024 · Last Modified: 25 Oct 2024 · IJCNN 2024 · CC BY-SA 4.0
Abstract: We present a framework for multi-exemplar-based image translation. Most exemplar-based image translation methods allow only one target image for appearance transfer; these existing methods typically use GANs as generators and features extracted from pre-trained CNNs as visual descriptors. In contrast, our framework lets users provide one or more exemplar images to transfer the appearance of different objects while preserving the structure of the source image. GAN-based methods are typically limited to a specific domain, so we instead adopt a diffusion model as the generator, enabling image translation across arbitrary domains. DINO-ViT is a vision transformer trained with self-supervision, and its deep features carry richer semantic and visual information than CNN features; we therefore use deep features extracted from a pretrained DINO-ViT as visual descriptors to achieve more accurate image translation. Naively combining translation results from different exemplars can cause visual disharmony due to rough edges and differences in object appearance. To address this, we add a small amount of noise to the combined image to suppress these inconsistencies and then denoise it with the reverse diffusion process, obtaining an image that is nearly identical to the combined image but free of artifacts. Our framework offers higher-quality image translation and more flexible image editing, and we demonstrate its effectiveness and superiority on several image translation tasks.
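The abstract uses deep DINO-ViT features as visual descriptors. The sketch below shows one plausible way to extract per-patch features from the publicly released self-supervised DINO checkpoint (`facebookresearch/dino`, ViT-S/8 via `torch.hub`); the specific layer, token choice, and resolution are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: DINO-ViT patch-token features as visual descriptors.
# Assumptions: torch.hub DINO release, ViT-S/8 backbone, last-block tokens.
import torch
from torchvision import transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Self-supervised ViT-S/8 backbone released with DINO.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits8").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def dino_descriptors(image: Image.Image) -> torch.Tensor:
    """Return per-patch token features from the last transformer block."""
    x = preprocess(image).unsqueeze(0).to(device)
    # get_intermediate_layers returns the outputs of the last n blocks;
    # tokens[:, 0] is the [CLS] token, tokens[:, 1:] are the patch tokens.
    tokens = dino.get_intermediate_layers(x, n=1)[0]
    return tokens[:, 1:, :]  # shape (1, num_patches, embed_dim)
```

Such descriptors can then drive structure- and appearance-matching losses between the source, the exemplars, and the generated image; the exact loss formulation is defined in the paper, not here.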
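The harmonization step described in the abstract (adding a small amount of noise to the naively combined image and denoising with the reverse diffusion process) resembles SDEdit-style refinement. Below is a minimal sketch of that idea using Hugging Face `diffusers` with an unconditional DDPM checkpoint chosen purely for illustration; the paper's actual diffusion model, noise level, and schedule are assumptions here.

```python
# Sketch: SDEdit-style harmonization of a combined image.
# Assumptions: an off-the-shelf unconditional DDPM (google/ddpm-celebahq-256)
# stands in for the paper's diffusion model; `strength` is a hypothetical knob.
import torch
from diffusers import DDPMPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256").to(device)
unet, scheduler = pipe.unet, pipe.scheduler

@torch.no_grad()
def harmonize(combined: torch.Tensor, strength: float = 0.2) -> torch.Tensor:
    """combined: (1, 3, H, W) image in [-1, 1]; small strength keeps content."""
    scheduler.set_timesteps(num_inference_steps=1000)
    n_steps = max(1, int(strength * len(scheduler.timesteps)))
    timesteps = scheduler.timesteps[-n_steps:]      # e.g. 199, 198, ..., 0

    # Inject a small amount of noise (forward process up to an intermediate t).
    noise = torch.randn_like(combined)
    x = scheduler.add_noise(combined, noise, timesteps[:1])

    # Run the reverse process only from that intermediate step back to t = 0.
    for t in timesteps:
        eps = unet(x, t).sample
        x = scheduler.step(eps, t, x).prev_sample
    return x
```

Because the reverse process starts from a lightly noised version of the combined image rather than pure noise, it smooths seams and appearance mismatches between regions while leaving the overall content essentially unchanged, which matches the behavior the abstract describes.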