Keywords: versatile style transfer, content-style disentanglement
TL;DR: A general framework for image and text-guided style transfer
Abstract: Recent works in versatile style transfer have achieved impressive results in both content preservation and style fidelity. However, optimizing models solely with content and style losses often fails to match the real image distribution, leading to suboptimal stylization quality. In this paper, we propose a novel self-supervised framework, VST-SD, which disentangles content and style representations to enhance stylization performance. Specifically, we separate content and style from the input and train the model to reconstruct the original image. To facilitate effective disentanglement, we leverage feature statistics: a content encoder is designed with perturbation and compression to remove style-related statistics, while a style encoder employs magnitude preservation to capture style-specific information. A cascade of diffusion models is introduced to integrate content and style into new images. To support multi-modal capabilities in versatile style transfer, we construct a paired text-style dataset and design a pipeline enabling flexible, text-guided stylization. Experimental results across artistic, photorealistic, and text-guided stylization demonstrate the effectiveness and versatility of our approach.
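To make the statistics-based disentanglement concrete, below is a minimal, hypothetical sketch (not the paper's released code) of one common way to split a feature map into a content part stripped of style statistics and a style part that keeps only channel-wise statistics; the function names `remove_style_statistics` and `extract_style_statistics` are illustrative assumptions.

```python
# Illustrative sketch only: content/style split via per-channel feature
# statistics, in the spirit of removing style-related statistics for content
# and preserving magnitude information for style. Not the paper's actual code.
import torch


def remove_style_statistics(feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Strip per-channel mean/std from a (N, C, H, W) feature map.

    The instance-normalized result keeps spatial structure (content)
    while discarding first- and second-order channel statistics.
    """
    mean = feat.mean(dim=(2, 3), keepdim=True)
    std = feat.std(dim=(2, 3), keepdim=True) + eps
    return (feat - mean) / std


def extract_style_statistics(feat: torch.Tensor) -> torch.Tensor:
    """Summarize a (N, C, H, W) feature map by its channel-wise mean and std."""
    mean = feat.mean(dim=(2, 3))
    std = feat.std(dim=(2, 3))
    return torch.cat([mean, std], dim=1)  # (N, 2C) statistics-only style code


# Usage: the content branch keeps the normalized map, the style branch keeps
# only the statistics; a generator would then be trained to reconstruct the
# original image from the pair (self-supervised reconstruction).
feat = torch.randn(2, 64, 32, 32)
content_code = remove_style_statistics(feat)   # shape (2, 64, 32, 32)
style_code = extract_style_statistics(feat)    # shape (2, 128)
```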
Primary Area: generative models
Submission Number: 5162