Break Stylistic Sophon: Are We Really Meant to Confine the Imagination in Style Transfer?

02 Sept 2025 (modified: 21 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Diffusion Model
Abstract: In image style transfer, existing algorithms that rely on a single reference style image face formidable challenges, including severe semantic drift, overfitting, color limitations, and the lack of a unified framework. These issues impede the generation of high-quality, diverse, and semantically accurate images. In this study, we introduce StyleWallfacer, an innovative unified training and inference framework that addresses these issues and unifies the treatment of different style transfer tasks, enabling both high-quality style transfer and text-driven stylization. First, we propose a semantic-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. By leveraging a large language model to remove style-related terms from these descriptions, we create a semantic gap, which is then used to fine-tune the model and enables efficient, drift-free injection of style knowledge. Second, we propose a data augmentation strategy based on human feedback, incorporating high-quality samples generated early in the fine-tuning process into the training set to facilitate progressive learning and significantly reduce overfitting. Finally, we design a training-free triple diffusion process using the fine-tuned model, which manipulates the features of the self-attention layers in a manner similar to the cross-attention mechanism. Specifically, during generation, the key and value of the content-related process are replaced with those of the style-related process to inject style while maintaining text control over the model, and query preservation is introduced to mitigate disruptions to the original content. Under this design, we achieve high-quality image-driven style transfer and text-driven stylization while preserving the original image content. Moreover, we achieve image color editing during the style transfer process for the first time, further pushing the boundaries of controllable image generation and editing and breaking the limitations imposed by reference images on style transfer. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods.
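To make the described attention manipulation concrete, the following is a minimal, self-contained sketch (not the authors' implementation) of a self-attention layer whose keys and values are taken from the style-related diffusion pass while the query from the content-related pass is preserved. All names and shapes (e.g., the blend weight `query_preserve`, the optional `q_style` input) are illustrative assumptions.

```python
# Sketch of self-attention feature swapping for style injection, assuming
# U-Net self-attention features of shape (B, tokens, channels). The keys and
# values come from the style pass (cross-attention-like behaviour), while the
# content query is preserved to protect the source layout and semantics.
import torch

def stylized_self_attention(q_content, k_content, v_content,
                            k_style, v_style, q_style=None,
                            query_preserve: float = 1.0,
                            num_heads: int = 8):
    """q_content/k_content/v_content: (B, N, C) features from the content pass.
    k_style/v_style (and optional q_style): (B, M, C) features from the style pass.
    query_preserve=1.0 keeps the content query untouched (query preservation);
    smaller values blend in the style query, an illustrative assumption here."""
    B, N, C = q_content.shape
    head_dim = C // num_heads

    def split_heads(x):
        # (B, T, C) -> (B, heads, T, head_dim)
        return x.view(B, -1, num_heads, head_dim).transpose(1, 2)

    # Query preservation: retain (or mostly retain) the content query.
    q_alt = q_style if q_style is not None else q_content
    q = split_heads(query_preserve * q_content + (1.0 - query_preserve) * q_alt)

    # Style injection: keys and values are replaced by the style-pass features.
    k, v = split_heads(k_style), split_heads(v_style)

    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(B, N, C)
    return out

# Toy usage with random tensors standing in for attention-layer activations.
if __name__ == "__main__":
    B, N, M, C = 1, 64, 64, 320
    qc, kc, vc = (torch.randn(B, N, C) for _ in range(3))
    ks, vs = (torch.randn(B, M, C) for _ in range(2))
    print(stylized_self_attention(qc, kc, vc, ks, vs).shape)  # torch.Size([1, 64, 320])
```

In this sketch, replacing K/V while keeping Q is what lets the attended features carry the reference style without rewriting where the content image attends, which is the intuition behind the query-preservation step in the abstract.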
Primary Area: generative models
Submission Number: 785