Abstract: Arbitrary style transfer (AST) plays a pivotal role in image processing, as it can impart the stylistic characteristics of a reference image onto a chosen content image. However, existing AST methods based on convolutional neural networks (CNNs) often suffer from loss of content detail and distortion of content structure. Methods based on vision transformers (ViTs) address these issues, but they often struggle to extract local features effectively and suffer from low efficiency. To overcome these challenges, we propose a lightweight ViT for AST (LVAST). The network relies primarily on local representation blocks to extract features and uses separable self-attention in global representation blocks to model global information. Furthermore, we propose a content semantic contrastive loss function, which significantly enhances content consistency between the content image and the stylized image. Extensive experimental results demonstrate that the proposed LVAST outperforms CNN-based methods in visual quality and achieves 2–3 times faster inference than ViT-based methods, while still producing visually comparable results.
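The abstract names separable self-attention as the mechanism the global representation blocks use to model global information. As a rough illustration only, below is a minimal PyTorch sketch of separable self-attention in the MobileViTv2 style, which is linear rather than quadratic in the number of tokens; the module name, dimensions, and formulation here are assumptions for exposition, and LVAST's exact variant may differ.

```python
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    """Hypothetical sketch of separable self-attention (MobileViTv2-style).

    Replaces the O(N^2) pairwise attention of standard transformers with
    an O(N) scheme: a single learned context vector summarizes all tokens,
    then is broadcast back to every token.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)   # per-token context scores
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d) token sequence
        scores = self.to_scores(x).softmax(dim=1)                       # (B, N, 1)
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)   # (B, 1, d)
        out = torch.relu(self.to_value(x)) * context                   # broadcast global context to each token
        return self.proj(out)                                           # (B, N, d)

# Usage: a batch of 14x14 patch tokens with a hypothetical embedding size of 256
attn = SeparableSelfAttention(dim=256)
tokens = torch.randn(2, 196, 256)
print(attn(tokens).shape)  # torch.Size([2, 196, 256])
```

The cost savings come from never forming an N-by-N attention map: the context vector is a weighted sum over tokens, so memory and compute grow linearly with the token count, which is consistent with the efficiency claim in the abstract.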