Abstract: Image style transfer aims to blend the content of one image with the style of another. Because traditional Convolutional Neural Network (CNN) methods are limited in capturing global information, researchers have turned to vision transformers for image style transfer, aiming to achieve a broader receptive field. However, the heavy computation of vision transformers leads to longer inference times than many conventional CNN-based methods, which constrains their practical application. To tackle this challenge, we propose an axial attention transformer encoder named AATE and design a fast vision-transformer image style transfer model. Our model comprises two AATEs that separately process the content and style images, followed by a transformer decoder that fuses style and content. In addition, we use a downsampling module to preprocess the network inputs and an upsampling module to refine the output. As a result, our model performs inference about five times faster than current state-of-the-art (SOTA) transformer-based methods while maintaining excellent image quality. Qualitative and quantitative experiments demonstrate results competitive with advanced methods.
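To illustrate the core idea behind axial attention, the minimal PyTorch sketch below factorizes 2D self-attention over an H x W feature map into a height-axis pass followed by a width-axis pass, reducing the cost from O((HW)^2) to O(HW(H + W)); this is the general mechanism, not the paper's implementation, and all module and variable names here are illustrative assumptions.

```python
# Minimal axial self-attention sketch (illustrative, not the authors' AATE code).
import torch
import torch.nn as nn


class AxialAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        # One multi-head attention layer reused for the height pass and the width pass.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def _attend(self, x):
        # x: (batch, sequence, dim) -> pre-norm self-attention with a residual connection.
        y = self.norm(x)
        out, _ = self.attn(y, y, y)
        return x + out

    def forward(self, x):
        # x: (B, C, H, W) feature map.
        b, c, h, w = x.shape
        # Height axis: treat each column as an independent sequence of length H.
        x = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        x = self._attend(x)
        x = x.reshape(b, w, h, c)
        # Width axis: treat each row as an independent sequence of length W.
        x = x.permute(0, 2, 1, 3).reshape(b * h, w, c)
        x = self._attend(x)
        return x.reshape(b, h, w, c).permute(0, 3, 1, 2)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)          # e.g., downsampled content or style features
    print(AxialAttention(dim=64)(feats).shape)  # torch.Size([2, 64, 32, 32])
```

Attending along one spatial axis at a time keeps each attention sequence short (length H or W instead of HW), which is the source of the speed advantage claimed for the axial encoder.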