Abstract: Image style transfer aims to blend the content of one image with the style of another. Because traditional Convolutional Neural Network (CNN) methods are limited in capturing global information, researchers have turned to vision transformers for image style transfer, aiming to achieve a broader receptive field. However, the heavy computation of vision transformers leads to longer inference times than many conventional CNN-based methods, which constrains their practical application. To tackle this challenge, we propose an axial attention transformer encoder named AATE and design a fast vision-transformer image style transfer model. Our model comprises two AATEs that separately process the content and style images, followed by a transformer decoder that fuses style and content. In addition, we use a downsampling module to preprocess the network inputs and an upsampling module to refine the output. As a result, our model performs inference about five times faster than current state-of-the-art (SOTA) transformer-based methods while maintaining excellent image quality. Qualitative and quantitative experiments demonstrate results competitive with advanced methods.
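To illustrate the core idea behind axial attention, the minimal PyTorch sketch below factorizes 2D self-attention over an H x W feature map into a height-axis pass followed by a width-axis pass, reducing the cost from O((HW)^2) to O(HW(H + W)); this is the general mechanism, not the paper's implementation, and all module and variable names here are illustrative assumptions.

```python
# Minimal axial self-attention sketch (illustrative, not the authors' AATE code).
import torch
import torch.nn as nn


class AxialAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        # One multi-head attention layer reused for the height pass and the width pass.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def _attend(self, x):
        # x: (batch, sequence, dim) -> pre-norm self-attention with a residual connection.
        y = self.norm(x)
        out, _ = self.attn(y, y, y)
        return x + out

    def forward(self, x):
        # x: (B, C, H, W) feature map.
        b, c, h, w = x.shape
        # Height axis: treat each column as an independent sequence of length H.
        x = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        x = self._attend(x)
        x = x.reshape(b, w, h, c)
        # Width axis: treat each row as an independent sequence of length W.
        x = x.permute(0, 2, 1, 3).reshape(b * h, w, c)
        x = self._attend(x)
        return x.reshape(b, h, w, c).permute(0, 3, 1, 2)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)          # e.g., downsampled content or style features
    print(AxialAttention(dim=64)(feats).shape)  # torch.Size([2, 64, 32, 32])
```

Attending along one spatial axis at a time keeps each attention sequence short (length H or W instead of HW), which is the source of the speed advantage claimed for the axial encoder.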