PCSViT: Efficient and hardware-friendly Pyramid Vision Transformer with channel and spatial self-attentions
Abstract: Vision Transformers (ViTs) have been widely used in various visual tasks and have achieved great success thanks to their self-attention mechanism. However, most ViT models focus primarily on spatial self-attention and often overlook the importance of channel attention. In this paper, we propose a channel self-attention module that complements the standard self-attention module in ViTs. We then introduce an adaptive feed-forward network tailored to the different attention modules. Based on the proposed self-attention module and adaptive feed-forward network, we present a flexible Vision Transformer with channel and spatial attentions (CSViT) and conduct a series of experiments to determine the optimal placement of the different attention modules. Additionally, we introduce PCSViT, which combines the strengths of CSViT and convolutional neural networks (CNNs): it features a pyramid architecture and incorporates local spatial attention, global spatial attention, and channel attention. We further explore hardware-friendly designs to efficiently implement and accelerate PCSViT on embedded devices. The proposed methods are evaluated on the small datasets CIFAR and Fashion-MNIST as well as on the larger ImageNet dataset. Experimental results show that the proposed model reduces ViTs' reliance on large datasets and outperforms several lightweight state-of-the-art CNN and ViT models across a range of model sizes. The hardware-friendly designs achieve about 10% acceleration on a RISC-V CPU.
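As an illustrative aside (not taken from the paper's code), the sketch below shows the general idea behind channel self-attention as described in the abstract: queries, keys, and values are transposed so that attention is computed over the channel dimension, producing a C x C attention map rather than the usual N x N token-to-token map. The class name ChannelSelfAttention, the single-head formulation, and the 1/sqrt(N) scaling are assumptions made for this example only, not details of PCSViT.

    # Minimal sketch of channel self-attention, assuming a PyTorch-style module.
    import torch
    import torch.nn as nn

    class ChannelSelfAttention(nn.Module):
        """Attention over channels: affinities are C x C instead of N x N."""

        def __init__(self, dim: int):
            super().__init__()
            self.qkv = nn.Linear(dim, dim * 3, bias=False)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, N, C) -- batch, tokens, channels
            B, N, C = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)            # each (B, N, C)
            q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # each (B, C, N)
            attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)    # (B, C, C) channel affinities
            attn = attn.softmax(dim=-1)
            out = (attn @ v).transpose(1, 2)                  # back to (B, N, C)
            return self.proj(out)

    # Usage example: 14x14 = 196 tokens with 64 channels.
    x = torch.randn(2, 196, 64)
    print(ChannelSelfAttention(64)(x).shape)  # torch.Size([2, 196, 64])

Because the attention map scales with the number of channels rather than the number of tokens, this kind of module can complement spatial self-attention at modest extra cost, which is consistent with the abstract's motivation for adding channel attention to ViTs.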