ViT Graph Head Attention for Small Sized Datasets
Keywords: Vision Transformer, Graph Neural Network, Vision Graph, Small data efficient
TL;DR: We propose an efficient vision transformer for small- and medium-sized datasets.
Abstract: In this paper, we propose a new type of vision transformer (ViT) based on graph head attention (GHA). The GHA builds a graph structure from the attention map generated over the input patches. Because the attention map represents the degree of focus between image patches, it can be regarded as a set of relationships between patches, which can be converted into a graph structure. To match the performance of multi-head attention (MHA) with fewer GHA heads, we apply a graph attention network to the GHA, which ensures attention diversity and emphasizes the correlations between graph nodes. The proposed GHA preserves both the locality and globality of the input patches while guaranteeing attention diversity. The proposed GHA-ViT consistently outperforms pure ViT-based models on small-sized datasets and on the medium-sized ImageNet-1K dataset when trained from scratch. A top-1 accuracy of 81.7\% was achieved on ImageNet-1K with GHA-B, a base model with approximately 29M parameters.
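The core idea in the abstract, reading a self-attention map as a graph over patches and then aggregating features along its edges, can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' implementation: the function name `attention_map_as_graph`, the top-k sparsification, and the single-head GAT-style renormalization are all illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_map_as_graph(patches, d_k=16, top_k=3, seed=0):
    """Toy sketch (not the paper's GHA): treat an attention map over
    patches as a graph. Each patch is a node; an edge i->j is kept
    when j is among the top-k most-attended patches of i."""
    rng = np.random.default_rng(seed)
    n, d = patches.shape
    # Random query/key projections stand in for learned weights.
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    A = softmax((patches @ Wq) @ (patches @ Wk).T / np.sqrt(d_k))
    # Sparsify the dense attention map into an adjacency matrix.
    adj = np.zeros_like(A)
    idx = np.argsort(-A, axis=1)[:, :top_k]
    np.put_along_axis(adj, idx, 1.0, axis=1)
    # GAT-style update: renormalize attention over the kept edges only,
    # then aggregate neighbor features with those edge weights.
    masked = np.where(adj > 0, A, -np.inf)
    alpha = softmax(masked, axis=1)
    return alpha @ patches

patches = np.random.default_rng(1).standard_normal((8, 32))  # 8 patches
out = attention_map_as_graph(patches)
print(out.shape)  # (8, 32): one aggregated feature vector per patch
```

The sparsification step is what turns the dense all-to-all attention map into an explicit graph; the renormalized weights then play the role of graph-attention coefficients on that graph.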
Submission Number: 9