This blog post is based on the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy, Alexey, et al., ICLR 2021.
In the last 10 years there has been significant development in computer vision following the rise of convolutional neural networks (ConvNets). The development of convolutional neural networks dates back to the 1980s, when the idea of connectionism, or parallel distributed processing (Rumelhart et al., 1986e; McClelland et al., 1995), came into view during the second wave of neural network research. Many ideas were revived from the work of psychologist Donald Hebb (Hebb, 1949). The main point of connectionism is that many simple computing units together can achieve intelligent network behavior, while an individual unit or small set of units is of little use. Thanks to connectionism, several concepts came to light during the 1980s that remain central in today’s deep learning, such as distributed representation and backpropagation to train deep neural networks.1
After transformers were introduced in NLP in 2017, replacing recurrent neural networks, many architectures involving self-attention (the transformer model) have made progress in areas such as time series forecasting4, graph-based models3 and visual recognition systems2. The figure below, from “Attention, Please! A Survey of Neural Attention Models in Deep Learning”, represents the development of attention in deep learning.
Figure 1: Timeline of work related to attention in Deep learning.
Convolutional networks have dominated computer vision tasks, but after the success of transformer models in NLP there has been research on combining self-attention with CNNs. The graph below shows work related to transformers in vision models.
Figure 2: Related work on transformers in vision models.
The authors of ViT follow the original transformer architecture (Vaswani et al., 2017) for image recognition: the image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard transformer encoder.
Figure 3: Model architecture.
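To make the patch-embedding step in Figure 3 concrete, here is a minimal PyTorch sketch (my own illustration, not the authors’ code); the class and variable names are made up for this post.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, D) sequence of patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the learnable [class] token
        return x + self.pos_embed              # add learned position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```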
As seen in Figure 3 above, the transformer encoder consists of a multi-head attention layer and an MLP layer, one after the other. Layer normalization is applied before each of these layers (with residual connections around them), since it reduces the number of steps gradient descent needs to optimize the network and keeps the scale of the outputs consistent.
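A single encoder block with this pre-norm arrangement can be sketched as follows (again an illustrative PyTorch version, not the reference implementation):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder block: LayerNorm is applied *before* attention and MLP,
    with residual connections around both sub-layers."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                                   # x: (B, N, D)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        x = x + self.mlp(self.norm2(x))                      # MLP + residual
        return x

y = EncoderBlock()(torch.randn(2, 197, 768))
print(y.shape)  # torch.Size([2, 197, 768])
```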
ViT has significantly less image-specific inductive bias compared to convolutional neural networks. In ViT, only the MLP layers are local and translationally equivariant, whereas the self-attention layers are global. The positional encoding is one place where a little 2D inductive bias is injected: the paper pre-trains at 224x224 resolution and fine-tunes at a higher resolution (384x384). Since the patch size stays at 16x16, the number of patches grows and the sequence becomes longer, so the pre-trained positional encodings no longer match. They are therefore 2D-interpolated according to their location in the original image, and this is where the inductive bias is introduced.
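A sketch of what that 2D interpolation of position embeddings could look like, assuming 16x16 patches and a move from 224x224 to 384x384 input (so the patch grid grows from 14x14 to 24x24); the function below is my own illustration:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """Interpolate learned position embeddings to a larger patch grid.

    pos_embed: (1, 1 + old_grid**2, D) -- [class] token embedding plus patch embeddings.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    # Reshape the flat sequence back to its 2D layout and interpolate bicubically.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

pe_224 = torch.randn(1, 1 + 14 * 14, 768)   # pre-trained at 224x224 with 16x16 patches
pe_384 = resize_pos_embed(pe_224)           # (1, 1 + 24*24, 768) for 384x384 input
print(pe_384.shape)
```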
There are three variants of ViT: ViT-Base, ViT-Large and ViT-Huge. In the headline results, ViT-Large uses 16x16 patches and ViT-Huge uses 14x14 patches (the notation ViT-L/16 means the Large variant with a 16x16 input patch size).
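For reference, the three variants differ mainly in depth, width, MLP size and number of attention heads; a small sketch of those settings (the values follow the paper’s model table, the dictionary layout is mine):

```python
# ViT model configurations (layers, hidden size, MLP size, attention heads).
VIT_CONFIGS = {
    "ViT-Base":  {"layers": 12, "hidden_dim": 768,  "mlp_dim": 3072, "heads": 12},
    "ViT-Large": {"layers": 24, "hidden_dim": 1024, "mlp_dim": 4096, "heads": 16},
    "ViT-Huge":  {"layers": 32, "hidden_dim": 1280, "mlp_dim": 5120, "heads": 16},
}
```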
The baseline model considered is a modified ResNet (BiT), obtained by replacing the Batch Normalization layers with Group Normalization and by using standardized convolutions.
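A sketch of what those two modifications look like in PyTorch (illustrative only, not BiT’s actual code): the convolution standardizes its own filters, and GroupNorm takes the place of BatchNorm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StdConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization: each filter is normalized to
    zero mean and unit variance before being applied."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        var = w.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        w = (w - mean) / torch.sqrt(var + 1e-5)
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# A BiT-style unit: standardized convolution followed by GroupNorm instead of BatchNorm.
block = nn.Sequential(
    StdConv2d(64, 128, kernel_size=3, padding=1, bias=False),
    nn.GroupNorm(num_groups=32, num_channels=128),
    nn.ReLU(inplace=True),
)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 56, 56])
```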
The figure above compares the performance of the ViT variants with state-of-the-art models. The ViT results shown here were obtained after pre-training on the JFT-300M dataset, and ViT outperforms all the state-of-the-art models while requiring fewer computational resources for pre-training.
The VTAB benchmark (1,000 training examples per task) breaks results down into three task groups: Natural (e.g., CIFAR), Specialized (medical and satellite images) and Structured (tasks that require geometric understanding), and ViT outperforms the previous SOTA.
The Vision Transformer is trained on datasets of increasing size: ImageNet, ImageNet-21k and JFT-300M. As shown in the figure above, when all the model variants are pre-trained on ImageNet, ViT-Large underperforms, and with ImageNet-21k the performance of ViT-Large is similar to BiT (ResNet). But with JFT-300M, the ViT-Large model performs better. In other words, the BiT CNNs outperform ViT on ImageNet, but with the larger datasets ViT overtakes them. The authors also ran an experiment training the models on random subsets of 9M, 30M and 90M examples, and finally on the full JFT-300M.
The first layer of the Vision Transformer linearly projects each flattened patch into a lower-dimensional space. Figure 4 (left) shows the top principal components of the learned embedding filters.
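Something like Figure 4 (left) can be reproduced by taking the weights of that linear projection and computing their principal components. A rough standalone sketch, using ViT-L/32-like shapes and a random weight tensor as a stand-in for a trained model:

```python
import torch

# Stand-in for the trained patch-projection weights of a ViT-L/32-style model:
# Conv2d(3, 1024, kernel_size=32, stride=32) -> weight shape (1024, 3, 32, 32).
proj_weight = torch.randn(1024, 3, 32, 32)
filters = proj_weight.reshape(proj_weight.shape[0], -1)   # (1024, 3*32*32)

# Top principal components of the embedding filters
# (torch.pca_lowrank centers the data by default).
_, _, v = torch.pca_lowrank(filters, q=28)
components = v.T.reshape(28, 3, 32, 32)   # each component reshaped back to an RGB patch
print(components.shape)                   # torch.Size([28, 3, 32, 32])
```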
Figure 4 (centre) shows that the model learns to encode distance within the image in the similarity of position embeddings: closer patches tend to have more similar position embeddings.
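Figure 4 (centre) can be approximated by computing the cosine similarity between each patch’s position embedding and every other patch’s; a sketch (with random embeddings standing in for a trained model’s):

```python
import torch
import torch.nn.functional as F

pos_embed = torch.randn(1, 1 + 14 * 14, 768)     # stand-in for trained position embeddings
patch_pe = pos_embed[0, 1:]                      # drop the [class] token: (196, 768)

# Cosine similarity of every patch's position embedding with every other patch's.
sim = F.cosine_similarity(patch_pe.unsqueeze(1), patch_pe.unsqueeze(0), dim=-1)
sim_grid = sim.reshape(14, 14, 14, 14)           # indexed as (row, col, row, col)
print(sim_grid[0, 0].shape)                      # similarity map for the top-left patch
```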
Self-attention allows ViT to integrate information across the entire image even in the lowest layers. Figure 4 (right) shows the “attention distance”, which is analogous to the receptive field size in CNNs. The plot is for the ViT-L/32 variant, which has 24 layers with 16 attention heads each. Some attention heads in layer 0 already attend globally, while others attend only locally.
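“Attention distance” is the average spatial distance between a query patch and the patches it attends to, weighted by the attention weights. A rough sketch of how it might be computed for a single head (the attention matrix here is a random stand-in):

```python
import torch

def mean_attention_distance(attn, grid=14, patch_size=16):
    """attn: (N, N) attention weights over patch tokens (rows sum to 1).
    Returns the attention-weighted average pixel distance for each query patch."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch_size
    dist = torch.cdist(coords, coords)                 # pairwise pixel distances, (N, N)
    return (attn * dist).sum(dim=-1)                   # (N,)

attn = torch.softmax(torch.randn(196, 196), dim=-1)   # stand-in for one head's weights
print(mean_attention_distance(attn).mean())           # averaged over query positions
```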