[Re]: On the Relationship between Self-Attention and Convolutional Layers

31 Jan 2021 (modified: 05 May 2023) · ML Reproducibility Challenge 2020 Blind Submission · Readers: Everyone
Keywords: Self Attention, Transformers, Image Classification
Abstract: \subsection*{Scope of Reproducibility} In this report, we perform a detailed study of the paper "On the Relationship between Self-Attention and Convolutional Layers" \cite{attncnn}, which provides theoretical and experimental evidence that self-attention layers can behave like convolutional layers. The proposed method does not obtain state-of-the-art performance but rather answers an interesting question: \textit{do self-attention layers process images in a manner similar to convolutional layers?} This question has inspired recent works such as \cite{zhao2020exploring, dosovitskiy2020image}, which propose fully-attentional models for image recognition. We focus on experimentally validating the claims of the original paper, highlight differences with other similar works, and propose a new variant of the attention operation, {\em Hierarchical Attention}, which shows improved performance with significantly fewer parameters. To facilitate further study, all the code used in our experiments is publicly available here\footnote{Code available in the supplementary material during the review period.}.

\subsection*{Methodology} We implement the original paper \cite{attncnn} from scratch in PyTorch and refer to the authors' source code\footnote{\url{https://github.com/epfml/attention-cnn}} for verification. In our experiments involving SAN \cite{zhao2020exploring}, we use the official implementation\footnote{\url{https://github.com/hszhao/SAN}} because of its faster CUDA kernels, while we implement ViT \cite{dosovitskiy2020image} from scratch, referring to the authors' source code\footnote{\url{https://github.com/google-research/vision_transformer}}. We then incorporate our proposed hierarchical operation into all three methods for comparison. For all the experiments reported here, we benchmark image-classification performance on the CIFAR10 dataset. Each training run required around 20 hours, while the corresponding hierarchical versions took around 10-12 hours to converge on an Nvidia RTX 2060 GPU.

\subsection*{Results} We were able to reproduce all the results from the paper to within 1\% of the reported values, thereby validating the claims of the original paper. However, there appear to be some differences in the attention figures, which lead to interesting insights and to the proposed Hierarchical Attention. For ViT and SAN, we do not have a comparative baseline, as the corresponding papers do not evaluate performance on the CIFAR10 dataset without pre-training.

\subsection*{What was easy} We did not face any major challenges in reproducing the results in the paper.

\subsection*{What was difficult} Most of the code in the official implementation appears to be adapted from HuggingFace's repository\footnote{\url{https://github.com/huggingface/transformers}}, which brings along a lot of code that is unnecessary for this paper and makes the implementation difficult to read and understand quickly. Further, the training time for each run is substantial, which made it difficult for us to experiment with multiple datasets and hyperparameter settings.

\subsection*{Communication with original authors} We contacted the authors regarding the differences in the attention figures, since the code for generating them was not available in the repository for verification, but we have not received a response.
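For illustration, the sketch below shows the kind of layer whose behaviour the reproduced paper analyses: 2D multi-head self-attention whose scores come only from a quadratic relative positional encoding, so that each head can learn to attend to a single shifted pixel and thereby act like one tap of a convolutional kernel. This is our own minimal PyTorch sketch, not the authors' implementation; the class and argument names (QuadraticPositionalSelfAttention2d, num_heads, dim_head) are hypothetical.

\begin{verbatim}
import torch
import torch.nn as nn


def relative_positions(height, width):
    """All pairwise 2D offsets between the pixels of a height x width grid."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(height), torch.arange(width), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2).float()            # (N, 2) with N = H*W
    return coords[:, None, :] - coords[None, :, :]    # (N, N, 2)


class QuadraticPositionalSelfAttention2d(nn.Module):
    """Purely positional multi-head self-attention (hypothetical sketch).
    The score between a query pixel and a key pixel depends only on their
    offset delta: score_h(delta) = -alpha_h * ||delta - center_h||^2, so a
    head with a large alpha_h attends to one shifted pixel, mimicking one
    tap of a convolutional kernel."""

    def __init__(self, dim, num_heads=9, dim_head=32):
        super().__init__()
        self.num_heads = num_heads
        self.centers = nn.Parameter(torch.randn(num_heads, 2))  # learned shift per head
        self.alphas = nn.Parameter(torch.ones(num_heads))       # attention width (unconstrained for brevity)
        self.to_v = nn.Linear(dim, num_heads * dim_head)
        self.to_out = nn.Linear(num_heads * dim_head, dim)

    def forward(self, x):                                  # x: (batch, H, W, dim)
        b, h, w, d = x.shape
        rel = relative_positions(h, w).to(x.device)        # (N, N, 2)
        diff = rel[None] - self.centers[:, None, None, :]  # (heads, N, N, 2)
        scores = -self.alphas[:, None, None] * (diff ** 2).sum(-1)
        attn = scores.softmax(dim=-1)                      # one attention map per head

        v = self.to_v(x.reshape(b, h * w, d))              # (b, N, heads*dim_head)
        v = v.reshape(b, h * w, self.num_heads, -1).permute(0, 2, 1, 3)
        out = torch.einsum("hnm,bhmd->bhnd", attn, v)      # aggregate values per head
        out = out.permute(0, 2, 1, 3).reshape(b, h * w, -1)
        return self.to_out(out).reshape(b, h, w, d)


# Usage: a 9-head layer on an 8x8 feature map of width 64.
layer = QuadraticPositionalSelfAttention2d(dim=64, num_heads=9)
y = layer(torch.randn(2, 8, 8, 64))                        # -> (2, 8, 8, 64)
\end{verbatim}

With 9 heads, such a layer can in principle recover a 3x3 convolution, which is the correspondence the original paper formalizes; the full content-plus-position attention variants are analysed analogously in \cite{attncnn}.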
Paper Url: https://openreview.net/pdf?id=HJlnC1rKPB
Supplementary Material: zip
