Abstract: The Transformer is a versatile architecture used in fields such as computer vision, natural language processing, and audio signal processing. However, its data-hungry nature and the quadratic complexity of the Self-Attention (SA) layer make it slow and costly to train. Solutions such as self-attention alternatives and distillation aim to reduce the computational complexity of training Transformers, but convergence remains costly and challenging. As shown by Liu et al. (2023), the amplification effect makes it hard for a Transformer using the traditional SA mechanism to find good attention maps quickly. Therefore, in this work, we propose GauTransformer, a dual-phase learning model. In the first phase, it uses a stochastic approach that samples key tokens from learned Gaussian distributions to find optimal attention maps faster than the standard Self-Attention layer. In the second phase, it accelerates Transformer convergence through distillation. We demonstrate that the GauAttention module is a powerful mechanism for achieving competitive performance while decreasing the computational cost of training. Furthermore, in many settings, we empirically show that GauAttention can reduce training to half the number of steps required by the traditional way of training Transformer architectures.
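The abstract describes GauAttention only at a high level. Purely as an illustration of the general idea, below is a minimal PyTorch sketch of an attention head whose weights come from a learnable Gaussian over relative key positions rather than a QK^T softmax. This is not the authors' implementation: the class name, the density-based weighting (the paper's method stochastically samples key tokens, which is simplified away here), and all shapes and hyperparameters are assumptions made for readability.

```python
import torch
import torch.nn as nn


class GaussianAttentionSketch(nn.Module):
    """Hypothetical sketch: head h attends to keys near the query position,
    weighted by a learnable Gaussian N(mu_h, sigma_h) over relative offsets."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.mu = nn.Parameter(torch.zeros(num_heads))         # mean relative offset per head
        self.log_sigma = nn.Parameter(torch.zeros(num_heads))  # log std dev per head
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, n, d = x.shape
        v = self.v_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        # Relative offsets between every query/key pair, on a normalized [0, 1] axis.
        pos = torch.linspace(0.0, 1.0, n, device=x.device)
        rel = pos[None, :] - pos[:, None]                       # (n, n)
        sigma = self.log_sigma.exp().clamp_min(1e-3)            # (heads,)

        # Gaussian log-density of each key offset under each head's distribution,
        # used directly as attention logits (no query/key dot products).
        logits = -0.5 * ((rel[None] - self.mu[:, None, None]) / sigma[:, None, None]) ** 2
        attn = torch.softmax(logits, dim=-1)                    # (heads, n, n)

        out = torch.einsum("hqk,bhkd->bhqd", attn, v)           # (b, heads, n, head_dim)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.out_proj(out)
```

Because the attention map depends only on the learned per-head (mu, sigma) and not on token content, it is cheap to compute and fast to fit, which is one plausible reading of why such a phase-one module could reach useful attention patterns sooner than standard SA before handing off to distillation in phase two.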
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: 1. Better motivation behind the idea of Gaussian Attention
2. Improvements to Figure 1
3. More robust experimentation process on CIFAR-10
4. Better results on ImageNet
5. Addition of a new baseline
6. Addition of two prior works to the Related Works section
7. Benchmark of training time
Assigned Action Editor: ~Mingsheng_Long2
Submission Number: 3198