Surely but quickly: Quick convergence using Self Sampling Attention

TMLR Paper3198 Authors

16 Aug 2024 (modified: 30 Oct 2024) · Under review for TMLR · CC BY 4.0
Abstract: The Transformer is a versatile architecture used in fields such as computer vision, natural language processing, and audio signal processing. However, its data-hungry nature and the quadratic complexity of the Self-Attention (SA) layer make it slow and costly to train. Solutions such as self-attention alternatives and distillation aim to reduce the computational complexity of training Transformers, but convergence remains costly and challenging. As shown by Liu et al. (2023), the amplification effect makes it hard for a Transformer using the traditional SA mechanism to find suitable attention maps quickly. In this work, we therefore propose GauTransformer, a dual-phase learning model. In the first phase, it uses a stochastic approach, sampling key tokens from learned Gaussian distributions, to find optimal attention maps faster than the standard Self-Attention layer. In the second phase, it accelerates Transformer convergence through distillation. We demonstrate that the GauAttention module can be a powerful mechanism for achieving competitive performance while decreasing the computational cost of training. Furthermore, in many settings, we empirically show that GauAttention can reduce training time to half the number of steps required by the traditional method of training Transformer architectures.
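To make the core idea concrete, the sketch below illustrates one plausible reading of the abstract: each attention head samples a small set of key positions from a learned Gaussian over positions and attends only to those sampled keys, instead of computing the full quadratic attention. This is a minimal, hypothetical sketch, not the authors' GauAttention module: the class name `GauAttentionSketch`, the per-head mean/variance parameterization, and the `num_samples` budget are all assumptions made for illustration.

```python
# Minimal sketch (assumptions throughout): per-head Gaussians over key positions,
# a fixed sampling budget per query, and hard index rounding. The authors'
# GauAttention may differ in all of these choices.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GauAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=4, num_samples=16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.num_samples = num_samples          # keys sampled per query (assumption)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learned per-head Gaussian parameters over key positions (assumption).
        self.mu = nn.Parameter(torch.zeros(num_heads))          # mean offset
        self.log_sigma = nn.Parameter(torch.zeros(num_heads))   # log std

    def forward(self, x):
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                    # each: (B, H, N, d)

        # Sample key positions around each query from the learned Gaussians
        # (reparameterized draw; the hard rounding below blocks gradients to
        # mu/log_sigma, so a relaxation or score-function estimator would be
        # needed in practice -- omitted here for brevity).
        pos = torch.arange(N, device=x.device).float()
        sigma = self.log_sigma.exp().view(1, self.num_heads, 1, 1)
        mu = pos.view(1, 1, N, 1) + self.mu.view(1, self.num_heads, 1, 1)
        samples = mu + sigma * torch.randn(B, self.num_heads, N, self.num_samples,
                                           device=x.device)
        idx = samples.round().clamp(0, N - 1).long()            # (B, H, N, S)

        # Gather the sampled keys/values and attend only to them: cost is
        # O(N * S) per head instead of O(N^2).
        idx_exp = idx.unsqueeze(-1).expand(-1, -1, -1, -1, self.head_dim)
        k_s = k.unsqueeze(2).expand(-1, -1, N, -1, -1).gather(3, idx_exp)
        v_s = v.unsqueeze(2).expand(-1, -1, N, -1, -1).gather(3, idx_exp)
        attn = (q.unsqueeze(3) @ k_s.transpose(-2, -1)) / self.head_dim ** 0.5
        out = (F.softmax(attn, dim=-1) @ v_s).squeeze(3)        # (B, H, N, d)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))


# Example usage on a toy token sequence.
tokens = torch.randn(2, 64, 128)
print(GauAttentionSketch(dim=128).forward(tokens).shape)  # torch.Size([2, 64, 128])
```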
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. Better motivation behind the idea of Gaussian Attention
2. Improvements on Figure 1
3. More robust experimentation process on CIFAR-10
4. Better results on ImageNet
5. Addition of a new baseline
6. Addition of two prior works in the Related Work section
7. Benchmark on training time
Assigned Action Editor: ~Mingsheng_Long2
Submission Number: 3198