Keywords: Deep learning, sequence models, softmax-free attention, hardware efficiency, linear-time attention, transformer architecture, efficient transformers, attention mechanisms, layer normalization, long context, long sequence modeling, Long Range Arena
TL;DR: We propose a novel Transformer-based architecture with a linear-time attention mechanism that consists only of MatMuls and 1) is faster and more computationally efficient than the Transformer; 2) achieves similar or better performance.
Abstract: Transformers, despite empowering the current AI revolution, are bottlenecked by suboptimal hardware utilization and the quadratic runtime complexity of softmax attention with respect to input sequence length. Many recent architectures aspire to bring the complexity down to a sub-quadratic level without compromising modeling quality. However, they are either much slower on all but very long sequences or rely on low-level code tailored to a narrow subset of modern hardware. To simultaneously achieve linear complexity, hardware efficiency, and portability, we completely eliminate softmax from self-attention; remove, modify, or rearrange the other transformations in the Transformer block; and reduce the number of attention heads. The resulting architecture, DenseAttention Network, is composed entirely of dense matrix multiplications in the attention mechanism, which allows for efficient training and inference in both quadratic and linear modes. It performs on par with the standard Transformer in language modeling and surpasses the previous Transformer-based SOTA by 5% on the challenging Long Range Arena benchmarks. A DenseAttention model written in plain PyTorch is up to 22% faster than a Transformer augmented with the low-level FlashAttention kernel even on small context sizes, and faster by orders of magnitude on longer sequences.
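The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of the core idea the abstract describes: once softmax is removed, attention reduces to two MatMuls, so associativity lets the same operation be computed either quadratically as (QKᵀ)V or linearly as Q(KᵀV). The function name `dense_attention` and the omission of the paper's normalization and head-reduction details are assumptions made for brevity.

```python
import torch

def dense_attention(q, k, v, linear_mode=True):
    """Softmax-free attention sketch (hypothetical, simplified).

    q, k, v: tensors of shape (batch, seq_len, d).
    With no softmax between the two products, matrix associativity
    allows choosing the cheaper order of multiplication.
    """
    if linear_mode:
        # Linear mode, O(N * d^2): contract over sequence length first.
        kv = k.transpose(-2, -1) @ v   # (batch, d, d)
        return q @ kv                  # (batch, seq_len, d)
    else:
        # Quadratic mode, O(N^2 * d): materialize the full attention matrix.
        scores = q @ k.transpose(-2, -1)  # (batch, seq_len, seq_len)
        return scores @ v                 # (batch, seq_len, d)

# Both modes produce the same result up to floating-point error.
q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
assert torch.allclose(dense_attention(q, k, v, linear_mode=True),
                      dense_attention(q, k, v, linear_mode=False),
                      atol=1e-4)
```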
Submission Number: 13