Reviewed Version (pdf): https://openreview.net/references/pdf?id=MVjy0lhIJk
Keywords: Transformers, attention, efficient
Abstract: We introduce the Attention Free Transformer (AFT), an efficient variant of the Transformer (Vaswani et al., 2017) that eliminates the need for dot-product attention. AFT offers great simplicity and efficiency compared with standard Transformers: the multi-head attention operation is replaced with a composition of element-wise multiplications/divisions and global/local pooling. During training, AFT has linear time and space complexity with respect to both the sequence length and the feature dimension; in autoregressive decoding, it has constant memory and time complexity per step. We show that, surprisingly, AFT can be trained effectively on challenging benchmarks, matching or surpassing standard Transformers and other efficient variants. In particular, AFT achieves the state-of-the-art result on CIFAR10 autoregressive modeling with much reduced complexity, and also outperforms several efficient Transformer variants on Enwik8.
One-sentence Summary: We propose an efficient Transformer that eliminates attention.
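The abstract's description (element-wise multiplications/divisions plus global pooling in place of dot-product attention) maps naturally onto a few lines of array code. The sketch below is an illustrative reading of that description in the spirit of the simplest AFT variant: a sigmoid gate on the query, softmax-style weights derived from the key, and a global weighted pooling of the values, all applied per feature dimension. The function name `aft_simple`, the tensor shapes, and the use of NumPy are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def aft_simple(Q, K, V):
    """Sketch of an attention-free layer in the style described in the
    abstract (assumed form, not the authors' exact code).

    Q, K, V: arrays of shape (T, d) -- sequence length T, feature dim d.
    Returns an array of shape (T, d).
    """
    # Element-wise sigmoid gate on the query.
    gate = 1.0 / (1.0 + np.exp(-Q))                          # (T, d)
    # Softmax-style weights over the sequence axis, per feature dimension
    # (subtract the max for numerical stability).
    w = np.exp(K - K.max(axis=0, keepdims=True))             # (T, d)
    # Global pooling: weighted sum of values divided by the normalizer.
    pooled = (w * V).sum(axis=0, keepdims=True) / w.sum(axis=0, keepdims=True)  # (1, d)
    # Combine the pooled context with the gated query element-wise.
    return gate * pooled                                     # (T, d)

# Usage: T=8 tokens, d=16 features.
T, d = 8, 16
rng = np.random.default_rng(0)
Y = aft_simple(rng.normal(size=(T, d)),
               rng.normal(size=(T, d)),
               rng.normal(size=(T, d)))
print(Y.shape)  # (8, 16)
```

Because the pooling collapses the sequence axis once per feature dimension, this sketch costs O(T·d) time and memory, consistent with the linear-complexity claim in the abstract.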
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Community Implementations: [ 2 code implementations](https://www.catalyzex.com/paper/arxiv:2105.14103/code)