Reviewed Version (pdf): https://openreview.net/references/pdf?id=MVjy0lhIJk
Keywords: Transformers, attention, efficient
Abstract: We introduce the Attention Free Transformer (AFT), an efficient variant of the Transformer (Vaswani et al., 2017) that eliminates the need for dot-product attention. AFT offers great simplicity and efficiency compared with standard Transformers: the multi-head attention operation is replaced with a composition of element-wise multiplications/divisions and global/local pooling. During training, AFT has linear time and space complexity with respect to both the sequence length and the feature dimension; in autoregressive decoding, it has constant memory and time complexity per step. We show that, surprisingly, AFT can be trained effectively on challenging benchmarks, matching or surpassing standard Transformers and other efficient variants. In particular, AFT achieves the state-of-the-art result on CIFAR10 autoregressive modeling with much reduced complexity, and also outperforms several efficient Transformer variants on Enwik8.
One-sentence Summary: We propose an efficient Transformer that eliminates attention.
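The abstract's description (element-wise multiplications/divisions plus global pooling in place of dot-product attention) maps naturally onto a few lines of array code. The sketch below is an illustrative reading of that description in the spirit of the simplest AFT variant: a sigmoid gate on the query, softmax-style weights derived from the key, and a global weighted pooling of the values, all applied per feature dimension. The function name `aft_simple`, the tensor shapes, and the use of NumPy are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def aft_simple(Q, K, V):
    """Sketch of an attention-free layer in the style described in the
    abstract (assumed form, not the authors' exact code).

    Q, K, V: arrays of shape (T, d) -- sequence length T, feature dim d.
    Returns an array of shape (T, d).
    """
    # Element-wise sigmoid gate on the query.
    gate = 1.0 / (1.0 + np.exp(-Q))                          # (T, d)
    # Softmax-style weights over the sequence axis, per feature dimension
    # (subtract the max for numerical stability).
    w = np.exp(K - K.max(axis=0, keepdims=True))             # (T, d)
    # Global pooling: weighted sum of values divided by the normalizer.
    pooled = (w * V).sum(axis=0, keepdims=True) / w.sum(axis=0, keepdims=True)  # (1, d)
    # Combine the pooled context with the gated query element-wise.
    return gate * pooled                                     # (T, d)

# Usage: T=8 tokens, d=16 features.
T, d = 8, 16
rng = np.random.default_rng(0)
Y = aft_simple(rng.normal(size=(T, d)),
               rng.normal(size=(T, d)),
               rng.normal(size=(T, d)))
print(Y.shape)  # (8, 16)
```

Because the pooling collapses the sequence axis once per feature dimension, this sketch costs O(T·d) time and memory, consistent with the linear-complexity claim in the abstract.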
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Community Implementations: [ 2 code implementations](https://www.catalyzex.com/paper/arxiv:2105.14103/code)