Keywords: performer, transformer, attention, softmax, approximation, linear, bert, bidirectional, unidirectional, orthogonal, random, features, FAVOR, kernel, generalized, sparsity, reformer, linformer, protein, trembl, uniprot
Abstract: We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
One-sentence Summary: We introduce Performers, linear full-rank-attention Transformers via provable random feature approximation methods, without relying on sparsity or low-rankness.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Code: [![github](/images/github_icon.svg) google-research/google-research](https://github.com/google-research/google-research) + [![Papers with Code](/images/pwc_icon.svg) 11 community implementations](https://paperswithcode.com/paper/?openreview=Ua6zuk0WRH)
Data: [CIFAR-10](https://paperswithcode.com/dataset/cifar-10), [ImageNet](https://paperswithcode.com/dataset/imagenet), [PG-19](https://paperswithcode.com/dataset/pg-19)
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 11 code implementations](https://www.catalyzex.com/paper/arxiv:2009.14794/code)