A Dot Product Attention Free Transformer

Shuangfei Zhai; Walter Talbott; Nitish Srivastava; Chen Huang; Hanlin Goh; Ruixiang ZHANG; Joshua M. Susskind

A Dot Product Attention Free Transformer

Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang ZHANG, Joshua M. Susskind

29 Sept 2021 (modified: 13 Feb 2023)ICLR 2022 Conference Withdrawn SubmissionReaders: Everyone

Abstract: We introduce Dot Product Attention Free Transformer (DAFT), an efficient variant of Transformers \citep{transformer} that eliminates the query-key dot product in self attention. The core idea is to construct a decomposable attention map for each dimension of the query, key and value. This compositionality enables an implementation where the attention tensor does not to be computed or stored explicitly. A DAFT layer has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible with both large input and model sizes. We also introduce DAFT-conv, a model variant that takes advantage of locality and spatial weight sharing while maintaining global connectivity. We conduct experiments on ImageNet-1K classification, as well as CIFAR10 and Enwik8, two autoregressive modeling tasks. We show that DAFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.

5 Replies

Loading