Keywords: transformer, linear transformer, long sequences, fast Fourier transform, positional encoding, long range arena
Abstract: Transformers achieve remarkable performance in various domains, including NLP, CV, audio processing, and graph analysis. However, they do not scale well on long sequence tasks due to their quadratic complexity w.r.t. the input's length. Linear Transformers were proposed to address this limitation, but these models have shown weaker performance on long sequence tasks compared to the original Transformer. In this paper, we explore Linear Transformer models, rethinking their two core components. Firstly, we improve the Linear Transformer with a $\textbf{S}$hift-$\textbf{I}$nvariant $\textbf{K}$ernel $\textbf{F}$unction ($\textbf{SIKF}$), which achieves higher accuracy without loss in speed. Secondly, we introduce $\textbf{FastRPB}$, which stands for $\textbf{Fast}$ $\textbf{R}$elative $\textbf{P}$ositional $\textbf{B}$ias and efficiently adds positional information to self-attention using the Fast Fourier Transform. FastRPB is independent of the self-attention mechanism and can be combined with the original self-attention and all its efficient variants. FastRPB has $\mathcal{O}(N\log{N})$ computational complexity and requires $\mathcal{O}(N)$ memory w.r.t. the input sequence length $N$.
We compare the introduced modifications with recent Linear Transformers in different settings: text classification, document retrieval, and image classification. Extensive experiments with FastRPB and SIKF demonstrate that our model significantly outperforms another efficient positional encoding method in accuracy, while achieving up to 1.5x higher speed and requiring up to 10x less memory than the original Transformer.
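The paper itself does not include code on this page, but the stated $\mathcal{O}(N\log{N})$ time and $\mathcal{O}(N)$ memory figures are consistent with the standard FFT trick for applying a relative-bias (Toeplitz) matrix to a sequence without materializing the full $N \times N$ matrix. The sketch below illustrates that general idea only; the function name, bias layout, and interface are assumptions for illustration and are not FastRPB's actual API.

```python
import torch

def toeplitz_bias_matvec(bias: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Apply a Toeplitz relative-bias matrix B (B[i, j] = bias at offset i - j)
    to v in O(N log N) time and O(N) extra memory via FFT circular convolution.

    bias: (2N - 1,) biases for offsets -(N-1) .. N-1, with offset k stored
          at index k + N - 1 (hypothetical layout chosen for this sketch).
    v:    (N, d) value-like tensor.
    Returns a (N, d) tensor equal to B @ v without building B explicitly.
    """
    n, _ = v.shape
    col = bias[n - 1:]   # offsets 0 .. N-1 (first column of B)
    row = bias[:n - 1]   # offsets -(N-1) .. -1 (first row of B, reversed)
    # Length-2N circulant embedding of the Toeplitz matrix.
    c = torch.cat([col, bias.new_zeros(1), row])

    # Zero-pad v to length 2N so the circular convolution matches B @ v.
    v_pad = torch.cat([v, v.new_zeros(n, v.shape[1])], dim=0)

    # Pointwise product in frequency domain = circular convolution in time domain.
    fc = torch.fft.rfft(c, dim=0)
    fv = torch.fft.rfft(v_pad, dim=0)
    out = torch.fft.irfft(fc.unsqueeze(-1) * fv, n=2 * n, dim=0)
    return out[:n]
```

Because the bias matrix depends only on the relative offset $i - j$, it is Toeplitz, so the matrix-vector product reduces to a convolution that the FFT evaluates in $\mathcal{O}(N\log{N})$ while storing only length-$\mathcal{O}(N)$ vectors.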
One-sentence Summary: Improving efficient transformers with a new kernel function and fast relative positional embeddings
Supplementary Material: zip