Keywords: attention, locality sensitive hashing, reversible layers
TL;DR: Efficient Transformer with locality-sensitive hashing and reversible layers
Abstract: Large Transformer models routinely achieve state-of-the-art results on
a number of tasks, but training these models can be prohibitively costly,
especially on long sequences. We introduce two techniques to improve
the efficiency of Transformers. First, we replace dot-product attention
with attention based on locality-sensitive hashing, reducing its complexity
from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence (sketched below).
Second, we use reversible residual layers instead of the standard
residuals, which allows activations to be stored only once during training
instead of $N$ times, where $N$ is the number of layers (also sketched below).
The resulting model, the Reformer, performs on par with Transformer models
while being much more memory-efficient and much faster on long sequences.
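A minimal, illustrative sketch of the LSH-attention idea in plain Python/NumPy: shared query/key vectors are hashed into buckets via random rotations (angular LSH), and attention is computed only within each bucket, so the quadratic all-pairs comparison is avoided. This omits details of the actual Reformer implementation (sorting and chunking by bucket, multiple hash rounds, causal masking); the function name `lsh_attention` and its parameters are made up for illustration, not taken from the trax code.

```python
# Illustrative sketch of LSH-bucketed attention (not the trax implementation).
import numpy as np

def lsh_attention(qk, v, n_buckets=8, seed=0):
    """qk: (L, d) shared query/key vectors, v: (L, d) values."""
    L, d = qk.shape
    # Angular LSH: project onto random directions and take the index of the
    # largest coordinate among the directions and their negations.
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(d, n_buckets // 2))
    rotated = qk @ R                                   # (L, n_buckets / 2)
    buckets = np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]                # positions in this bucket
        q = k = qk[idx]
        scores = q @ k.T / np.sqrt(d)                  # attention only within the bucket
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

# Example: a 1024-token sequence with 64-dimensional heads.
rng = np.random.default_rng(1)
qk, v = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
print(lsh_attention(qk, v).shape)  # (1024, 64)
```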
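And a minimal sketch of the reversible residual coupling: each block computes $y_1 = x_1 + F(x_2)$ and $y_2 = x_2 + G(y_1)$, so the inputs can be recomputed from the outputs during the backward pass instead of being stored per layer. In the Reformer, $F$ is the attention layer and $G$ the feed-forward layer; here they are arbitrary placeholder functions, so this is a sketch of the coupling only, not of the full model.

```python
# Sketch of a reversible residual block: inputs are recoverable from outputs.
import numpy as np

def forward(x1, x2, F, G):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reverse(y1, y2, F, G):
    # Recompute the inputs from the outputs -- no stored activations needed.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

rng = np.random.default_rng(0)
W_f, W_g = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
F = lambda x: np.tanh(x @ W_f)   # stand-in for attention
G = lambda x: np.tanh(x @ W_g)   # stand-in for feed-forward

x1, x2 = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
y1, y2 = forward(x1, x2, F, G)
r1, r2 = reverse(y1, y2, F, G)
print(np.allclose(x1, r1), np.allclose(x2, r2))  # True True
```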
Code: https://github.com/google/trax/tree/master/trax/models/reformer
Community Implementations: [7 code implementations](https://www.catalyzex.com/paper/arxiv:2001.04451/code)