Reformer: The Efficient Transformer

Nikita Kitaev; Lukasz Kaiser; Anselm Levskaya

Published: 20 Dec 2019, Last Modified: 22 Oct 2023; ICLR 2020 Conference Blind Submission; Readers: Everyone

Keywords: attention, locality sensitive hashing, reversible layers

TL;DR: Efficient Transformer with locality-sensitive hashing and reversible layers

Abstract: Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training them can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. First, we replace dot-product attention with one that uses locality-sensitive hashing, reducing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the sequence length. Second, we use reversible residual layers instead of standard residuals, which allows activations to be stored only once during training instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
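The two ideas in the abstract can be illustrated with a small pure-Python sketch. It is not the Trax implementation. The function names (`rev_forward`, `rev_inverse`, `lsh_bucket`) and the stand-in sublayers `f` and `g` are hypothetical, and the abstract does not spell out the formulas, so the sketch assumes the usual reversible-residual coupling (`y1 = x1 + f(x2)`, `y2 = x2 + g(y1)`) and a random-projection angular hash.

```python
import math
import random


def f(x):
    # Hypothetical stand-in for the attention sublayer (any deterministic function works).
    return [math.tanh(v) for v in x]


def g(x):
    # Hypothetical stand-in for the feed-forward sublayer.
    return [0.5 * v * v for v in x]


def rev_forward(x1, x2):
    # Assumed reversible residual coupling: y1 = x1 + f(x2), y2 = x2 + g(y1).
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2):
    # Recompute the inputs from the outputs, so the inputs need not be stored.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


def lsh_bucket(x, projections):
    # Assumed angular hash: project x onto random directions, then take the
    # argmax over the projections concatenated with their negations.
    scores = [sum(a * b for a, b in zip(x, r)) for r in projections]
    scores = scores + [-s for s in scores]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
# Two random directions in 4 dimensions give 2 * 2 = 4 buckets.
projections = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]

x1, x2 = [0.1, -0.2, 0.3, 0.4], [1.0, 0.5, -0.5, 2.0]
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)  # matches (x1, x2) up to floating-point rounding

bucket = lsh_bucket(x2, projections)
same_direction = lsh_bucket([2.0 * v for v in x2], projections)  # same bucket
opposite = lsh_bucket([-v for v in x2], projections)  # a different bucket
```

Because `rev_inverse` reconstructs each block's inputs from its outputs, a backward pass can recompute activations layer by layer rather than keeping a copy for each of the $N$ layers. The hash depends only on a vector's direction, so vectors pointing the same way share a bucket, and attention can then be restricted to positions within a bucket.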

Code: https://github.com/google/trax/tree/master/trax/models/reformer

Community Implementations: [7 code implementations](https://www.catalyzex.com/paper/arxiv:2001.04451/code)
