Keywords: Reformer, Transformer, Efficient Transformer, Attention, LSH, Reversible Residual Layers, Chunked Feed Forward, Language modelling
Abstract: We attempt to reproduce the central claims of the ICLR 2020 paper "Reformer: The Efficient Transformer": that the techniques introduced enable performance on par with a traditional Transformer model while being much more memory-efficient and much faster on long sequences. This fast.ai community effort reproduced the claims around speed on long sequences and observed a reduction in memory usage. We could not match the performance of a traditional Transformer with Reformer. Finally, substantial coding effort was required, compounded by a lack of implementation documentation. The scope of this work is to verify the Reformer's claims of memory efficiency and speed on longer sequences. We replicated only the NLP experiments due to limited computational resources. We first reimplemented the original Transformer model, which we then modified. We referred to the authors' code for the model and data pipeline. We used the fastai library for training, Weights and Biases for experiment tracking, and nbdev for development. All experiments were run in a single-GPU setting. Claims around speed on longer sequences and reduced memory footprint were validated; as sequence length increased, Locality Sensitive Hashing ("LSH") Attention became faster, and increasing the number of hashes improved performance. We could not achieve the performance of a traditional Transformer with Reformer. Some experiments were not run for as long as in the paper due to a lack of computational resources. The under-performance of our Reformer may be due to under-training, implementation differences, or nuances in JAX vs PyTorch. In addition, exploding gradients were encountered with mixed precision training, and several model settings were found to be unstable depending on the random seed or learning rate. Obtaining the data was straightforward, as the datasets are commonly used benchmarks.
There were no issues reproducing the data pipeline or Chunked Feed Forward layers, and code for the Axial Positional Encodings was imported. Substantial effort was made to ensure a correct reimplementation. This was challenging because many engineering design decisions and hyperparameters were not fully documented. Significant hyperparameter tuning was also needed. The authors were receptive to email correspondence and clarified a number of implementation details. We provide all code and documentation in our GitHub repository.
Paper Url: https://openreview.net/forum?id=rkgNKkHtvB&noteId=SJxEEtVosB