Keywords: Long Range Arena, Transformers
TL;DR: Rotary Transformers achieve SOTA results in the Long Range Arena without pretraining, using other training techniques instead. We run ablation studies and discuss the reasons, showing that short-range dependencies account for a large portion of the performance.
Abstract: Despite their success, Transformers suffer from quadratic complexity in the sequence length, limiting their applicability to long-range dependency problems and making them expensive to train and run. After many proposals to address this issue, the Long Range Arena (LRA) was suggested as a benchmark to evaluate the performance of new models on long-range dependency modeling tasks. The Transformer and its variants performed poorly on this benchmark, and a new series of architectures such as State Space Models (SSMs) gained traction, greatly outperforming Transformers in the LRA. Recent work has shown that with a denoising pretraining phase, Transformers can achieve results in the LRA competitive with these new architectures. In this work, we discuss and explain the superiority of architectures such as MEGA and SSMs in the Long Range Arena, as well as the recent improvement in the results of Transformers, pointing to the positional and local nature of the tasks. We show that while the LRA is a benchmark for long-range dependency modeling, in reality most of the performance comes from short-range dependencies. By using rotary embeddings and training techniques that mitigate its data inefficiency, the Transformer is also able to reach state-of-the-art performance without a separate pretraining phase. What is more, with the same techniques, we are able to remove all restrictions from SSM convolutional kernels and learn fully parameterized convolutions without decreasing performance, suggesting that the design choices behind SSMs merely added inductive biases and learning efficiency for these particular tasks. Our insights indicate that LRA results should be interpreted with caution and call for a redesign of the benchmark.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7089