Linearizing Large Language Models

Published: 10 Jul 2024, Last Modified: 26 Aug 2024, COLM, CC BY 4.0
Research Area: Compute efficient LMs
Keywords: linear attention, efficient attention, RNN
TL;DR: converting LLMs into RNNs through minimal up-training
Abstract: Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed recurrent state. However, they suffer from poor scaling and under-perform compute-matched transformers. Prior models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training large language models requires significant data and compute investments. In this paper, we propose Scalable UPtraining for Recurrent Attention (SUPRA), an alternative to pre-training linear transformers. We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget. This allows us to leverage the strong pre-training data and performance of existing transformer LLMs, while requiring 5% of the training cost. We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify a persistent in-context learning shortfall for even the largest linear models.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on
Author Guide: I certify that this submission complies with the submission instructions as described on
Submission Number: 68
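
Illustrative note: the abstract refers to the "fixed recurrent state" of linear transformers. The sketch below shows, for generic kernelized linear attention, why causal attention can be computed as an RNN whose state does not grow with sequence length. It is not the SUPRA method, which the abstract does not specify in detail. The feature map elu(x) + 1, the normalization by a running sum, and all function names are assumptions chosen for illustration.

```python
import math
import random


def feature_map(x):
    # elu(x) + 1 keeps every feature positive.
    # Assumption: a common choice for linear attention, not taken from the paper.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention_parallel(Q, K, V):
    """Quadratic-time reference: position t explicitly attends to positions 0..t."""
    d_v = len(V[0])
    outputs = []
    for t in range(len(Q)):
        q = feature_map(Q[t])
        weights = [
            sum(a * b for a, b in zip(q, feature_map(K[s])))
            for s in range(t + 1)
        ]
        norm = sum(weights)
        outputs.append(
            [sum(weights[s] * V[s][j] for s in range(t + 1)) / norm
             for j in range(d_v)]
        )
    return outputs


def causal_linear_attention_recurrent(Q, K, V):
    """Linear-time RNN form: carries a fixed-size state (S, z) across steps."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    outputs = []
    for q_raw, k_raw, v in zip(Q, K, V):
        q, k = feature_map(q_raw), feature_map(k_raw)
        # State update: size d_k * d_v + d_k, independent of sequence length.
        for i in range(d_k):
            z[i] += k[i]
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        norm = sum(q[i] * z[i] for i in range(d_k))
        outputs.append(
            [sum(q[i] * S[i][j] for i in range(d_k)) / norm for j in range(d_v)]
        )
    return outputs


def random_matrix(rows, cols, rng):
    # Hypothetical helper for generating toy inputs.
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    Q, K, V = (random_matrix(6, 4, rng) for _ in range(3))
    par = causal_linear_attention_parallel(Q, K, V)
    rec = causal_linear_attention_recurrent(Q, K, V)
    gap = max(abs(a - b) for ra, rb in zip(par, rec) for a, b in zip(ra, rb))
    print("forms agree:", gap < 1e-9)
```

Both functions compute the same outputs; the recurrent form only needs the running sums S and z at each step, which is what lets such a model run as an RNN at inference time.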