Linearizing Large Language Models

Published: 10 Jul 2024, Last Modified: 26 Aug 2024 · COLM · CC BY 4.0
Research Area: Compute efficient LMs
Keywords: linear attention, efficient attention, RNN
TL;DR: converting LLMs into RNNs through minimal up-training
Abstract: Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed recurrent state. However, they suffer from poor scaling and under-perform compute-matched transformers. Prior models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training large language models requires significant data and compute investments. In this paper, we propose Scalable UPtraining for Recurrent Attention (SUPRA), an alternative to pre-training linear transformers. We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget. This allows us to leverage the strong pre-training data and performance of existing transformer LLMs, while requiring 5% of the training cost. We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify a persistent in-context learning shortfall for even the largest linear models.
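To illustrate the "fixed recurrent state" property the abstract refers to, below is a minimal sketch of generic linear attention computed as a step-by-step recurrence. This is not the authors' SUPRA recipe; the feature map `phi` (here an assumed ReLU-plus-epsilon stand-in for the softmax kernel) and the normalization are placeholders, and uptrained models would learn or choose their own.

```python
# Minimal sketch (assumptions noted above): linear attention as an RNN-style
# recurrence with a fixed-size state, independent of sequence length.
import numpy as np

def linear_attention_rnn(queries, keys, values, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Process a sequence one token at a time with constant-size state.

    queries, keys, values: arrays of shape (seq_len, d)
    phi: positive feature map approximating the softmax kernel (an assumption;
         SUPRA-style methods define their own feature map and normalization).
    """
    seq_len, d = queries.shape
    state = np.zeros((d, d))       # running sum of outer products phi(k_t) v_t^T
    normalizer = np.zeros(d)       # running sum of phi(k_t) for the denominator
    outputs = np.zeros((seq_len, d))
    for t in range(seq_len):
        fk = phi(keys[t])
        state += np.outer(fk, values[t])             # accumulate key-value associations
        normalizer += fk
        fq = phi(queries[t])
        outputs[t] = (fq @ state) / (fq @ normalizer)  # attention readout at step t
    return outputs

# Toy usage: unlike a softmax KV cache, the recurrent state never grows with length.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
print(linear_attention_rnn(q, k, v).shape)  # (8, 4)
```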
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 68