Research Area: Compute efficient LMs
Keywords: linear attention, efficient attention, RNN
TL;DR: converting LLMs into RNNs through minimal uptraining
Abstract: Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state. However, they scale poorly and underperform compute-matched transformers. Prior models such as RWKV and Mamba attempt to address these shortcomings with novel time-mixing and gating architectures, but pre-training such large language models from scratch demands significant data and compute. In this paper, we propose Scalable UPtraining for Recurrent Attention (SUPRA), an alternative to pre-training linear transformers from scratch. We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget, allowing us to leverage the strong pre-training data and performance of existing transformer LLMs while requiring only 5% of the training cost. We find that our linearization technique yields competitive performance on standard benchmarks, but we identify a persistent in-context learning shortfall even in the largest linear models.
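For concreteness, the fixed-size recurrent state mentioned in the abstract can be illustrated with a minimal causal linear-attention sketch. This is not the SUPRA uptraining recipe; the feature map (elu(x)+1, a common choice in the linear-attention literature) and the epsilon-stabilized normalization are illustrative assumptions.

import numpy as np

def feature_map(x):
    # Illustrative kernel feature map phi(x) = elu(x) + 1 (an assumption,
    # not necessarily the map used in the paper); keeps values positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_recurrent(Q, K, V):
    # Process tokens one at a time, carrying a fixed-size state S (d_k x d_v)
    # and normalizer z (d_k,) instead of the full T x T attention matrix.
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of outer(phi(k_t), v_t)
    z = np.zeros(d_k)          # running sum of phi(k_t)
    out = np.zeros((T, d_v))
    for t in range(T):
        q, k, v = feature_map(Q[t]), feature_map(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)  # per-step cost is O(d_k * d_v)
    return out

# Toy usage: memory and per-token compute stay constant in sequence length.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 16)), rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
print(linear_attention_recurrent(Q, K, V).shape)  # (8, 16)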
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 68