Keywords: Linear Attention, Linearizing Transformers, Low-rank Adaptation, Large Language Models, Architecture Distillation
Abstract: Recent works show that we can linearize large language models (LLMs), replacing the softmax attentions of popular Transformer-based LLMs with subquadratic analogs, to create subquadratic LLMs at a fraction of typical pretraining costs. However, existing approaches significantly degrade model quality, require expensive full-model training over billions of tokens to adapt LLMs to the new layers, and remain limited to smaller 1.3B to 7B LLMs. We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves linearizing quality with orders of magnitude less memory and compute. We base these steps on two findings. First, rather than adapt LLMs to completely new layers, we can replace their softmax attentions with near-equivalent linear attentions simply by training the linear layers to approximate their softmax counterparts ("attention transfer"). Second, this lets us use low-rank adaptation (LoRA) alone to adjust for the remaining approximation errors, recovering quality in fully subquadratic LLMs. In experiments, LoLCATs significantly improves linearizing quality, training efficiency, and scalability. First, by linearizing Llama 3 8B and Mistral 7B v0.1, LoLCATs produces state-of-the-art subquadratic LLMs that outperform both prior linearizing methods (SUPRA) and strong pretrained 7B subquadratic alternatives to Transformers (_e.g._, RWKV, Mamba, Griffin) by 2.9-8.0 points on popular zero-shot LM Evaluation Harness tasks (+20 points on 5-shot MMLU). Next, LoLCATs achieves these gains with only 0.2\% of past methods' model parameters and 0.4\% of their training tokens. Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50× larger than prior work). When compared with prior methods under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between the linearized and original Llama 3.1 70B and 405B LLMs by 78.7\% and 77.4\% on 5-shot MMLU.
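To make the "attention transfer" step concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a causal linear attention with a learnable feature map is trained to match the outputs of the frozen softmax attention, after which LoRA would adjust for residual errors. The feature-map architecture, MSE matching loss, and all names here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Causal linear attention with a learnable feature map phi (illustrative only)."""

    def __init__(self, head_dim: int, feature_dim: int = 64):
        super().__init__()
        # Hypothetical feature map; the real choice of phi is a design detail of the method.
        self.phi = nn.Sequential(nn.Linear(head_dim, feature_dim), nn.ReLU())

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq_len, head_dim)
        q, k = self.phi(q), self.phi(k)                              # map queries/keys to feature space
        kv = torch.einsum("bhnf,bhnd->bhnfd", k, v).cumsum(dim=2)    # running sums of phi(k_n) v_n^T (causal)
        z = k.cumsum(dim=2)                                          # running normalizer sum of phi(k_n)
        num = torch.einsum("bhnf,bhnfd->bhnd", q, kv)
        den = torch.einsum("bhnf,bhnf->bhn", q, z).clamp(min=1e-6)
        return num / den.unsqueeze(-1)

def attention_transfer_loss(linear_attn, q, k, v):
    """Train the linear layer to approximate the frozen softmax attention's outputs."""
    with torch.no_grad():
        teacher = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    student = linear_attn(q, k, v)
    return F.mse_loss(student, teacher)
```

In this sketch, only the feature map's parameters are trained during attention transfer; the second step (LoRA on the swapped model) is a standard low-rank finetune and is omitted here.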
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8323