Abstract: Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. Linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature-map modules that require extensive fine-tuning, and they overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents \textbf{Liger}, short for \underline{\textbf{Li}}nearizing LLMs to \underline{\textbf{g}}at\underline{\textbf{e}}d \underline{\textbf{r}}ecurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, enabling the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. With lightweight fine-tuning via Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. We further introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93\% of the Transformer-based LLM's performance using only 0.02\% of the pre-training tokens during linearization, achieving competitive results across multiple benchmarks on models ranging from 1B to 8B parameters.
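A minimal PyTorch-style sketch of the weight-reuse idea described above, assuming access to a pretrained key projection matrix. The function name `gate_from_key_projection`, the head-wise mean pooling, and the sigmoid squashing are illustrative assumptions, not the paper's exact formulation; the point is that the gate is built entirely from existing weights, with no new trainable parameters.

```python
import torch

def gate_from_key_projection(x, W_k, num_heads):
    """Sketch: reuse a pretrained key projection to produce a per-head, per-token
    gate via a parameter-free pooling step (assumed form, no new trainable weights)."""
    B, T, D = x.shape                          # batch, sequence length, model dim
    k = x @ W_k.T                              # pretrained key weights, assumed shape (D, D)
    k = k.view(B, T, num_heads, D // num_heads)
    # Mean-pool within each head, then squash to (0, 1): one gate per head and token.
    g = torch.sigmoid(k.mean(dim=-1))          # (B, T, num_heads)
    return g
```

Under this reading, the lightweight LoRA fine-tuning mentioned in the abstract would adapt the surrounding projections rather than introduce or train a separate gating module.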
Lay Summary: Current large language models (LLMs), such as those powering chatbots like ChatGPT, predominantly rely on Transformer architectures. While effective, these architectures are inefficient: their training cost grows quadratically with sequence length, and their memory footprint keeps growing during inference. Recently proposed gated linear recurrent architectures offer a promising, more efficient alternative, yet validating their potential remains challenging because training LLMs from scratch requires prohibitive computational resources.
In this work, we present Liger, a novel technique for converting existing Transformer-based LLMs into gated linear recurrent structures. Unlike prior linearization methods that require multi-stage training, Liger constructs gating modules through parameter-free pooling operations, fully reusing the pretrained Transformer LLM weights without introducing any new learnable parameters. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93% of the Transformer-based LLM's performance using only 0.02% of the pre-training tokens during linearization.
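As a rough illustration of what an intra-layer hybrid can look like, the sketch below mixes a gated linear-attention branch (one pass over time with a constant-size state) with local softmax attention over a sliding window. The choice of branches, the mixing weight `alpha`, the window size, and the function name `hybrid_attention` are assumptions for illustration, not the exact Liger Attention formulation.

```python
import torch

def hybrid_attention(q, k, v, g, window=64, alpha=0.5):
    """Illustrative intra-layer hybrid: a gated linear-attention branch with a
    constant-size state, mixed with causal softmax attention over a sliding
    window. Shapes: q, k, v are (B, H, T, d); g is a per-head gate (B, H, T)."""
    B, H, T, d = q.shape
    # Gated linear-attention branch: single pass over time, fixed-size state S.
    S = torch.zeros(B, H, d, d, device=q.device, dtype=q.dtype)
    lin_out = []
    for t in range(T):
        # Decay the state with the per-head gate, then add the new k v^T outer product.
        S = g[:, :, t, None, None] * S + k[:, :, t, :, None] * v[:, :, t, None, :]
        lin_out.append((q[:, :, t, None, :] @ S).squeeze(2))    # (B, H, d)
    lin_out = torch.stack(lin_out, dim=2)                       # (B, H, T, d)
    # Local softmax attention branch: causal, restricted to a sliding window.
    scores = (q @ k.transpose(-1, -2)) / d ** 0.5               # (B, H, T, T)
    idx = torch.arange(T, device=q.device)
    mask = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= window)
    scores = scores.masked_fill(mask, float("-inf"))
    soft_out = torch.softmax(scores, dim=-1) @ v                # (B, H, T, d)
    # Intra-layer mixture of the two branches (the weighting rule is an assumption).
    return alpha * soft_out + (1 - alpha) * lin_out
```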
Experiments across diverse commonsense reasoning and knowledge benchmarks demonstrate Liger's strong performance: the linearized models feature linear computational complexity and constant memory usage, and the method can be applied to various linear recurrent structures with gating modules.
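The linear-compute and constant-memory properties follow from the recurrent view of gated linear attention. A generic update, in illustrative notation that is not necessarily Liger's exact rule, keeps only a fixed-size state per head:

```latex
% Generic gated linear recurrent update (illustrative):
% S_t has fixed shape d x d, so per-token compute and memory stay constant
% regardless of sequence length.
S_t = \operatorname{diag}(g_t)\, S_{t-1} + k_t v_t^{\top}, \qquad o_t = S_t^{\top} q_t
```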
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/OpenSparseLLMs/Linearization
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Linear-Complexity Sequence Modeling, LoRA
Submission Number: 4440