On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention for Long-Context LLM Serving
Abstract: Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy due to overemphasizing recent tokens. In this work, we first propose *dual-state linear attention* (**DSLA**), a novel design that maintains two specialized hidden states—one for preserving historical context and one for tracking recency—thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce
DSLA-*Serve*, an online *adaptive distillation* framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering.
DSLA-*Serve* uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving overall model quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that
DSLA-*Serve* yields **2.3×** faster inference than Llama2-7B and **3.0×** faster than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA’s dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear-attention designs.
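For intuition, the snippet below is a minimal sketch of how a dual-state linear-attention recurrence of this kind could be written. It is an illustration based only on the abstract's description, not the authors' implementation: the function name `dsla_step`, the decay factor `gamma`, and the fixed mixing weight `mix` are assumptions.

```python
# Minimal sketch of a dual-state linear-attention step (illustrative only).
# Assumptions not taken from the paper: the decay factor `gamma`, the fixed
# mixing weight `mix`, and all variable names.
import torch

def dsla_step(S_hist, S_recent, q, k, v, gamma=0.95, mix=0.5):
    """Process one token with two hidden states.

    S_hist, S_recent: (d_k, d_v) state matrices; q, k: (d_k,); v: (d_v,).
    """
    kv = torch.outer(k, v)                    # rank-1 contribution of the current token
    S_hist = S_hist + kv                      # "history" state: accumulates all past tokens
    S_recent = gamma * S_recent + kv          # "recency" state: decay favors recent tokens
    out = mix * (q @ S_hist) + (1.0 - mix) * (q @ S_recent)  # blend global and local context
    return S_hist, S_recent, out

# Toy usage: scan a sequence of 8 tokens with d_k = d_v = 4.
d_k, d_v, T = 4, 4, 8
S_hist, S_recent = torch.zeros(d_k, d_v), torch.zeros(d_k, d_v)
for _ in range(T):
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    S_hist, S_recent, out = dsla_step(S_hist, S_recent, q, k, v)
```

A real implementation would add feature maps, normalization, and likely learned gating between the two states; the sketch only shows why keeping separate history and recency states lets one layer serve both global and local dependencies.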
Lay Summary: Large language models (LLMs), like those behind chatbots and AI assistants, are powerful but slow and memory-intensive—especially when processing long texts. Some faster alternatives exist, but they often lose accuracy because they focus too much on recent words and neglect earlier parts of the input.
In this work, we introduce a new method called **Dual-State Linear Attention (DSLA)**. It maintains two separate memory states—one for past context and one for recent information—allowing it to stay accurate while being more efficient.
We also develop an automatic system called DSLA-*Serve*, which gradually replaces the heavier parts of the model with our lighter DSLA components during real-time use. It performs this replacement carefully to ensure the AI continues to perform well, even as it switches to more efficient alternatives.
When tested on challenging tasks like reasoning, answering long questions, and summarizing text, DSLA-*Serve* ran up to 3× faster than existing models—without sacrificing performance.
Link To Code: https://github.com/utnslab/DSLA-Serve
Primary Area: Deep Learning->Large Language Models
Keywords: LLM, Efficient serving, Linear attention, adaptive inference, dynamic serving
Submission Number: 1381