On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention for Long-Context LLM Serving
Abstract: Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy due to overemphasizing recent tokens. In this work, we first propose *dual-state linear attention* (**DSLA**), a novel design that maintains two specialized hidden states—one for preserving historical context and one for tracking recency—thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce
DSLA-*Serve*, an online *adaptive distillation* framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering.
DSLA-*Serve* uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving overall model quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that
DSLA-*Serve* yields **2.3×** faster inference than Llama2-7B and **3.0×** faster than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA’s dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear-attention designs.
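For intuition, the snippet below is a minimal sketch of how a dual-state linear-attention recurrence of this kind could be written. It is an illustration based only on the abstract's description, not the authors' implementation: the function name `dsla_step`, the decay factor `gamma`, and the fixed mixing weight `mix` are assumptions.

```python
# Minimal sketch of a dual-state linear-attention step (illustrative only).
# Assumptions not taken from the paper: the decay factor `gamma`, the fixed
# mixing weight `mix`, and all variable names.
import torch

def dsla_step(S_hist, S_recent, q, k, v, gamma=0.95, mix=0.5):
    """Process one token with two hidden states.

    S_hist, S_recent: (d_k, d_v) state matrices; q, k: (d_k,); v: (d_v,).
    """
    kv = torch.outer(k, v)                    # rank-1 contribution of the current token
    S_hist = S_hist + kv                      # "history" state: accumulates all past tokens
    S_recent = gamma * S_recent + kv          # "recency" state: decay favors recent tokens
    out = mix * (q @ S_hist) + (1.0 - mix) * (q @ S_recent)  # blend global and local context
    return S_hist, S_recent, out

# Toy usage: scan a sequence of 8 tokens with d_k = d_v = 4.
d_k, d_v, T = 4, 4, 8
S_hist, S_recent = torch.zeros(d_k, d_v), torch.zeros(d_k, d_v)
for _ in range(T):
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    S_hist, S_recent, out = dsla_step(S_hist, S_recent, q, k, v)
```

A real implementation would add feature maps, normalization, and likely learned gating between the two states; the sketch only shows why keeping separate history and recency states lets one layer serve both global and local dependencies.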
Lay Summary: Large language models (LLMs), like those behind chatbots and AI assistants, are powerful but slow and memory-intensive—especially when processing long texts. Some faster alternatives exist, but they often lose accuracy because they focus too much on recent words and neglect earlier parts of the input.
In this work, we introduce a new method called **Dual-State Linear Attention (DSLA)**. It maintains two separate memory states—one for past context and one for recent information—allowing it to stay accurate while being more efficient.
We also develop an automatic system called DSLA-*Serve*, which gradually replaces the heavier parts of the model with our lighter DSLA components during real-time use. It performs this replacement carefully to ensure the AI continues to perform well, even as it switches to more efficient alternatives.
When tested on challenging tasks like reasoning, answering long questions, and summarizing text, DSLA-*Serve* ran up to 3× faster than existing models—without sacrificing performance.
Link To Code: https://github.com/utnslab/DSLA-Serve
Primary Area: Deep Learning->Large Language Models
Keywords: LLM, Efficient serving, Linear attention, adaptive inference, dynamic serving
Submission Number: 1381