SLAKE: Softmax-Approximated Training-Free Linear Attention with KV-Cache Eviction for Long-Sequence LLMs
Keywords: large language models, linear attention, KV-cache compression, self-attention approximation
TL;DR: SLAKE is a training-free framework that combines linear attention and KV-cache eviction to achieve efficient, accurate 128K-token long-context inference.
Abstract: Recent advances in transformer-based large language models (LLMs) have enabled inference over contexts as long as 128K tokens. However, the quadratic computational and memory costs of full self-attention remain a fundamental bottleneck at such scales. Prior efforts to mitigate this challenge largely fall into two camps: (i) structural approximations (e.g., linear attention) that reduce asymptotic complexity but typically require costly retraining, and (ii) KV-cache optimizations (e.g., eviction or merging) that are training-free yet inevitably discard information. We introduce Softmax-Approximated Training-Free Linear Attention with KV-Cache Eviction (SLAKE), a novel framework that unifies the complementary advantages of these two paradigms. At its core, SLAKE employs Partially Taylor-Approximated Attention (PTAA), which applies a first-order Taylor expansion to selectively linearize the Softmax attention kernel. This design allows tokens scored as low-importance by the eviction criterion to be processed efficiently with linear attention, while preserving exact Softmax computation for high-salience tokens. To further improve cache efficiency, we propose Value-Aware Budget Scoring (VABS), a new allocation strategy that incorporates value contributions and overcomes key limitations of previous eviction heuristics. Extensive experiments on LLaMA-3 8B demonstrate that SLAKE delivers up to 10$\times$ inference speedup and 30.8\% peak-memory reduction on 128K-token sequences, while keeping accuracy loss below 4\%. To our knowledge, SLAKE is the first approach to jointly integrate linear attention with KV-cache eviction without any retraining, establishing a new state of the art among training-free long-context methods.
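The mechanics behind PTAA can be made concrete with a short sketch. The code below is a rough illustration, not the paper's implementation: the function name `ptaa_attention`, the single-query NumPy setup, and the choice of summary statistics are all assumptions, and causal masking, numerical stabilization, and the paper's eviction scoring are omitted. Salient cached tokens go through exact softmax scoring, while the remaining tokens use the first-order Taylor kernel exp(x) ≈ 1 + x, whose contribution collapses into cacheable summary statistics.

```python
import numpy as np

def ptaa_attention(q, K, V, salient_idx, scale=None):
    """Hypothetical single-query PTAA sketch (illustration only).

    q           : (d,)   query vector
    K, V        : (n, d) cached keys / values
    salient_idx : indices of high-salience tokens kept under exact softmax
    """
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d) if scale is None else scale
    q = q * scale

    mask = np.zeros(K.shape[0], dtype=bool)
    mask[salient_idx] = True

    # Exact softmax branch over the salient tokens.
    Ks, Vs = K[mask], V[mask]
    exact_scores = np.exp(Ks @ q)            # (|S|,) unnormalized weights
    num = exact_scores @ Vs                  # (d,)
    den = exact_scores.sum()

    # First-order Taylor branch over the remaining tokens:
    #   sum_j (1 + q·k_j) v_j = sum(V_l) + q @ (K_l^T V_l)
    # so their contribution reduces to a (d, d) state plus two vectors.
    Kl, Vl = K[~mask], V[~mask]
    kv_state = Kl.T @ Vl                     # (d, d) summary, cacheable
    num = num + Vl.sum(axis=0) + q @ kv_state
    den = den + Kl.shape[0] + q @ Kl.sum(axis=0)

    return num / den
```

If `kv_state`, the key sum, and the token count were maintained incrementally as tokens are demoted from the exact branch, each decoding step would pay O(d^2) for the linearized portion of the cache instead of O(n·d), and the corresponding keys and values would no longer need to be stored, which plausibly underlies the speedup and peak-memory reductions the abstract reports.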
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15640