MVA: Linear Attention with High-order Query-Keys Integration and Multi-level Vocabulary Decomposition

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Fine-tuning pre-trained Transformers into RNN-style linear attention models.
Abstract: Linear attention offers the advantages of linear inference time and fixed memory usage compared to Softmax attention. However, training large-scale language models with linear attention from scratch remains prohibitively expensive, and the resulting models exhibit significant performance gaps compared to Softmax-based models. To address these challenges, we focus on transforming pre-trained Softmax-based language models into linear attention models. We unify mainstream linear attention methods using a **high-order QK integration theory** and a **multi-level vocabulary decomposition**. Specifically, the QK integration theory explains the efficacy of combining linear and sparse attention from the perspective of information collection across different frequency bands. The multi-level vocabulary decomposition exponentially expands memory capacity by recursively exploiting the compression loss of compressed states. Through detailed error analysis, we demonstrate that our approach achieves a superior approximation of Softmax attention. To further improve performance and reduce training costs, we adopt a **soft integration strategy** over attention scores, effectively combining our method with a sliding-window mechanism. With fewer than 100M tokens, our method fine-tunes models to linear complexity while retaining 99\% of their original performance. Compared to state-of-the-art linear attention models and methods, our approach improves MMLU scores by 1.2 percentage points with minimal fine-tuning. Furthermore, even without the sliding-window mechanism, our method achieves state-of-the-art performance on all test sets when trained on 10B tokens.
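Background sketch: the abstract builds on the standard linear-attention formulation, in which a positive feature map replaces the softmax so that attention reduces to a fixed-size recurrent state update with linear time and constant memory per step. The code below is a minimal, generic illustration of that recurrence, assuming an `elu(x) + 1` feature map and single-head `[T, d]` tensors; it is not the paper's MVA implementation and omits the high-order QK integration, multi-level vocabulary decomposition, and sliding-window combination described above.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """Generic causal linear attention via a running state (illustrative only).

    q, k, v: tensors of shape [T, d]. A positive feature map elu(x) + 1
    replaces the softmax, so the output is computed with O(T) time and a
    fixed-size recurrent state instead of an O(T^2) attention matrix.
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1          # positive feature maps
    T, d = q.shape
    state = q.new_zeros(d, d)                  # running sum of k_t v_t^T
    norm = q.new_zeros(d)                      # running sum of k_t
    out = []
    for t in range(T):
        state = state + torch.outer(k[t], v[t])
        norm = norm + k[t]
        out.append((q[t] @ state) / (q[t] @ norm).clamp(min=1e-6))
    return torch.stack(out)

# Example: a 16-token sequence with head dimension 8.
q, k, v = (torch.randn(16, 8) for _ in range(3))
y = causal_linear_attention(q, k, v)
print(y.shape)  # torch.Size([16, 8])
```

In the paper's setting, it is this kind of compressed recurrent state that the multi-level vocabulary decomposition expands and the high-order QK integration analyzes, while the soft integration strategy combines such a component with sliding-window attention over recent tokens.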
Lay Summary: Modern AI language models are powerful but computationally expensive. While newer "linear attention" models promise faster, cheaper performance, training them from scratch remains costly and their results often lag behind traditional methods. Instead of building these efficient models from the ground up, we propose upgrading existing high-performing models to adopt linear attention’s benefits. Our approach works like a targeted retrofit: we analyze how existing models process information and systematically adjust their "attention" mechanisms—the part that determines which words or concepts the model focuses on. By breaking down how the model handles different types of language patterns and memory, we bridge the performance gap between traditional and linear models. After minor tweaks (using less than 1% of the original training data), our upgraded models retain 99% of their original accuracy while running significantly faster. This method outperforms other linear attention techniques and offers a practical path to deploy efficient AI without sacrificing quality. We’ve prioritized simplicity, ensuring even small-scale applications can benefit from these advancements.
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Fine-tune transformer to RNN; Linear attention; Linear Model;
Submission Number: 14640