Keywords: Linear Attention, Linear RNN, Large Language Model
TL;DR: We introduce an explicit residual-fitting mechanism to improve the performance of linear attention models.
Abstract: Linear attention offers a linear-time alternative to self-attention but often struggles to capture long-range patterns. We revisit linear attention through a prediction-correction lens and show that prevalent variants perform prediction without an explicit correction step. Building on this view, we introduce Residual Linear Attention (RLA), a framework that equips linear attention with an explicit residual-fitting mechanism. RLA maintains an auxiliary recurrent state that learns to accumulate residual errors over time and correct the base prediction. We further instantiate a delta-rule version, Residual Delta Net (RDN), incorporating adaptive gating and residual clipping for enhanced correction control and stability. Our implementation leverages highly optimized linear attention kernels and preserves linear time and memory. Across language modeling and recall-intensive evaluations, RLA and RDN consistently outperform their respective baselines and other modern linear-attention methods, narrowing the gap to standard Transformers while retaining linear scaling.
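As a rough illustration of the recurrence the abstract describes, the following is a minimal single-head sketch in Python, not the authors' implementation: the base linear-attention state produces a prediction, and an auxiliary state accumulates residual errors and adds a correction at read-out. The feature map, update rule, and clipping threshold are assumptions for illustration only.

```python
import torch

def residual_linear_attention_sketch(q, k, v, clip=1.0):
    """Illustrative residual-corrected linear attention recurrence.

    q, k: (T, d_k); v: (T, d_v). A base state S gives the standard
    linear-attention prediction; an auxiliary state R accumulates the
    residual of that prediction and corrects the output. All details
    (e.g., the clipping constant) are illustrative assumptions.
    """
    T, d_k = q.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)  # base linear-attention state
    R = torch.zeros(d_k, d_v)  # auxiliary residual state
    outputs = []
    for t in range(T):
        q_t, k_t, v_t = q[t], k[t], v[t]
        # Base prediction plus correction read out from the residual state.
        y = q_t @ S + q_t @ R
        outputs.append(y)
        # Standard linear-attention state update (outer-product write).
        S = S + torch.outer(k_t, v_t)
        # Residual of the base state's reconstruction of v_t, clipped for
        # stability (a stand-in for RDN-style residual clipping), then
        # accumulated into the auxiliary state.
        err = torch.clamp(v_t - k_t @ S, -clip, clip)
        R = R + torch.outer(k_t, err)
    return torch.stack(outputs)
```

In this sketch the correction shares the query with the base read-out; a practical version would vectorize the loop with chunked linear-attention kernels and add the adaptive gating the abstract mentions.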
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2825