Abstract: Linear attention enhances the inference efficiency of Transformers and has attracted research interest as an efficient backbone for language models. Existing linear-attention-based models usually exploit decay-factor-based positional encoding (PE), where attention scores decay exponentially with increasing relative distance. However, most work manually designs a non-trainable base for the exponential decay (the decay factor), which limits further optimization. Our analysis reveals that directly training the decay factor is unstable. To address this problem, we propose a novel PE for linear attention named Disentangle to Decay (D2D). D2D disentangles the decay factor into two parts to achieve further optimization and stable training. Moreover, D2D can be transformed into a recurrent form for efficient inference. Experiments demonstrate that D2D achieves stable training of the decay factor and enhances the performance of linear attention in both normal context length and length extrapolation scenarios.
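For reference, below is a minimal sketch of the baseline mechanism the abstract refers to: linear attention in recurrent form with a decay-factor-based PE, where past key-value contributions are down-weighted exponentially with relative distance. The fixed `gamma` here stands in for the hand-designed, non-trainable decay factor criticized in the abstract; it is an illustrative assumption and does not show the D2D parameterization, whose two disentangled parts are not specified in this abstract.

```python
# Sketch of decay-based linear attention in recurrent form (baseline, not D2D).
import torch

def recurrent_linear_attention(q, k, v, gamma=0.98):
    """q, k, v: (seq_len, d). `gamma` is a hypothetical fixed decay factor."""
    seq_len, d = q.shape
    state = torch.zeros(d, d)  # running sum of outer(k_t, v_t), decayed each step
    outputs = []
    for t in range(seq_len):
        # Multiplying the state by gamma every step means a pair (k_s, v_s)
        # contributes with weight gamma**(t - s): exponential decay in distance.
        state = gamma * state + torch.outer(k[t], v[t])
        outputs.append(q[t] @ state)
    return torch.stack(outputs)

# Toy usage
q, k, v = (torch.randn(16, 8) for _ in range(3))
print(recurrent_linear_attention(q, k, v).shape)  # torch.Size([16, 8])
```

The recurrent form keeps only a d-by-d state per step, which is what makes decay-based linear attention attractive for efficient inference.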
Paper Type: Long
Research Area: Generation
Research Area Keywords: efficient models; model architectures; inference methods
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches for low compute settings-efficiency, Theory
Languages Studied: English
Submission Number: 5573