Disentangle to Decay: Linear Attention with Trainable Positional Decay for Length Extrapolation

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone
Abstract: The Transformer architecture has significantly advanced Natural Language Processing (NLP) by delivering outstanding performance. However, it faces challenges with efficiency and with processing long sequences, owing to its quadratic time complexity. Linear attention offers a more efficient, linear-time alternative but falls short of the traditional Transformer in language modelling and length extrapolation. To enhance the performance of linear attention and fully leverage its capability for modelling long sequences, we begin with positional encoding, specifying the constraints that linear attention imposes on positional encoding. Building upon these constraints, we design a positional encoding for linear attention, named Disentangle to Decay (D2D), which allows for a seamless conversion between absolute positional encoding (APE) and relative positional encoding (RPE). To alleviate the instability of directly training D2D, we disentangle it into a combination of RPE and APE, which greatly improves training stability while preserving training efficiency. Experimental results show that applying D2D to linear attention significantly improves performance in language modelling and length extrapolation, demonstrating strong competitiveness with the vanilla Transformer and outperforming other positional encodings.
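
The abstract does not give D2D's exact construction, but the core mechanism it builds on, a trainable positional decay whose relative form lam**(t-s) factors into absolute per-position terms lam**t and lam**(-s), can be illustrated with a generic decayed linear attention. The sketch below is an assumption-level illustration, not the paper's method: the scalar decay lam, the function names, and the single-head NumPy setup are placeholders chosen for clarity.

import numpy as np

def decayed_linear_attention_recurrent(q, k, v, lam):
    """RPE-style recurrent form: the running state decays by `lam` each step.
    q, k: (T, d_k); v: (T, d_v); lam: scalar in (0, 1), trainable in practice.
    Output: o_t = sum_{s <= t} lam**(t - s) * (q_t . k_s) * v_s.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        state = lam * state + np.outer(k[t], v[t])  # decay old memory, add new key-value pair
        out[t] = q[t] @ state
    return out

def decayed_linear_attention_parallel(q, k, v, lam):
    """APE-style parallel form: absorb the relative decay into absolute
    position-dependent scalings q_t <- lam**t * q_t and k_s <- lam**(-s) * k_s.
    (In practice the factors are rescaled for numerical stability on long sequences.)
    """
    T = q.shape[0]
    pos = np.arange(T)
    q_abs = q * (lam ** pos)[:, None]      # absolute factor lam**t on queries
    k_abs = k * (lam ** (-pos))[:, None]   # absolute factor lam**(-s) on keys
    scores = q_abs @ k_abs.T               # entry (t, s) equals lam**(t-s) * (q_t . k_s)
    scores = np.tril(scores)               # causal mask: keep s <= t
    return scores @ v

# Usage: the two forms agree, illustrating the APE <-> RPE equivalence.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 4))
print(np.allclose(decayed_linear_attention_recurrent(q, k, v, 0.9),
                  decayed_linear_attention_parallel(q, k, v, 0.9)))

The recurrent form corresponds to the RPE view (the state decays once per step), while the parallel form corresponds to the APE view (each position carries an absolute scaling); their outputs coincide, which is the kind of seamless conversion between APE and RPE that the abstract describes.
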
Paper Type: long
Research Area: Machine Learning for NLP
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency, Theory
Languages Studied: English