Disentangle to Decay: Linear Attention with Trainable Positional Decay for Length Extrapolation

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone
Abstract: The Transformer architecture has significantly advanced Natural Language Processing (NLP) by delivering outstanding performance. However, it faces challenges with efficiency and with processing long sequences, owing to its quadratic time complexity. Linear attention offers a more efficient, linear-time alternative but falls short of the traditional Transformer in language modelling and length extrapolation. To enhance the performance of linear attention and fully leverage its capability for modelling long sequences, we begin with positional encoding, specifying the constraints that linear attention imposes on positional encoding. Building upon these constraints, we design a positional encoding for linear attention, named Disentangle to Decay (D2D), which allows for a seamless conversion between absolute positional encoding (APE) and relative positional encoding (RPE). To alleviate the instability of directly training D2D, we disentangle it into a combination of RPE and APE, which greatly improves training stability while preserving training efficiency. Experimental results show that applying D2D to linear attention significantly improves performance in language modelling and length extrapolation, demonstrating strong competitiveness with the vanilla Transformer and outperforming other positional encodings.
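
The abstract does not give D2D's exact construction, but the core mechanism it builds on, a trainable positional decay whose relative form lam**(t-s) factors into absolute per-position terms lam**t and lam**(-s), can be illustrated with a generic decayed linear attention. The sketch below is an assumption-level illustration, not the paper's method: the scalar decay lam, the function names, and the single-head NumPy setup are placeholders chosen for clarity.

import numpy as np

def decayed_linear_attention_recurrent(q, k, v, lam):
    """RPE-style recurrent form: the running state decays by `lam` each step.
    q, k: (T, d_k); v: (T, d_v); lam: scalar in (0, 1), trainable in practice.
    Output: o_t = sum_{s <= t} lam**(t - s) * (q_t . k_s) * v_s.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        state = lam * state + np.outer(k[t], v[t])  # decay old memory, add new key-value pair
        out[t] = q[t] @ state
    return out

def decayed_linear_attention_parallel(q, k, v, lam):
    """APE-style parallel form: absorb the relative decay into absolute
    position-dependent scalings q_t <- lam**t * q_t and k_s <- lam**(-s) * k_s.
    (In practice the factors are rescaled for numerical stability on long sequences.)
    """
    T = q.shape[0]
    pos = np.arange(T)
    q_abs = q * (lam ** pos)[:, None]      # absolute factor lam**t on queries
    k_abs = k * (lam ** (-pos))[:, None]   # absolute factor lam**(-s) on keys
    scores = q_abs @ k_abs.T               # entry (t, s) equals lam**(t-s) * (q_t . k_s)
    scores = np.tril(scores)               # causal mask: keep s <= t
    return scores @ v

# Usage: the two forms agree, illustrating the APE <-> RPE equivalence.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 4))
print(np.allclose(decayed_linear_attention_recurrent(q, k, v, 0.9),
                  decayed_linear_attention_parallel(q, k, v, 0.9)))

The recurrent form corresponds to the RPE view (the state decays once per step), while the parallel form corresponds to the APE view (each position carries an absolute scaling); their outputs coincide, which is the kind of seamless conversion between APE and RPE that the abstract describes.
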
Paper Type: long
Research Area: Machine Learning for NLP
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency, Theory
Languages Studied: English