Abstract: Length extrapolation algorithms based on Rotary Position Embedding (RoPE) have shown promising results in extending the context length of language models. However, how position embeddings capture longer-range contextual information remains poorly understood.
Based on the intuition that different dimensions of the RoPE encoding correspond to different frequencies of change (see the sketch below), we conduct a dimension-level analysis to investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies.
Using our correlation metric, we identify a particular type of attention head, which we name \emph{Positional Heads}, in various length-extrapolated models.
These heads exhibit a strong focus on long-range information interaction and play a pivotal role in processing long inputs, as evidenced by our ablation study.
We further demonstrate a correlation between the efficiency of length extrapolation and the extension of these heads' high-dimensional attention allocation.
The identification of Positional Heads provides insights for future research in long-text comprehension.
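The following is a minimal sketch, not the paper's code, illustrating the frequency intuition the abstract refers to: under the standard RoPE parameterization (rotation frequency base^(-2i/d) for dimension pair i, with the conventional base of 10000), low-index dimension pairs rotate quickly across positions while high-index pairs change only over long distances. Function and variable names here are illustrative assumptions.

```
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation frequency for each 2D dimension pair of one attention head."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

def rotation_angles(position: int, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Angle by which each dimension pair is rotated at a given token position."""
    return position * rope_frequencies(head_dim, base)

if __name__ == "__main__":
    freqs = rope_frequencies(head_dim=128)
    # Low-index pairs rotate quickly (short-range sensitivity); high-index pairs
    # rotate slowly, so they vary appreciably only over long distances --
    # the intuition behind the dimension-level analysis described above.
    print("fastest pair frequency:", freqs[0])   # 1.0
    print("slowest pair frequency:", freqs[-1])  # ~1e-4
```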
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: feature attribution
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1642