Keywords: Attention sink; GPT-2
Abstract: Transformers commonly exhibit an attention sink: disproportionately high attention to the first position. We study this behavior in GPT-2–style models with learned query biases and absolute positional embeddings. Combining analysis with targeted interventions, we find that the sink arises from the interaction among (i) a learned query bias, (ii) the first-layer transformation of the positional encoding, and (iii) structure in the key projection. Together with observations of sinks in models without query biases or absolute positional embeddings (e.g., ALiBi), this indicates that attention sinks do not arise from a single universal mechanism but instead depend on the architecture. These findings inform the mitigation of attention sinks and motivate broader investigation of sink mechanisms across architectures.
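The mechanism the abstract describes can be illustrated with a minimal, self-contained sketch (not the paper's actual analysis): if the key of the first position carries a distinctive component (a stand-in for the first-layer transform of the absolute positional embedding) and the learned query bias aligns with that same direction, softmax attention concentrates on position 0. All dimensions, magnitudes, and the `sink_dir` vector below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 16  # head dimension (illustrative)
T = 8   # sequence length (illustrative)

# Keys: position 0 gets a distinctive component, standing in for
# structure contributed by the positional encoding at the first position.
keys = rng.normal(size=(T, d))
sink_dir = rng.normal(size=d)
sink_dir /= np.linalg.norm(sink_dir)
keys[0] += 6.0 * sink_dir

# Query = content part + a "learned query bias" aligned with sink_dir.
query = rng.normal(size=d) + 6.0 * sink_dir

# Scaled dot-product attention scores for this single query.
attn = softmax(keys @ query / np.sqrt(d))
print(attn[0], attn[1:].mean())  # mass on position 0 dominates
```

Removing either the bias term on `query` or the extra component on `keys[0]` flattens the distribution, mirroring the interaction (rather than any single factor) that the paper identifies.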
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: counterfactual / contrastive explanations
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1158