Attention Sink Is Sinking Causality: Causal Interpretation of Self-Attention in Decoder Language Models and Mitigating Attention Sink for Improved Interpretability

ACL ARR 2025 May Submission4890 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Self-attention is widely regarded as a key mechanism enabling Transformers to dynamically focus on relevant input tokens. However, this focusing process can become distorted by attention sinks: tokens such as the beginning-of-sequence marker or other function words that receive disproportionately high attention weights despite offering minimal semantic contribution. In this paper, we study the causal significance of self-attention in decoder-based Large Language Models (LLMs) for classification tasks, with a particular emphasis on how these attention sinks impact interpretability. We first document the prevalence of attention sinks across diverse sentiment and short-prompt classification datasets, revealing that seemingly crucial tokens often have little causal influence on final predictions, which makes the LLM's decisions hard to interpret and effectively renders it a black-box model. We then propose and evaluate mitigation strategies, such as reweighting the attention distribution to reduce the effect of attention sinks. Empirical results show that these techniques improve alignment between attention weights and truly influential tokens, thereby enhancing the causal interpretability of the self-attention mechanism. Our findings underscore the importance of identifying and alleviating attention sinks, particularly for applications where transparent and trustworthy model explanations are paramount.
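To illustrate the reweighting idea mentioned in the abstract, the sketch below downweights a presumed sink position (e.g., the BOS token at index 0) in a softmax attention map and renormalizes the remaining weights. This is a minimal sketch under assumed tensor shapes; the function name, the choice of sink index, and the renormalization scheme are illustrative assumptions, not the paper's actual method.

```python
import torch

def reweight_attention(attn, sink_idx=0, eps=1e-9):
    """Zero out a suspected attention-sink column (e.g., the BOS token)
    and renormalize each row so the weights still sum to 1.

    attn: [batch, heads, query_len, key_len] softmax attention weights.
    sink_idx: assumed index of the sink token (position 0 here).
    """
    reweighted = attn.clone()
    reweighted[..., sink_idx] = 0.0                   # remove the sink's attention mass
    row_sums = reweighted.sum(dim=-1, keepdim=True)   # remaining mass per query row
    return reweighted / (row_sums + eps)              # redistribute over non-sink tokens

# Example: a single row where the first (BOS) position absorbs most of the attention.
attn = torch.softmax(torch.tensor([[[[5.0, 1.0, 1.2, 0.8]]]]), dim=-1)
print(reweight_attention(attn))  # mass now spread over the non-sink tokens
```

One design choice worth noting: renormalizing after removing the sink column keeps each query's weights a valid probability distribution, so downstream attribution scores remain comparable across tokens.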
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Self-Attention, Causal Interpretability, Attention Sink, Sentiment Classification, Transformer Models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4890