Keywords: attention mechanisms, transformer architectures, attention sinks, large language models, interpretability, self-attention, positional encoding, empirical study
TL;DR: We demonstrate that the first token overwhelmingly dominates as a singular attention sink across diverse LLM architectures, with strength varying by model family and input text characteristics.
Abstract: Large Language Models rely on "attention sinks"—initial sequence tokens that accumulate disproportionate attention—for efficient context management. However, the precise formation and positional dominance of these natural sinks remain under-characterized. We present the first systematic empirical study investigating attention sink patterns across three LLM families (GPT-2, Llama, Mistral) and five text categories. Our analysis reveals that the absolute first token (P1) overwhelmingly serves as the dominant natural attention sink, attracting significantly more attention ($p < 0.001$, Cohen's $d > 6.0$) than subsequent initial tokens across all architectures. While P1 dominance is universal, its strength varies by model family—Mistral exhibits the strongest P1 reliance—and is significantly modulated by input characteristics, with short texts eliciting maximal P1 attention and code texts minimal. These findings challenge assumptions about distributed sink importance and provide foundational insights for designing efficient long-context models.
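Illustrative measurement sketch (not the authors' code): one way to probe P1 dominance of the kind the abstract describes is to extract per-layer attention maps from a causal LM and average the attention each early position receives from later query tokens. The model choice (gpt2), the example text, and the pooling over layers and heads are assumptions for illustration only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed: any causal LM that can return attention weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "Attention sinks concentrate attention mass on the first token."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.attentions: tuple of num_layers tensors, each (batch, heads, seq, seq)
attn = torch.stack(out.attentions).squeeze(1)  # (layers, heads, seq, seq)

# Average attention each of the first k positions receives from strictly later
# query tokens, pooled over all layers and heads.
k = 4  # compare P1 against the next few initial positions (illustrative choice)
for pos in range(k):
    received = attn[:, :, pos + 1:, pos]  # queries after `pos` attending to `pos`
    print(f"P{pos + 1}: mean attention received = {received.mean().item():.4f}")
```

Under this kind of measurement, the paper's claim corresponds to the P1 average being far larger than the P2 to P4 averages, consistently across layers, heads, model families, and text categories.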
Submission Number: 70