Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification

06 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: long context, llms, attention mechanism, sparse attention
TL;DR: We study a simple efficient criterion to query-adaptively determine whether an attention head processes long-context information.
Abstract: The ability to process long contexts is crucial for many natural language processing tasks, yet it remains a significant challenge. While substantial progress has been made in improving the efficiency of attention mechanisms, there is still a gap in understanding how attention heads function in long-context settings. In this paper, we observe that while certain heads consistently attend only to local information, others alternate between attending to local and long-context information depending on the query. This raises the question: can we identify which heads require long-context information to predict the next token accurately? We demonstrate that it is possible to predict which heads are crucial for long-context processing using only local keys: the core idea is to exploit a simple model of the long-context attention scores via second-moment approximations. These findings stand in contrast to earlier non-adaptive sparsification schemes.
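The abstract's core idea can be illustrated with a minimal sketch. Assume per-head attention scores over the unseen long-context keys are modeled as Gaussian, with mean and variance estimated from the scores on the local keys alone; a second-moment identity, E[exp(s)] = exp(mu + var/2), then predicts the softmax mass the long context would receive, and heads whose predicted long-context mass is large are flagged. All function and parameter names here (`predict_long_context_heads`, `n_long`, `tau`) are hypothetical illustrations, not the paper's actual criterion.

```python
import numpy as np

def predict_long_context_heads(q, K_local, n_long, tau=0.5):
    """Flag heads whose predicted long-context softmax mass exceeds tau.

    q:       (H, d)    per-head query vectors
    K_local: (H, L, d) local (recent) keys per head
    n_long:  number of long-context keys we have NOT scored
    """
    H, d = q.shape
    # Scaled dot-product scores on the local keys only: (H, L)
    scores = np.einsum('hd,hld->hl', q, K_local) / np.sqrt(d)
    # Second-moment (Gaussian) model of the unseen long-context scores,
    # with moments estimated from the local scores.
    mu = scores.mean(axis=1)
    var = scores.var(axis=1)
    # E[exp(s)] = exp(mu + var/2) under the Gaussian model, so the
    # predicted total unnormalized mass of the long context is:
    long_mass = n_long * np.exp(mu + var / 2.0)
    local_mass = np.exp(scores).sum(axis=1)
    frac = long_mass / (long_mass + local_mass)
    return frac > tau  # boolean mask over heads
```

In practice one would subtract the per-head maximum score before exponentiating for numerical stability; the sketch omits this for brevity.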
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2645