HSA: Head-wise Sparse Attention for Efficient and Accurate Long-context Inference

12 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Head-wise Sparse Attention, Efficient Attention, Long-context
TL;DR: We propose HSA, a hybrid architecture that introduces sparsity at the KV-head level.
Abstract: Transformer architectures have become the foundation of large language models (LLMs), excelling at sequence modeling via the self-attention mechanism. However, the quadratic computational complexity and linear KV-cache growth of self-attention limit scalability in long-context scenarios. Sparse attention mechanisms, especially sliding window attention (SWA), reduce these costs but inevitably restrict access to global context, which can degrade performance on tasks requiring long-range dependencies. Hybrid architectures that alternate full-attention and SWA layers mitigate this issue, but their layer-wise sparsity pattern introduces a 'weakest-link' effect: global context is inevitably lost in the sparse layers, and the degradation grows more severe as the proportion of such layers increases. In this work, we introduce Head-wise Sparse Attention (HSA), a hybrid architecture that applies sparsity at the KV-head level rather than the layer level. Instead of imposing a uniform sparsity pattern across all heads in a layer, HSA retains a subset of KV heads with full attention to preserve long-range dependencies and converts the rest to SWA for efficiency. This head-wise design ensures that every layer maintains global context through at least one full-attention KV head while reducing both computation and KV-cache requirements. To decide which heads should remain global, we introduce a discrepancy-based post-training selection strategy that preserves the heads essential for capturing global context and converts the rest to sparse form; we then continue training to adapt the model to the new KV-head sparsity pattern. Extensive experiments on both public and in-house benchmarks show that HSA consistently outperforms prior layer-wise sparse designs while maintaining efficiency, with the advantage being especially pronounced in long-context scenarios.
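
For concreteness, the PyTorch sketch below illustrates one way a head-wise sparsity pattern of this kind might be realized: a boolean `global_heads` flag marks which heads keep full causal attention while the rest use a sliding window, and a simple per-head output-discrepancy score stands in for the paper's selection criterion. This is a minimal illustration under assumed details (standard multi-head attention with one KV cache per head; under GQA the flag would apply per KV group), not the authors' implementation.

```python
import torch


def headwise_sparse_attention(q, k, v, global_heads, window=256):
    """Causal attention where each head is either global or sliding-window.

    q, k, v: (batch, heads, seq_len, head_dim)
    global_heads: bool tensor of shape (heads,); True = full attention,
        False = sliding-window attention with the given window size.
    """
    b, h, n, d = q.shape
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / d ** 0.5

    # Causal mask shared by all heads: query i may attend to keys j <= i.
    idx = torch.arange(n, device=q.device)
    causal = idx[None, :] <= idx[:, None]                      # (n, n)

    # Sliding-window mask: only the most recent `window` keys are visible.
    local = causal & (idx[:, None] - idx[None, :] < window)    # (n, n)

    # Select the mask per head according to the head-wise sparsity pattern.
    mask = torch.where(global_heads.view(1, h, 1, 1),
                       causal.view(1, 1, n, n),
                       local.view(1, 1, n, n))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.einsum("bhqk,bhkd->bhqd", scores.softmax(dim=-1), v)


def head_discrepancy(q, k, v, window=256):
    """Per-head distance between full-attention and SWA outputs.

    A hypothetical proxy for a discrepancy-based selection criterion:
    heads with the largest scores would be kept global, the rest
    converted to SWA before continued training.
    """
    h = q.shape[1]
    all_global = torch.ones(h, dtype=torch.bool, device=q.device)
    full = headwise_sparse_attention(q, k, v, all_global, window)
    swa = headwise_sparse_attention(q, k, v, ~all_global, window)
    return (full - swa).norm(dim=-1).mean(dim=(0, 2))          # (heads,)
```

In this sketch, keeping at least one entry of `global_heads` set to True per layer preserves a full-attention path to the entire context, while the remaining heads only cache and attend over the last `window` tokens, which is where the compute and KV-cache savings would come from.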
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4577