S2-Attention: Hardware-Aware Context Sharding Among Attention Heads

ICLR 2025 Conference Submission 13347 Authors

28 Sept 2024 (modified: 28 Nov 2024), ICLR 2025 Conference Submission, CC BY 4.0
Keywords: efficient transformer, kernel, sparse attention, long context, efficiency, infrastructure, pre-training, inference, sparsity, software library
TL;DR: We implement a hardware-aware sparse self-attention kernel that supports highly customizable sparsity patterns. We pretrain models with a wide range of sparsity patterns and propose Sparsely-Sharded Attention, which achieves perfect 128k needle retrieval and a 20X speed-up.
Abstract: Sparse attention, which selectively attends to a subset of tokens in the context, has been an established approach to enhance the efficiency of Transformers. However, its theoretical reduction in FLOPs has rarely translated into wall-clock speed-ups over its dense attention counterparts, mainly due to the lack of hardware-level optimizations like FlashAttention. Meanwhile, it remains unclear whether sparse attention can maintain model quality at the scale of today's large language models (LLMs), and how this can be achieved. This paper presents Sparsely-Sharded (S2) Attention, a Triton library that provides kernel optimizations for sparse attention customizable at both per-head and per-context-range levels. S2-Attention enables the exploration of novel, high-performance sparse attention techniques, which we demonstrate through extensive ablations across a wide range of sparse attention designs at various model scales. From these insights, we present several basic guidelines for designing sparse attention that achieves not only practical efficiency improvements but also strong downstream performance. To achieve high parallelization and optimized memory IO, sparse attention should shard the context heterogeneously across attention heads, where each head attends to a different subset of tokens while the heads collectively cover the full context. Meanwhile, we find hybrid architectures combining sparse and dense attention particularly beneficial in practice. These design choices lead to a novel sparse attention architecture, which we evaluate with 1.3B and 7B models. It achieves wall-clock speed-ups of 8.79X, 15.87X, and 25.3X over the strong FlashAttention-2 baseline, with downstream performance on par with full attention and perfect retrieval performance at a 128k context length. In inference, a 7B model using our S2-Attention kernel achieves a 4.5X speed-up over its dense counterpart. S2-Attention will be released with easy-to-customize APIs for direct use in Megatron and vLLM. We hope it will help future research develop sparse attention algorithms that improve the efficiency of large language models.
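To make the heterogeneous-context-sharding idea concrete, below is a minimal PyTorch sketch, not the S2-Attention Triton kernel described in the paper: each head attends to a local causal window plus its own strided shard of the distant context, and the shards across heads jointly cover every token. The function names, the strided shard pattern, and the local_window parameter are illustrative assumptions; a real kernel would skip the masked blocks entirely rather than materialize a dense mask.

import torch

def context_sharding_mask(seq_len: int, num_heads: int, local_window: int = 64) -> torch.Tensor:
    """Illustrative per-head sparsity mask: head h keeps every num_heads-th key
    starting at offset h (its shard of the context) plus a recent local window,
    under the causal constraint. The union of all heads covers every position.
    Returns a bool mask of shape (num_heads, seq_len, seq_len); True = attend."""
    q_idx = torch.arange(seq_len).view(1, seq_len, 1)
    k_idx = torch.arange(seq_len).view(1, 1, seq_len)
    head = torch.arange(num_heads).view(num_heads, 1, 1)

    causal = k_idx <= q_idx                    # decoder-only causal constraint
    local = (q_idx - k_idx) < local_window     # every head keeps recent tokens
    shard = (k_idx % num_heads) == head        # a different shard per head
    return causal & (local | shard)

def sharded_sparse_attention(q, k, v, local_window: int = 64):
    """q, k, v: (batch, heads, seq, dim). Dense reference implementation of the
    masked attention; only meant to show the sparsity pattern, not the speed-up."""
    b, h, s, d = q.shape
    mask = context_sharding_mask(s, h, local_window).to(q.device)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 4 heads over a 256-token context. Each head sees a different quarter
# of the distant context, but together the heads cover all positions.
q = k = v = torch.randn(1, 4, 256, 64)
out = sharded_sparse_attention(q, k, v, local_window=32)
print(out.shape)  # torch.Size([1, 4, 256, 64])

In this sketch the per-head patterns are fixed and strided purely for illustration; the library described above exposes the pattern as a per-head, per-context-range configuration instead.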
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13347