BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: infrastructure, software libraries, hardware, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Models; Distributed Deep Learning; Long-sequence Processing; Transformer Attention
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution to this long-sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overheads to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named ``BurstAttention'' to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long-sequence processing. The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, reducing communication overheads by 40\% and achieving a 2$\times$ speedup when training on sequences of length 32K with 8$\times$A100 GPUs.
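The abstract's central point, that local (per-device) attention results over sequence chunks can be aggregated into an exact global result, can be illustrated with online softmax. Below is a minimal, single-process NumPy sketch of this aggregation; the function name `blockwise_attention`, the block layout, and all shapes are illustrative assumptions, not BurstAttention's actual API or implementation.

```python
import numpy as np

def blockwise_attention(q, k_blocks, v_blocks):
    # Illustrative sketch (not BurstAttention itself): aggregate per-block
    # attention results into a global one with numerically stable online
    # softmax, never materializing the full score matrix.
    d = q.shape[-1]
    m = np.full(q.shape[:-1], -np.inf)   # running row-wise max of scores
    l = np.zeros(q.shape[:-1])           # running softmax denominator
    o = np.zeros_like(q)                 # running (unnormalized) output

    for k_blk, v_blk in zip(k_blocks, v_blocks):
        s = q @ k_blk.T / np.sqrt(d)     # local scores for this block
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[..., None]) # local softmax numerator
        scale = np.exp(m - m_new)        # rescale previously accumulated results
        l = scale * l + p.sum(axis=-1)
        o = scale[..., None] * o + p @ v_blk
        m = m_new

    return o / l[..., None]

# Usage: split K/V along the sequence dimension, as if each chunk lived on a device.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))         # 4 queries, head dim 16
k = rng.standard_normal((32, 16))
v = rng.standard_normal((32, 16))
out = blockwise_attention(q, np.split(k, 4), np.split(v, 4))

# Reference: standard softmax attention over the full sequence.
s = q @ k.T / np.sqrt(16)
ref = np.exp(s - s.max(-1, keepdims=True))
ref = ref / ref.sum(-1, keepdims=True) @ v
assert np.allclose(out, ref)
```

In a real distributed setting, each block would reside on a different GPU and the running statistics would be exchanged between devices, which is where the memory and communication overheads discussed in the abstract arise.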
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1934