Hierarchy-Aided Sparse Attention for Fast LLM Prefilling Inference

ICLR 2025 Conference Submission 1328 Authors

17 Sept 2024 (modified: 13 Oct 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Long-Context LLM; Pre-Filling Acceleration; Sparse Attention
TL;DR: A method for accelerating the pre-filling phase of LLMs using hierarchical attention.
Abstract: Pre-filling Large Language Models (LLMs) with long-context inputs is computationally expensive due to the quadratic complexity of full attention. While global attention is essential during decoding, its importance diminishes during pre-filling, where the goal is to contextualize tokens rather than predict the next one. Building on prior work, we apply diagonal block sparse attention during the pre-filling phase, reducing attention-related FLOPs by over 90\% without significant degradation in language modeling performance. To close the remaining performance gap, we propose \textbf{H}ierarchy-\textbf{A}ided \textbf{S}parse \textbf{A}ttention (HASA), which adds a specialized transformer branch. This branch extracts a global embedding from each chunk and aligns local attention with full attention, enabling cross-chunk interaction. HASA stabilizes sparse attention computation, making the pre-filling phase highly efficient, particularly in long-sequence scenarios. While HASA substantially accelerates pre-filling, the interaction between global embeddings across chunks preserves robust language modeling performance and prevents the degradation typically observed with sparse attention mechanisms. Since few existing methods specifically target pre-filling acceleration, our baselines include various open-source long-context models. Across multiple benchmarks, HASA not only maintains performance but also outperforms the baselines in certain scenarios. We will release the models upon acceptance.
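
The sketch below illustrates the core idea described in the abstract: diagonal block sparse attention for prefilling, where each chunk attends only within itself, plus a toy per-chunk pooled "global embedding" as a stand-in for the paper's hierarchical branch. This is a minimal illustration, not the authors' implementation; the chunk size, the mean-pooling choice, and the function names (block_diagonal_prefill_attention, chunk_global_embeddings) are assumptions introduced here for clarity.

```python
# Minimal sketch (not the authors' code) of diagonal block sparse attention
# for the prefilling phase: each chunk attends only to itself, so attention
# cost grows linearly in the number of chunks rather than quadratically in
# sequence length. Chunk size and mean pooling are illustrative assumptions.
import math
import torch
import torch.nn.functional as F


def block_diagonal_prefill_attention(q, k, v, chunk_size):
    """Causal attention restricted to diagonal blocks of size `chunk_size`.

    q, k, v: (batch, heads, seq_len, head_dim); this simplified sketch
    assumes seq_len is a multiple of chunk_size.
    """
    b, h, n, d = q.shape
    assert n % chunk_size == 0, "sketch assumes seq_len % chunk_size == 0"
    c = n // chunk_size  # number of chunks

    # Reshape so each chunk becomes an independent attention problem.
    qc = q.view(b, h, c, chunk_size, d)
    kc = k.view(b, h, c, chunk_size, d)
    vc = v.view(b, h, c, chunk_size, d)

    scores = torch.matmul(qc, kc.transpose(-1, -2)) / math.sqrt(d)

    # Causal mask inside each chunk; tokens never attend to other chunks.
    causal = torch.triu(
        torch.ones(chunk_size, chunk_size, dtype=torch.bool, device=q.device),
        diagonal=1,
    )
    scores = scores.masked_fill(causal, float("-inf"))

    out = torch.matmul(F.softmax(scores, dim=-1), vc)
    return out.view(b, h, n, d)


def chunk_global_embeddings(hidden, chunk_size):
    """Toy stand-in for the global branch: one pooled vector per chunk,
    which a separate (hypothetical) transformer branch could let interact
    across chunks to restore cross-chunk information flow."""
    b, n, d = hidden.shape
    c = n // chunk_size
    return hidden.view(b, c, chunk_size, d).mean(dim=2)  # (batch, chunks, dim)


if __name__ == "__main__":
    b, h, n, d, chunk = 1, 4, 256, 64, 64
    q = torch.randn(b, h, n, d)
    k = torch.randn(b, h, n, d)
    v = torch.randn(b, h, n, d)
    out = block_diagonal_prefill_attention(q, k, v, chunk)
    print(out.shape)  # torch.Size([1, 4, 256, 64])
```

Because the diagonal blocks are independent, the cost of the prefilling attention scales with (number of chunks) × (chunk_size)^2 instead of (sequence length)^2, which is the source of the FLOP reduction the abstract reports; the cross-chunk interaction that HASA adds is only gestured at here by the pooled embeddings.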
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1328