SemSA: Semantic Sparse Attention is hidden in Large Language Models.

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: large language model, sparse attention, efficient transformer
TL;DR: Semantic Sparse Attention improves transformer models with unique attention masks, achieving a 15.5x speedup in attention layers and 3.79x end-to-end speedup on OPT-6.7B, without costly retraining.
Abstract: Sparse attention is one of the most effective approaches for addressing the $O(N^2)$ attention complexity of transformer models. Existing methods manually design a uniform sparse attention mask shared by all attention heads. However, uniform masks treat every attention head equally: to preserve the necessary attention for important heads, the masks must stay unnecessarily dense for unimportant heads, limiting the overall sparsity and wall-clock speedup. We therefore propose the Semantic Sparse Attention (SemSA) paradigm, which uses statistical information to evaluate, generate, and optimize a different sparse attention mask for each head. We observe that the acquired attention masks learn distinct semantic information from the dense pre-trained large language model: some heads focus on content, while others mainly encode token positions. We optimize SemSA GPU operators and evaluate them on the popular large language models OPT-6.7B (2k tokens) and Llama2-7B (4k tokens). Compared with dense PyTorch models, SemSA achieves $4.18\sim11.67\times$ attention-layer speedup and $1.36\sim2.34\times$ first-token-latency speedup with negligible accuracy loss. Compared with other sparse attention methods optimized with a state-of-the-art sparse framework, SemSA achieves up to $1.6\times$ higher sparsity and $1.4\times$ attention speedup with higher accuracy.
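The paper's code is not included here. As a rough, minimal sketch of the per-head mask idea described in the abstract (not the authors' method), one could profile each head's dense attention on calibration prompts and assign it the sparsest local window that still retains most of its attention mass. The helper names (`local_window_mask`, `choose_per_head_masks`), the window budgets, and the 95% coverage heuristic are illustrative assumptions, not details taken from the paper.

```python
import torch


def local_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal band mask: each query may attend to the previous `window` tokens."""
    idx = torch.arange(seq_len)
    dist = idx[:, None] - idx[None, :]          # query index minus key index
    return (dist >= 0) & (dist < window)        # True = attention kept


def choose_per_head_masks(attn_probs: torch.Tensor,
                          budgets=(64, 256, 1024),
                          coverage: float = 0.95) -> list:
    """
    attn_probs: [num_heads, seq_len, seq_len] attention probabilities averaged
    over calibration prompts from the dense pre-trained model.

    For each head, pick the narrowest window whose mask still keeps `coverage`
    of that head's attention mass; otherwise fall back to the widest budget.
    (A simple stand-in for the paper's statistical mask evaluation/optimization.)
    """
    num_heads, seq_len, _ = attn_probs.shape
    masks = []
    for h in range(num_heads):
        chosen = local_window_mask(seq_len, budgets[-1])
        for w in sorted(budgets):
            mask = local_window_mask(seq_len, w)
            kept = (attn_probs[h] * mask.float()).sum() / attn_probs[h].sum()
            if kept >= coverage:
                chosen = mask
                break
        masks.append(chosen)
    return masks
```

In this sketch, a head whose probability mass concentrates near the diagonal (roughly the "positional" heads the abstract mentions) receives a narrow window, while "content" heads keep a wider one; the boolean masks would then be applied as additive $-\infty$ biases or block-sparse kernels before the softmax. SemSA's actual mask evaluation, generation, and optimization is described in the paper itself.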
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3500