CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification

ACL ARR 2024 June Submission 4751 Authors

16 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements. Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference. Existing methods typically apply thresholding-based sparsification based on the statistics of activation tensors. However, these methods do not model the impact of activation sparsification on performance, leading to significant performance degradation. To address this issue, this paper reformulates the activation sparsification problem and proposes CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification. First, channel-wise thresholding assigns a unique threshold to each activation channel in the FFN layers. Then, selective sparsification chooses specific layers in the attention modules to which thresholding-based activation sparsification is applied. Finally, this paper details the implementation of sparse kernels that accelerate LLM inference. Experimental results demonstrate that the proposed CHESS achieves lower performance degradation across 8 downstream tasks while activating fewer parameters, speeding up LLM inference by up to 1.27x.
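To make the channel-wise thresholding idea from the abstract concrete, below is a minimal Python sketch of zeroing out FFN activations that fall below a per-channel threshold. The function name, tensor shapes, and the way thresholds are obtained are illustrative assumptions, not the paper's actual implementation; in practice the thresholds would be calibrated from activation statistics, and the resulting zeros would be exploited by the sparse kernels the paper describes.

```python
# Illustrative sketch of channel-wise thresholding (not the authors' code).
# Assumption: per-channel thresholds are precomputed offline, e.g. from
# activation statistics on a calibration set.
import torch


def channel_wise_threshold(x: torch.Tensor, thresholds: torch.Tensor) -> torch.Tensor:
    """Zero out activations whose magnitude is below their channel's threshold.

    x:          activations of shape (..., hidden_dim), e.g. FFN intermediate outputs
    thresholds: per-channel thresholds of shape (hidden_dim,)
    """
    mask = x.abs() >= thresholds  # broadcasts over the last (channel) dimension
    return x * mask               # zeros can then be skipped by a sparse kernel


# Hypothetical usage with placeholder thresholds.
if __name__ == "__main__":
    acts = torch.randn(4, 11008)                     # a batch of FFN activations
    thr = torch.full((11008,), 0.5)                  # placeholder per-channel thresholds
    sparse_acts = channel_wise_threshold(acts, thr)
    print((sparse_acts == 0).float().mean().item())  # achieved sparsity ratio
```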
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: efficient inference, activation sparsification
Languages Studied: N/A
Submission Number: 4751