Keywords: Large Language Model, Sparsity
Abstract: Large Language Models (LLMs) deliver strong capabilities but incur high inference costs due to dense computation and memory access. Training-free activation sparsity is a promising approach for efficient LLM inference, owing to its data adaptivity and low computational overhead. However, existing methods typically rely only on activation information and a uniform sparsity ratio, overlooking the critical interplay with weights and the variation in sparsity sensitivity across blocks, which leads to suboptimal performance. In this paper, we examine these limitations and identify two key phenomena in modern LLMs: 1) less significant activations may align with highly important weights, and 2) sparsity sensitivity varies non-monotonically across model blocks. To address these issues, we propose a novel Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse) method that leverages both activation and weight information and enables adaptive sparsity allocation across different granularities. Specifically, we introduce a weight-aware activation sparsification mechanism that integrates activation magnitudes with precomputed weight norms to more accurately identify salient channels. This is combined with a mixed-granularity sparsity allocation scheme featuring a coarse-to-fine strategy: a global sparsity budget is first distributed across blocks via evolutionary search to protect sensitive regions, and subsequently refined at finer granularities within each block to minimize reconstruction error. We improve existing sparse kernels and demonstrate the effectiveness of the proposed method via extensive experiments conducted on three representative models. Notably, at 50% sparsity, WiSparse preserves 97% of Llama3.1's dense-model performance, surpassing the strongest baseline by 2.23 percentage points while achieving a 21.4% acceleration in end-to-end inference speed.
Our research contributes to advancing the performance limits of training-free approaches for efficient LLM inference, effectively pushing the boundaries of achievable speedup without training.
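The weight-aware sparsification idea described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' kernel: the function name, shapes, and top-k selection are assumptions; the abstract specifies only that channel saliency combines activation magnitudes with precomputed weight norms.

```python
import numpy as np

def weight_aware_sparsify(x, W, sparsity=0.5):
    """Zero the least-salient input channels of x before computing x @ W.

    Illustrative sketch (not the WiSparse implementation): saliency of
    channel j is |x_j| * ||W[j, :]||_2, so a channel with a small
    activation that multiplies a large weight row is still retained,
    unlike magnitude-only pruning.
    """
    weight_norms = np.linalg.norm(W, axis=1)          # precomputed once per layer
    scores = np.abs(x) * weight_norms                 # weight-aware saliency
    k = int(round((1.0 - sparsity) * x.shape[-1]))    # number of channels to keep
    keep = np.argsort(scores)[-k:]                    # indices of top-k channels
    mask = np.zeros_like(x)
    mask[keep] = 1.0
    return x * mask

# Example: channel 0 has a small activation (0.1) but a large weight row,
# so weight-aware scoring keeps it while dropping the large activation.
x = np.array([0.1, 1.0])
W = np.array([[10.0, 0.0],
              [0.5, 0.0]])
print(weight_aware_sparsify(x, W, sparsity=0.5))  # → [0.1 0. ]
```

A magnitude-only criterion would instead keep channel 1 here, which is exactly the failure mode ("less significant activations may align with highly important weights") the abstract identifies.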
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16471