Abstract: Optimizing LLM inference has become increasingly important as the demand for efficient on-device deployments grows. To reduce the computational overhead of the MLP components, which account for a significant portion of LLM inference, ReLU-fied LLMs have been introduced to maximize activation sparsity. Several sparsity prediction methods have been developed to skip unnecessary memory accesses and computations by predicting which activations will be zero. In this paper, we propose Grasp, a novel magnitude-based, training-free sparsity prediction technique that builds on the existing sign bit-based method for ReLU-fied LLMs. The proposed method enhances prediction accuracy by grouping values according to their distribution within each vector and explicitly accounting for statistical outliers. This allows the impact of each element to be estimated more accurately yet efficiently, improving both activation sparsity prediction accuracy and computational efficiency. Compared to the state-of-the-art technique, Grasp achieves higher sparsity prediction accuracy and $11\%$ higher skipping efficiency, which corresponds to a $1.85\times$ speedup over dense inference.
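The abstract does not specify Grasp's exact grouping or outlier rules, so the following is only a minimal, illustrative Python sketch of the general idea: approximate the input's magnitudes with a few per-group representatives, keep statistical outliers at full precision, and predict a neuron as active when its estimated pre-activation is positive. The function name `predict_active_neurons` and the parameters `num_groups` and `outlier_frac` are hypothetical, not taken from the paper.

```python
import numpy as np

def predict_active_neurons(x, W, num_groups=4, outlier_frac=0.01):
    """Predict which ReLU neurons will be nonzero, using a grouped-magnitude
    approximation of the input vector (illustrative sketch only).

    x: input activation vector, shape (d,)
    W: up-projection weight matrix, shape (n_neurons, d)
    Returns a boolean mask over neurons predicted to survive ReLU.
    """
    d = x.shape[0]
    mags = np.abs(x)

    # Keep the largest-magnitude elements (statistical outliers) exact.
    k = max(1, int(outlier_frac * d))
    outlier_idx = np.argpartition(mags, -k)[-k:]
    regular = np.ones(d, dtype=bool)
    regular[outlier_idx] = False

    # Bucket the remaining magnitudes into quantile groups and replace each
    # element with its group's mean magnitude -- a coarse, cheap surrogate.
    reg_mags = mags[regular]
    edges = np.quantile(reg_mags, np.linspace(0.0, 1.0, num_groups + 1))
    gids = np.clip(np.searchsorted(edges, reg_mags, side="right") - 1,
                   0, num_groups - 1)
    means = np.array([reg_mags[gids == g].mean() if np.any(gids == g) else 0.0
                      for g in range(num_groups)])

    # Reassemble an approximate input: sign bits plus grouped magnitudes,
    # with outliers carried at full precision.
    x_approx = np.empty_like(x)
    x_approx[regular] = np.sign(x[regular]) * means[gids]
    x_approx[outlier_idx] = x[outlier_idx]

    # A neuron is predicted active if its estimated pre-activation is
    # positive; ReLU zeroes everything else, so those weight rows (and the
    # corresponding memory accesses) can be skipped.
    return (W @ x_approx) > 0.0
```

Note that a real predictor would evaluate the estimate far more cheaply than the dense `W @ x_approx` product used here to keep the sketch self-contained, for example by combining precomputed weight sign bits with the small number of group representatives; the speedup reported in the abstract comes from skipping the weight rows whose neurons are predicted inactive.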