Keywords: Sparse Activation, Large Language Model, Inference Speedup
Abstract: Sparse activation accelerates the decoding of large language models by eliminating redundant computations and reducing memory access during matrix multiplications. Current approaches are limited by their reliance on the strong assumption that "values across different dimensions of hidden states are drawn from independent and identically distributed random variables." We challenge this assumption by analyzing the causal dependencies between tokens and the correlations between different dimensions of hidden states. Building on this analysis, we introduce Normalized Sparse Activation (NorSA), a method that accounts for inter-dimensional relationships and integrates contextual information through rotation and norm-based thresholding while maintaining computational efficiency. Experiments across the LLaMA, Mistral, and Qwen model series show that NorSA consistently outperforms existing methods. For LLaMA3-8B with 50% activation sparsity, NorSA narrows the perplexity gap to only 0.44 points relative to the dense model and restricts the zero-shot accuracy decline to 1.23%, surpassing LaRoSA by 1.63% and TEAL by 3.9%.
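To make the general idea of activation sparsity concrete, the following is a minimal sketch, not the NorSA algorithm itself. It assumes a simple rule in which entries of a hidden-state vector are zeroed when their magnitude falls below a fraction `tau` of the vector's RMS norm, and it omits the rotation step entirely. The function names `threshold_by_norm` and `sparse_matvec`, the parameter `tau`, and the thresholding rule are all hypothetical choices made for illustration.

```python
import math

def threshold_by_norm(x, tau):
    """Zero entries of x whose magnitude is below tau * RMS(x).

    Assumed rule for illustration only; not the paper's thresholding formula.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    cutoff = tau * rms
    return [v if abs(v) >= cutoff else 0.0 for v in x]

def sparse_matvec(W, x):
    """Compute W @ x while skipping every column whose input entry is zero.

    Skipped columns are never read, which is where the savings in
    computation and memory access come from.
    """
    active = [j for j, v in enumerate(x) if v != 0.0]
    return [sum(row[j] * x[j] for j in active) for row in W]

# Toy hidden state and weight matrix (made-up values).
x = [3.0, -0.1, 0.2, -4.0]
W = [[1, 2, 3, 4],
     [0, 1, 0, 1]]

x_sparse = threshold_by_norm(x, tau=0.5)  # keeps only the two large entries
y_sparse = sparse_matvec(W, x_sparse)     # reads 2 of the 4 columns of W
```

In this toy case half of the input entries are dropped, so only half of the columns of `W` are touched, and the result differs from the dense product only by the contribution of the small entries.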
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5866