NorSA: Accelerate LLM Decoding via Normalized Sparse Activation

Tianteng Gu; Bo Xiao; Ke Zeng; Yanmin Qian

15 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | CC BY 4.0

Keywords: Sparse Activation, Large Language Model, Inference Speedup

Abstract: Sparse activation accelerates the decoding of large language models by eliminating redundant computations and reducing memory access during matrix multiplications. Current approaches are potentially limited by their reliance on the strong assumption that "values across different dimensions of hidden states are drawn from independent and identically distributed random variables." We challenge this assumption by analyzing the causal dependencies between tokens and the correlations between different dimensions of the hidden states. Building on this insight, we introduce Normalized Sparse Activation (NorSA), a method that accounts for inter-dimensional relationships and integrates contextual information through rotation and norm-based thresholding while maintaining computational efficiency. Experiments across the LLaMA, Mistral, and Qwen model series show that NorSA consistently outperforms existing methods. For LLaMA3-8B at 50% activation sparsity, NorSA narrows the perplexity gap relative to the dense model to only 0.44 points and limits the zero-shot accuracy decline to 1.23%, surpassing LaRoSA by 1.63% and TEAL by 3.9%.
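To make the first sentence of the abstract concrete, the following is a minimal sketch of generic magnitude-based activation sparsity in a matrix-vector product. It is not the NorSA method: the abstract does not specify the rotation or the norm-based threshold, so neither is shown. The function names `active_indices` and `sparse_matvec` are hypothetical, and selecting entries by rank is an illustrative assumption standing in for a calibrated threshold.

```python
# Illustrative sketch only: generic magnitude-based activation sparsity,
# not the NorSA algorithm. All names here are hypothetical.

def active_indices(x, sparsity):
    """Return sorted indices of the entries kept after dropping the
    smallest-magnitude `sparsity` fraction of the hidden state `x`."""
    n_drop = int(len(x) * sparsity)
    order = sorted(range(len(x)), key=lambda i: abs(x[i]))  # ascending |x_i|
    return sorted(order[n_drop:])

def dense_matvec(W, x):
    """Reference product: every column of W is read."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sparse_matvec(W, x, idx):
    """Approximate product that reads only the columns of W listed in `idx`.
    Skipped columns cost neither multiply-adds nor weight loads."""
    return [sum(row[i] * x[i] for i in idx) for row in W]

x = [0.1, -2.0, 0.05, 1.5]          # toy hidden state
W = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]          # toy weight matrix (2 x 4)

idx = active_indices(x, sparsity=0.5)   # keep the 2 largest-magnitude entries
y_sparse = sparse_matvec(W, x, idx)
y_dense = dense_matvec(W, x)
print(idx, y_sparse, y_dense)
```

At 50% sparsity only half of the weight columns are touched, which is where the savings in computation and memory access come from during decoding; the difference between `y_sparse` and `y_dense` is the approximation error that a sparsification method aims to keep small.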

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Submission Number: 5866
