Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Comprehensive studies on the measurement, influential factors, architectural designs, and training practices for greater activation sparsity.
Abstract: Activation sparsity denotes the existence of a substantial share of weakly-contributed neurons within the feed-forward networks of large language models (LLMs), providing wide potential benefits such as computation acceleration. However, existing works lack thorough quantitative studies of this useful property, in terms of both its measurement and its influential factors. In this paper, we address three underexplored research questions: (1) How can activation sparsity be measured more accurately? (2) How is activation sparsity affected by the model architecture and training process? (3) How can we build a more sparsely activated and efficient LLM? Specifically, we develop a generalizable and performance-friendly metric, named CETT-PPL-1\%, to measure activation sparsity. Based on CETT-PPL-1\%, we quantitatively study the influence of various factors and observe several important phenomena, such as the convergent power-law relationship between sparsity and the amount of training data, the higher sparsity achieved by the ReLU activation compared with the mainstream SiLU activation, the potential sparsity merit of a small width-depth ratio, and the insensitivity of activation sparsity to the parameter scale. Finally, we provide implications for building sparse and effective LLMs, and demonstrate the reliability of our findings by training a 2.4B model with a sparsity ratio of 93.52\%, showing a 4.1$\times$ speedup compared with its dense version. The codes and checkpoints are available at https://github.com/thunlp/SparsingLaw/.
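To make the measurement concrete, below is a minimal sketch of a CETT-style sparsity measurement for a single gated FFN layer. It is not the authors' implementation: the weight names (`W_gate`, `W_up`, `W_down`), the use of the per-neuron bound |a_i|·‖w_i‖ as a proxy for the exact tail error, and the fixed threshold value are illustrative assumptions; under CETT-PPL-1\%, the threshold would instead be searched so that perplexity rises by at most 1%.

```python
import torch

def cett_sparsity(hidden, W_gate, W_up, W_down, cett_threshold=0.2):
    """
    Minimal sketch of a CETT-style sparsity measurement for one gated FFN layer.
    For each token, neurons are sorted by the magnitude of their output
    contribution; the largest tail whose cumulative (approximate) error stays
    below `cett_threshold`, relative to the full FFN output norm, is treated
    as inactive. Returns the average fraction of inactive neurons.
    The threshold value here is a placeholder; CETT-PPL-1% would pick the
    largest threshold keeping the perplexity increase within 1%.
    """
    # Per-neuron intermediate activations of a gated FFN: (tokens, d_ff)
    acts = torch.nn.functional.silu(hidden @ W_gate) * (hidden @ W_up)
    # Proxy for each neuron's output contribution: |a_i| * ||w_down_i||,
    # where W_down has shape (d_ff, d_model)
    contrib = acts.abs() * W_down.norm(dim=1)                   # (tokens, d_ff)
    full_out_norm = (acts @ W_down).norm(dim=1, keepdim=True)   # (tokens, 1)

    sorted_contrib, _ = contrib.sort(dim=1)                     # ascending
    # Cumulative tail error as a fraction of the full output norm
    tail_error = sorted_contrib.cumsum(dim=1) / full_out_norm.clamp_min(1e-8)
    inactive = (tail_error <= cett_threshold).sum(dim=1)        # per token
    return (inactive.float() / contrib.shape[1]).mean().item()

# Illustrative usage with random weights (shapes only; not a real model)
torch.manual_seed(0)
d_model, d_ff, n_tokens = 64, 256, 32
h = torch.randn(n_tokens, d_model)
print(cett_sparsity(h, torch.randn(d_model, d_ff),
                    torch.randn(d_model, d_ff),
                    torch.randn(d_ff, d_model)))
```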
Lay Summary: We study activation sparsity, a phenomenon widely present in most LLMs that benefits computation efficiency and interpretability. Three underexplored research questions are addressed in this work: (1) How can activation sparsity be measured more accurately? (2) How is activation sparsity affected by the model architecture and training process? (3) How can we build a more sparsely activated and efficient LLM? First, we propose a more general and performance-friendly metric for activation sparsity, named CETT-PPL-1\%. Next, comprehensive experiments are conducted to reveal the quantitative influence of four factors on activation sparsity: the amount of training data, the activation function, the width-depth ratio, and the parameter scale. Finally, we summarize the implications for building more sparsely-activated, efficient LLMs, including the sparsity-promoting benefits of more training data, the ReLU activation (compared with SiLU), and a smaller width-depth ratio. The insensitivity of sparsity to the parameter scale is also a surprising and interesting observation. Our paper offers a more accurate paradigm for inspecting the sparsity level of an LLM, and the empirical laws found in this work provide instructive value for designing and pre-training LLMs with greater activation sparsity, which helps produce more efficient LLMs.
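As an illustration of what fitting a convergent power law of sparsity against training data could look like, the sketch below fits a saturating form s(D) = s_inf - c*D^(-alpha) to placeholder (data amount, sparsity) pairs. Both the functional form and the numbers are assumptions for illustration only, not the paper's fitted law or results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements: training data amount (billions of tokens) vs.
# CETT-PPL-1% sparsity ratio. Values are placeholders, not paper results.
data_tokens = np.array([10., 20., 50., 100., 200., 400.])
sparsity    = np.array([0.72, 0.78, 0.84, 0.88, 0.91, 0.925])

def convergent_power_law(D, s_inf, c, alpha):
    # Sparsity rises with the data amount D and saturates at a limit s_inf.
    return s_inf - c * D ** (-alpha)

params, _ = curve_fit(convergent_power_law, data_tokens, sparsity,
                      p0=[0.95, 1.0, 0.5], maxfev=10000)
s_inf, c, alpha = params
print(f"fitted limit sparsity ~ {s_inf:.3f}, exponent alpha ~ {alpha:.3f}")
```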
Link To Code: https://github.com/thunlp/SparsingLaw/
Primary Area: Deep Learning->Large Language Models
Keywords: activation sparsity, large language model
Submission Number: 1995