Universal Properties of Activation Sparsity in Modern Large Language Models

ICLR 2026 Conference Submission 576 Authors

01 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLMs, activation sparsity, efficiency, representations
TL;DR: We provide an overview of activation sparsity in modern LLM architectures and highlight a set of interesting properties of the activations across different model families.
Abstract: Activation sparsity is an intriguing property of deep neural networks that has been extensively studied in ReLU-based models, due to its advantages for efficiency, robustness, and interpretability. However, methods relying on exact zero activations do not directly apply to modern Large Language Models (LLMs), leading to fragmented, model-specific strategies for LLM activation sparsity and a gap in its general understanding. In this work, we introduce a general framework for evaluating sparsity robustness in contemporary LLMs and conduct a systematic investigation of this phenomenon in their feedforward (FFN) layers. Our results uncover universal properties of activation sparsity across diverse model families and scales. Importantly, we observe that the potential for effective activation sparsity grows with model size, highlighting its increasing relevance as models scale. Furthermore, we present the first study of activation sparsity in diffusion-based LLMs. Overall, our work provides a comprehensive perspective and practical guidance for harnessing activation sparsity in LLM design and acceleration.
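For illustration, below is a minimal sketch of one way to estimate approximate activation sparsity in a Transformer's FFN layers, counting the fraction of intermediate activations whose magnitude falls below a small threshold. The model choice (gpt2), the hook-based measurement, and the threshold tau are assumptions made for this example only; they are not the evaluation framework described in the paper.

# Minimal sketch: approximate activation sparsity of FFN intermediate activations.
# Assumptions: HuggingFace gpt2 as a stand-in model, tau = 1e-2 as the near-zero threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM exposing its FFN activation module works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

sparsity = {}  # layer index -> list of near-zero fractions

def make_hook(layer_idx, tau=1e-2):
    def hook(module, inputs, output):
        # Fraction of post-activation FFN values with magnitude below tau.
        frac = (output.abs() < tau).float().mean().item()
        sparsity.setdefault(layer_idx, []).append(frac)
    return hook

# In GPT-2, the FFN intermediate activation is the output of block.mlp.act (GELU).
for i, block in enumerate(model.transformer.h):
    block.mlp.act.register_forward_hook(make_hook(i))

with torch.no_grad():
    batch = tok("Activation sparsity is an intriguing property.", return_tensors="pt")
    model(**batch)

for i, vals in sorted(sparsity.items()):
    print(f"layer {i:2d}: mean near-zero fraction = {sum(vals) / len(vals):.3f}")

This kind of measurement yields only soft sparsity (values close to but not exactly zero), which is precisely why exact-zero methods do not transfer directly to non-ReLU LLMs.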
Primary Area: foundation or frontier models, including LLMs
Submission Number: 576