Scaling Laws for Fully Sparsely-Activated Large Language Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Activation sparsity, scaling law, large language models
TL;DR: In this work, we investigate the architecture and scaling laws for fully sparsely-activated models, where every activation in linear transformations is sparse.
Abstract: Scaling laws play a crucial role in understanding and optimizing Large Language Models (LLMs). While previous work on scaling laws has primarily focused on either fully dense models or models with sparse Mixture of Experts (MoE), our work investigates fully sparsely-activated models, where every activation in linear transformations is sparse. We derive scaling laws for these models through extensive experiments with varying model sizes, training token counts, and activation sparsity ratios. Our findings demonstrate that fully sparsely-activated LLMs exhibit favorable scaling properties: as the total model size increases, LLMs can maintain higher activation sparsity while the performance gap between sparsely-activated and dense models narrows. Notably, our scaling laws indicate that, for a fixed number of active parameters, a sparsely-activated full-precision model with a sparsity ratio of 45.58% achieves optimal performance. Furthermore, our scaling laws remain applicable to 1-bit pre-training of LLMs, suggesting promising directions for improving the efficiency of future models.
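To make the notion of "every activation in linear transformations is sparse" concrete, below is a minimal, hypothetical PyTorch sketch of a linear layer whose input activations are sparsified by top-k magnitude selection. The abstract does not specify the paper's actual sparsification mechanism, so the class name `TopKSparseLinear`, the top-k choice, and the `sparsity` parameter (here set to the 45.58% ratio mentioned above) are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn as nn


class TopKSparseLinear(nn.Module):
    """Hypothetical sketch: a linear layer with sparsified input activations.

    The paper's exact sparsification scheme is not given in the abstract;
    top-k magnitude selection is assumed here for illustration. `sparsity`
    is the fraction of activations zeroed out per token.
    """

    def __init__(self, in_features: int, out_features: int, sparsity: float = 0.4558):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Number of activations kept per token; the remainder are set to zero.
        self.k = max(1, int(round(in_features * (1.0 - sparsity))))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Keep the k largest-magnitude activations along the hidden dimension,
        # zero out the rest, then apply the dense weight matrix.
        _, idx = torch.topk(x.abs(), self.k, dim=-1)
        mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
        return self.linear(x * mask)


if __name__ == "__main__":
    layer = TopKSparseLinear(in_features=1024, out_features=4096, sparsity=0.4558)
    tokens = torch.randn(2, 16, 1024)  # (batch, sequence, hidden)
    out = layer(tokens)
    print(out.shape)  # torch.Size([2, 16, 4096])
```

In a "fully" sparsely-activated model, a sparsification step of this kind would be applied to the inputs of every linear transformation (attention projections and feed-forward layers alike), rather than only to expert routing as in MoE.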
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11921