Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

Published: 05 Mar 2025, Last Modified: 30 Mar 2025 · SLLM · CC BY 4.0
Track: tiny / short paper (up to 4 pages)
Keywords: MoE, Scaling Laws, MoE Scaling Law, FLOPs, Parameters
TL;DR: This paper presents an empirical investigation into the trade-off between parameter count and FLOPs per token in scaling Mixture-of-Experts (MoE) language models.
Abstract: Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoE) models, which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts a model's performance during pretraining and downstream evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.
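To make the parameter/FLOPs decoupling described in the abstract concrete, below is a minimal sketch (not the paper's formulation) of how an MoE layer's sparsity, total parameters, active parameters, and approximate FLOPs per token relate. The function name, the per-expert parameter count, and the "FLOPs ≈ 2 × active parameters" estimate are illustrative assumptions, not quantities defined by the paper.

```python
# Toy illustration of how MoE sparsity decouples parameter count from FLOPs per token.
# All names and the 2-FLOPs-per-parameter approximation are assumptions for this sketch.

def moe_stats(d_model: int, d_ff: int, n_experts: int, top_k: int) -> dict:
    """Rough per-layer statistics for a single MoE feed-forward block."""
    params_per_expert = 2 * d_model * d_ff          # up- and down-projection weights
    total_params = n_experts * params_per_expert    # all experts, active or not
    active_params = top_k * params_per_expert       # experts actually used per token
    sparsity = 1 - top_k / n_experts                # fraction of inactive parameters
    flops_per_token = 2 * active_params             # common forward-pass estimate
    return {
        "total_params": total_params,
        "active_params": active_params,
        "sparsity": sparsity,
        "flops_per_token": flops_per_token,
    }

if __name__ == "__main__":
    # Fixed FLOPs per token (top_k = 2), increasing sparsity and total parameters:
    for n_experts in (8, 32, 128):
        s = moe_stats(d_model=1024, d_ff=4096, n_experts=n_experts, top_k=2)
        print(f"E={n_experts:4d}  sparsity={s['sparsity']:.3f}  "
              f"total={s['total_params']/1e6:.0f}M  "
              f"active={s['active_params']/1e6:.0f}M  "
              f"FLOPs/token~{s['flops_per_token']/1e6:.0f}M")
```

Holding top_k fixed while growing n_experts keeps active parameters (and hence FLOPs per token) constant while total parameters grow, which is the axis of variation the paper studies.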
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 33
