Track: tiny / short paper (up to 4 pages)
Keywords: MoE, Scaling Laws, MoE Scaling Law, FLOPs, Parameters
TL;DR: This paper presents an empirical investigation into the trade-off between parameter count and FLOPs per token in scaling Mixture-of-Experts (MoE) language models.
Abstract: Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts the model's performance during pretraining and downstream evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing work in this area, offering insights for designing more efficient architectures.
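The following is a minimal sketch, not taken from the paper, of the quantities the abstract refers to: for a hypothetical top-k-routed MoE feed-forward layer, sparsity can be read as the fraction of inactive parameters, while FLOPs per token scale only with the active parameters. All names, layer shapes, and the 2-FLOPs-per-weight approximation below are illustrative assumptions, not the authors' setup.

```python
# Hypothetical illustration of the parameter-count vs. FLOPs-per-token trade-off
# in a sparse MoE layer; sizes and cost model are assumptions for this sketch.

def moe_layer_stats(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return total params, active params per token, sparsity, and approximate
    FLOPs per token for one hypothetical MoE feed-forward layer."""
    params_per_expert = 2 * d_model * d_ff            # two weight matrices per expert
    total_params = num_experts * params_per_expert    # all experts are stored
    active_params = top_k * params_per_expert         # only routed experts are computed
    sparsity = 1.0 - active_params / total_params     # fraction of inactive parameters
    flops_per_token = 2 * active_params               # ~2 FLOPs per active weight (multiply-add)
    return total_params, active_params, sparsity, flops_per_token


if __name__ == "__main__":
    # Example: 64 experts with 2 routed per token -> ~97% of expert params are inactive,
    # so total parameters grow with num_experts while FLOPs per token stay fixed by top_k.
    total, active, sparsity, flops = moe_layer_stats(
        d_model=1024, d_ff=4096, num_experts=64, top_k=2
    )
    print(f"total={total:,} active/token={active:,} "
          f"sparsity={sparsity:.4f} FLOPs/token~{flops:,}")
```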
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 33