Track: tiny / short paper (up to 4 pages)
Keywords: MoE, Scaling Laws, MoE Scaling Law, FLOPs, Parameters
TL;DR: This paper presents an empirical investigation into the trade-off between parameter count and FLOPs per token in scaling Mixture-of-Experts (MoE) language models.
Abstract: Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts the model's performance during pretraining and downstream evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing work in this area, offering insights for designing more efficient architectures.
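The following is a minimal sketch, not taken from the paper, of the quantities the abstract refers to: for a hypothetical top-k-routed MoE feed-forward layer, sparsity can be read as the fraction of inactive parameters, while FLOPs per token scale only with the active parameters. All names, layer shapes, and the 2-FLOPs-per-weight approximation below are illustrative assumptions, not the authors' setup.

```python
# Hypothetical illustration of the parameter-count vs. FLOPs-per-token trade-off
# in a sparse MoE layer; sizes and cost model are assumptions for this sketch.

def moe_layer_stats(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return total params, active params per token, sparsity, and approximate
    FLOPs per token for one hypothetical MoE feed-forward layer."""
    params_per_expert = 2 * d_model * d_ff            # two weight matrices per expert
    total_params = num_experts * params_per_expert    # all experts are stored
    active_params = top_k * params_per_expert         # only routed experts are computed
    sparsity = 1.0 - active_params / total_params     # fraction of inactive parameters
    flops_per_token = 2 * active_params               # ~2 FLOPs per active weight (multiply-add)
    return total_params, active_params, sparsity, flops_per_token


if __name__ == "__main__":
    # Example: 64 experts with 2 routed per token -> ~97% of expert params are inactive,
    # so total parameters grow with num_experts while FLOPs per token stay fixed by top_k.
    total, active, sparsity, flops = moe_layer_stats(
        d_model=1024, d_ff=4096, num_experts=64, top_k=2
    )
    print(f"total={total:,} active/token={active:,} "
          f"sparsity={sparsity:.4f} FLOPs/token~{flops:,}")
```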
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 33