TL;DR: This paper presents an empirical investigation into the trade-off between parameter count and FLOPs per token in scaling Mixture-of-Experts (MoE) language models.
Abstract: Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity is primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoE) models, which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects both pretraining and downstream performance. We find that, under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing work in this area, offering insights for designing more efficient architectures.
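To make the sparsity definition above concrete, the following is a minimal sketch (not taken from the paper) of how the ratio of non-active to total parameters can be computed for an MoE configuration. The function name `moe_sparsity` and all numbers are illustrative assumptions rather than details of the submission.

```python
# Minimal sketch: sparsity of an MoE model, defined as the ratio of
# non-active (inactive) to total parameters. All names and numbers
# below are hypothetical and chosen only for illustration.

def moe_sparsity(total_experts: int, active_experts: int,
                 expert_params: int, shared_params: int) -> float:
    """Return sparsity = non-active parameters / total parameters."""
    total_params = shared_params + total_experts * expert_params
    active_params = shared_params + active_experts * expert_params
    return (total_params - active_params) / total_params

# Example: 64 experts of 10M parameters each, 2 routed per token,
# plus 100M parameters shared by every token (attention, embeddings, ...).
print(f"sparsity = {moe_sparsity(64, 2, 10_000_000, 100_000_000):.3f}")
```

Under these assumed numbers, only 120M of the 740M parameters are used for a given token, giving a sparsity of roughly 0.84; sweeping this ratio under fixed constraints is the axis the paper studies.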
Lay Summary: (1) The capacity of language models to fit the training data depends on both the size of the language model and the amount of compute spent per example. While it is known that increasing the size of a language model leads to improved performance, less is known about how to trade off these two factors optimally. We study this trade-off using the popular Mixture-of-Experts (MoE) models, which can have many parameters but use only some of them for each input token. The ratio of non-active to total parameters is called sparsity.
(2) We conduct a large-scale empirical study to systematically determine how the sparsity level should be set under a variety of settings, including a fixed training budget and a fixed model size.
(3) We find that the total parameter count plays a significant role during training under a fixed compute budget, while the compute per token may have a larger influence on downstream task performance.
Primary Area: Deep Learning->Large Language Models
Keywords: Mixture-of-Experts, Scaling Laws
Submission Number: 5486