Abstract: Sparse expert models can achieve promising results with an outrageously large number of parameters at constant computation cost, and they have therefore become a trend in model scaling. Still, it remains unclear how Mixture-of-Experts (MoE) layers, which leverage parameters through sparse activation, bring quality gains. In this work, we investigate several key factors in sparse expert models. We find that load imbalance may not be a significant problem affecting model quality, and the auxiliary balancing loss can be removed without significant performance degradation. We further discover that a larger number of sparsely activated experts $k$ does not necessarily improve performance on a time basis; instead, we observe diminishing marginal utility, where the performance gap gradually narrows as $k$ increases. We take a step forward and propose a simple method called expert prototyping, which splits experts into different prototypes and applies top-$k$ routing within each prototype in parallel. Our experiments demonstrate that the prototyping strategy improves model quality in comparison with further increasing $k$ at a computation cost comparable to prototyping. Furthermore, we explore training extremely large-scale models and find that the strategy is even more effective for larger models. Notably, we push the model scale to over $1$ trillion parameters on only $480$ NVIDIA V100-32GB GPUs. The proposed giant model, M6-T with expert prototyping, achieves substantial speedup in convergence over the same-size baseline.
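To make the expert prototyping idea concrete, the following is a minimal sketch (not the authors' released implementation) of an MoE layer in which the experts are split into prototypes and top-$k$ routing is applied independently within each prototype, with the prototype outputs summed. All class, argument, and variable names here are illustrative assumptions.

```python
# Minimal sketch of expert prototyping, assuming a PyTorch setting.
# E experts are partitioned into P prototypes of E/P experts each; each
# prototype has its own gate and performs top-k routing over its subset.
import torch
import torch.nn as nn


class PrototypedMoE(nn.Module):
    def __init__(self, d_model, d_ff, num_experts, num_prototypes, k=1):
        super().__init__()
        assert num_experts % num_prototypes == 0
        self.num_prototypes = num_prototypes
        self.experts_per_proto = num_experts // num_prototypes
        self.k = k
        # One gating matrix per prototype, scoring only its own experts.
        self.gates = nn.ModuleList(
            nn.Linear(d_model, self.experts_per_proto, bias=False)
            for _ in range(num_prototypes)
        )
        # Experts are simple position-wise FFNs, grouped by prototype.
        self.experts = nn.ModuleList(
            nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(self.experts_per_proto)
            )
            for _ in range(num_prototypes)
        )

    def forward(self, x):
        # x: (num_tokens, d_model); outputs of all prototypes are summed.
        out = torch.zeros_like(x)
        for p in range(self.num_prototypes):
            logits = self.gates[p](x)                        # (tokens, E/P)
            weights, idx = logits.softmax(-1).topk(self.k, dim=-1)
            for slot in range(self.k):
                for e in range(self.experts_per_proto):
                    mask = idx[:, slot] == e                 # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * self.experts[p][e](x[mask])
        return out
```

The sketch uses a dense per-expert loop for clarity; a distributed implementation would instead dispatch tokens to expert shards (e.g. via all-to-all communication), but the routing logic per prototype is the same.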