Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Abstract: This paper revisits the implementation of \textbf{L}oad-\textbf{b}alancing \textbf{L}oss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, the LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of expert $i$.
Existing MoE training frameworks usually employ the parallel training strategy so that $f_i$ and the LBL are calculated within a \textbf{micro-batch} and then averaged across parallel groups.
In practice, a micro-batch used for training billion-scale LLMs normally contains only a few sequences.
Consequently, the micro-batch LBL is enforced almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence.
Under this strict constraint, even tokens from a domain-specific sequence (\textit{e.g.}, code) are uniformly routed to all experts,
thereby inhibiting expert specialization.
In this work, we propose calculating the LBL over a \textbf{global-batch} to loosen this constraint.
Because a global-batch contains far more diverse sequences than a micro-batch, this encourages load balance at the corpus level rather than the sequence level.
Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL.
Through experiments on training MoEs-based LLMs (up to \textbf{42.8B} total parameters and \textbf{400B} tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks.
Our analysis reveals that the global-batch LBL greatly improves the domain specialization of MoE experts.
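The following is a minimal, illustrative sketch (not the authors' released code) of the two LBL variants described in the abstract, assuming a PyTorch data-parallel setup; the names `load_balancing_loss`, `gates`, `topk_idx`, and `global_batch` are hypothetical.

```python
# Hypothetical sketch of micro-batch vs. global-batch LBL, assuming PyTorch
# data parallelism. `gates`: per-token routing probabilities, shape (T, N_E);
# `topk_idx`: indices of the experts selected for each token.
import torch
import torch.distributed as dist


def load_balancing_loss(gates: torch.Tensor, topk_idx: torch.Tensor,
                        num_experts: int, global_batch: bool = False) -> torch.Tensor:
    """LBL = N_E * sum_i f_i * p_i."""
    # p_i: average gating score of expert i over the local micro-batch.
    p = gates.mean(dim=0)  # (N_E,)

    # Local per-expert token counts, later normalized into f_i.
    counts = torch.bincount(topk_idx.flatten(), minlength=num_experts).to(gates.dtype)

    if global_batch and dist.is_initialized():
        # Extra communication step: sum the counts across ranks so that f_i
        # reflects the global batch rather than a single micro-batch.
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)

    f = counts / counts.sum().clamp(min=1.0)  # f_i: selection frequency of expert i
    return num_experts * torch.sum(f * p)
```

With `global_batch=False` the loss balances routing within each micro-batch (the near-sequence-level constraint discussed above); with `global_batch=True` only the counts are synchronized, so the balancing pressure applies at the corpus level while the gradient still flows through the local gating scores.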
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: pre-training; scaling; sparse models
Contribution Types: NLP engineering experiment
Languages Studied: English; Chinese
Submission Number: 21