Efficient Routing in Sparse Mixture-of-Experts

Published: 01 Jan 2024 · Last Modified: 07 Oct 2024 · IJCNN 2024 · CC BY-SA 4.0
Abstract: Sparse Mixture-of-Experts (MoE) architectures substantially expand a model’s parameter count without a proportional increase in the computation applied to each input token or sample. However, the efficacy of these models depends heavily on the routing strategy used to assign tokens to experts: poor routing can leave experts under-trained or overly specialized, diminishing overall model performance. Previous approaches have relied on the Top-k router, where each token is assigned to a subset of experts. In this paper, we propose a routing mechanism that replaces the Top-k router with regularized optimal transport, leveraging the Sinkhorn algorithm to optimize token-expert matching. We evaluate pre-training efficiency against the GShard and Switch Transformers gating mechanisms under matched computational budgets. The results show that our model converges more than 2× faster than these baselines. Moreover, under the same computational constraints, our model achieves superior performance across eleven tasks from the GLUE and SuperGLUE benchmarks. We show that our approach improves token-expert matching in sparsely-activated MoE models, offering substantial gains in both training efficiency and task performance.
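To make the routing idea concrete, the sketch below shows a generic Sinkhorn-style balancing of a token-expert affinity matrix. It is a minimal illustration, not the paper's exact formulation: the function name `sinkhorn_routing`, the regularization strength `epsilon`, the iteration count, and the uniform per-expert capacity target are all assumptions made for this example.

```python
import math
import torch

def sinkhorn_routing(logits: torch.Tensor, n_iters: int = 3, epsilon: float = 0.05) -> torch.Tensor:
    """Balance a token-expert affinity matrix with Sinkhorn iterations.

    logits: [num_tokens, num_experts] raw router scores (higher = better match).
    Returns an assignment matrix whose rows sum to 1 (each token distributes
    unit mass over experts) and whose columns are pushed toward a uniform
    per-expert load of num_tokens / num_experts.
    """
    num_tokens, num_experts = logits.shape
    # Entropic regularization: smaller epsilon yields harder assignments.
    log_p = logits / epsilon
    for _ in range(n_iters):
        # Column step: scale each expert's column toward the target load.
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True) \
            + math.log(num_tokens / num_experts)
        # Row step: renormalize each token's distribution over experts.
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
    return log_p.exp()

# Hypothetical usage: route 8 tokens to 4 experts, then dispatch each token
# to its highest-scoring expert under the balanced plan.
scores = torch.randn(8, 4)
plan = sinkhorn_routing(scores)
expert_choice = plan.argmax(dim=-1)  # one expert index per token
```

Compared with a plain Top-k router, which normalizes each token's scores independently, the alternating row/column normalization couples tokens together, so the resulting assignment is nudged toward balanced expert utilization before any hard dispatch decision is made.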