Keywords: Mixture of Experts, Load Balancing, Machine Learning Systems
Abstract: In multi-GPU Mixture-of-Experts (MoE) networks, distributing experts across GPUs leads to load imbalance as token assignments vary. Recent methods address this by duplicating popular experts on additional GPUs, which requires accurately predicting token distributions before routing. This paper examines the tradeoffs between prediction strategy, accuracy, overhead, and system performance. We introduce MoE-GPS, a framework that quantifies these impacts and identifies optimal predictor designs for various system settings. Our results highlight Distribution-Only Prediction, which predicts the coarse token distribution across experts with much lower overhead than Token-to-Expert Prediction, achieving 23\% faster inference on Mixtral 8×7B evaluated on the MMLU benchmark.
Submission Number: 19
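To make the abstract's distinction concrete, the minimal sketch below contrasts the two predictor granularities for one MoE layer: predicting an expert for every token versus predicting only the aggregate per-expert load used to decide which popular experts to replicate. All names, shapes, and the toy linear predictors are illustrative assumptions, not the paper's method.

```python
import numpy as np

E = 8            # experts per MoE layer (Mixtral 8x7B uses 8)
T = 4096         # tokens in the current batch
D = 64           # toy hidden dimension (assumption for illustration)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((T, D))        # stand-in for token hidden states


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def token_to_expert_prediction(hidden, proxy_router):
    # Fine-grained: predict an expert id for every token (O(T*E) work),
    # then derive per-expert loads from the predicted assignments.
    logits = hidden @ proxy_router           # (T, E) proxy routing scores
    assignments = logits.argmax(axis=-1)     # predicted expert per token
    return np.bincount(assignments, minlength=E)


def distribution_only_prediction(hidden, load_predictor):
    # Coarse-grained: predict only the fraction of tokens each expert will
    # receive from pooled batch statistics -- far cheaper, yet sufficient
    # for deciding which experts to duplicate across GPUs.
    batch_summary = hidden.mean(axis=0)               # (D,) pooled features
    shares = softmax(batch_summary @ load_predictor)  # (E,) predicted fractions
    return (shares * T).round().astype(int)


# Toy "predictors" standing in for whatever lightweight models a system trains.
proxy_router = rng.standard_normal((D, E))
load_predictor = rng.standard_normal((D, E))

print("per-expert load from token-level prediction:",
      token_to_expert_prediction(hidden, proxy_router))
print("per-expert load from distribution-only prediction:",
      distribution_only_prediction(hidden, load_predictor))
```

The design point the abstract argues is that the second path trades per-token accuracy for a large reduction in prediction overhead, which MoE-GPS quantifies against end-to-end inference speed.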