Keywords: Mixture of Experts, Load Balancing, Machine Learning Systems
Abstract: In multi-GPU Mixture-of-Experts (MoE) networks, distributing experts across GPUs leads to load imbalance as token assignments vary. Recent methods address this by duplicating popular experts on additional GPUs, which requires accurately predicting token distributions before routing. This paper examines the tradeoffs between prediction strategy, accuracy, overhead, and system performance. We introduce MoE-GPS, a framework that quantifies these impacts and identifies optimal predictor designs for various system settings. Our results highlight Distribution-Only Prediction, which predicts the coarse token distribution across experts with much lower overhead than Token-to-Expert Prediction, achieving 23\% faster inference on Mixtral 8×7B evaluated on the MMLU benchmark.
Submission Number: 19
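To make the abstract's distinction concrete, the minimal sketch below contrasts the two predictor granularities for one MoE layer: predicting an expert for every token versus predicting only the aggregate per-expert load used to decide which popular experts to replicate. All names, shapes, and the toy linear predictors are illustrative assumptions, not the paper's method.

```python
import numpy as np

E = 8            # experts per MoE layer (Mixtral 8x7B uses 8)
T = 4096         # tokens in the current batch
D = 64           # toy hidden dimension (assumption for illustration)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((T, D))        # stand-in for token hidden states


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def token_to_expert_prediction(hidden, proxy_router):
    # Fine-grained: predict an expert id for every token (O(T*E) work),
    # then derive per-expert loads from the predicted assignments.
    logits = hidden @ proxy_router           # (T, E) proxy routing scores
    assignments = logits.argmax(axis=-1)     # predicted expert per token
    return np.bincount(assignments, minlength=E)


def distribution_only_prediction(hidden, load_predictor):
    # Coarse-grained: predict only the fraction of tokens each expert will
    # receive from pooled batch statistics -- far cheaper, yet sufficient
    # for deciding which experts to duplicate across GPUs.
    batch_summary = hidden.mean(axis=0)               # (D,) pooled features
    shares = softmax(batch_summary @ load_predictor)  # (E,) predicted fractions
    return (shares * T).round().astype(int)


# Toy "predictors" standing in for whatever lightweight models a system trains.
proxy_router = rng.standard_normal((D, E))
load_predictor = rng.standard_normal((D, E))

print("per-expert load from token-level prediction:",
      token_to_expert_prediction(hidden, proxy_router))
print("per-expert load from distribution-only prediction:",
      distribution_only_prediction(hidden, load_predictor))
```

The design point the abstract argues is that the second path trades per-token accuracy for a large reduction in prediction overhead, which MoE-GPS quantifies against end-to-end inference speed.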