VRouter: Micro-batch Level Load Balance via Inter-EP Routing for MoE Training

Submitted 18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · License: CC BY 4.0
Keywords: MoE, Pre-training, Expert Parallelism, Machine Learning System
Abstract: Load imbalance within the Expert Parallel (EP) group leads to poor GPU efficiency when pre-training large-scale Mixture-of-Experts (MoE) models. Although recent approaches attempt to mitigate this through dynamic expert rearrangement at the global-batch level, they overlook the rapid, dynamic variations in load distribution across micro-batches. Moreover, relocating or shadowing popular experts at the micro-batch level incurs substantial communication overhead due to frequent migration of expert parameters and gradients. To address these issues, we introduce VRouter, a novel inter-EP routing system that achieves better load balance at the micro-batch level without any expert migration or replication. VRouter rests on three key techniques: (1) an expert-shifting strategy that redistributes workloads across neighboring devices, creating additional opportunities for balancing; (2) an expert-dropping mechanism that reduces both per-device memory footprint and gradient-synchronization overhead across EP groups by selectively dropping experts while preserving load balance; and (3) a lightweight load-aware token routing algorithm that spreads load uniformly across devices. Experimental evaluations on representative MoE models demonstrate that VRouter achieves 1.05-1.13$\times$ throughput speedup over existing routing systems.
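To make the expert-shifting idea concrete, here is a toy sketch (our own illustration, not the paper's algorithm; the round-robin expert layout, the single-neighbor shift window, and the greedy least-loaded policy are all assumptions) of how per-expert token loads might be placed on EP devices when each expert may be served by its home device or an adjacent one:

```python
# Hypothetical illustration of micro-batch-level expert shifting:
# each expert normally runs on its "home" device, but its tokens may be
# shifted to the neighboring device if that device is currently lighter.
# No expert parameters are migrated or replicated, matching the paper's
# stated constraint; everything else here is our own simplification.

def balance_loads(expert_tokens, num_devices):
    """Greedily place each expert's token load on the lighter of its
    home device (round-robin layout assumed) and the next neighbor.
    Returns the resulting per-device load."""
    device_load = [0] * num_devices
    for expert_id, tokens in enumerate(expert_tokens):
        home = expert_id % num_devices
        neighbor = (home + 1) % num_devices
        # Shift to the neighbor only when it is strictly lighter.
        target = home if device_load[home] <= device_load[neighbor] else neighbor
        device_load[target] += tokens
    return device_load

# With a hot expert 0, static placement would yield loads [100, 20];
# allowing a one-hop shift moves expert 2 off the hot device:
print(balance_loads([90, 10, 10, 10], num_devices=2))  # → [90, 30]
```

The real system would additionally apply expert dropping and a load-aware token router on top of such placement decisions; this fragment only shows why a one-device shift window already creates balancing opportunities that a fixed expert-to-device mapping lacks.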
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 11490