VRouter: Micro-batch Level Load Balance via Inter-EP Routing for MoE Training

Submitted 18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · License: CC BY 4.0
Keywords: MoE, Pre-training, Expert Parallelism, Machine Learning System
Abstract: Load imbalance within the Expert Parallel (EP) group leads to poor GPU efficiency when pre-training large-scale Mixture-of-Experts (MoE) models. Although recent approaches attempt to mitigate this through dynamic expert rearrangement at the global-batch level, they overlook the rapid, dynamic variations in load distribution across micro-batches. Moreover, relocating or shadowing popular experts at the micro-batch level incurs substantial communication overhead due to frequent migration of expert parameters and gradients. To address these issues, we introduce VRouter, a novel inter-EP routing system that achieves better load balance at the micro-batch level without any expert migration or replication. VRouter rests on three key techniques: (1) an expert-shifting strategy that redistributes workloads across neighboring devices, creating additional opportunities for balancing; (2) an expert-dropping mechanism that reduces both per-device memory footprint and gradient-synchronization overhead across EP groups by selectively dropping experts while preserving load balance; and (3) a lightweight load-aware token routing algorithm that spreads load uniformly across devices. Experimental evaluations on representative MoE models demonstrate that VRouter achieves 1.05-1.13$\times$ throughput speedup over existing routing systems.
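To make the expert-shifting idea concrete, here is a toy sketch (our own illustration, not the paper's algorithm; the round-robin expert layout, the single-neighbor shift window, and the greedy least-loaded policy are all assumptions) of how per-expert token loads might be placed on EP devices when each expert may be served by its home device or an adjacent one:

```python
# Hypothetical illustration of micro-batch-level expert shifting:
# each expert normally runs on its "home" device, but its tokens may be
# shifted to the neighboring device if that device is currently lighter.
# No expert parameters are migrated or replicated, matching the paper's
# stated constraint; everything else here is our own simplification.

def balance_loads(expert_tokens, num_devices):
    """Greedily place each expert's token load on the lighter of its
    home device (round-robin layout assumed) and the next neighbor.
    Returns the resulting per-device load."""
    device_load = [0] * num_devices
    for expert_id, tokens in enumerate(expert_tokens):
        home = expert_id % num_devices
        neighbor = (home + 1) % num_devices
        # Shift to the neighbor only when it is strictly lighter.
        target = home if device_load[home] <= device_load[neighbor] else neighbor
        device_load[target] += tokens
    return device_load

# With a hot expert 0, static placement would yield loads [100, 20];
# allowing a one-hop shift moves expert 2 off the hot device:
print(balance_loads([90, 10, 10, 10], num_devices=2))  # → [90, 30]
```

The real system would additionally apply expert dropping and a load-aware token router on top of such placement decisions; this fragment only shows why a one-device shift window already creates balancing opportunities that a fixed expert-to-device mapping lacks.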
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 11490