Abstract: As machine learning models scale in size and complexity, their computational requirements become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by selectively activating relevant experts. Despite this, MoE models are hindered by high communication overhead from all-to-all operations, low GPU utilization, and complications from heterogeneous GPU environments. This paper presents Comet, which optimizes both model deployment and all-to-all communication scheduling to address these challenges in MoE inference. Comet achieves minimal communication times by strategically ordering token transmissions in all-to-all communications. It improves GPU utilization by colocating experts from different models on the same device, avoiding the limitations of all-to-all communication. We analyze Comet’s optimization strategies theoretically across four common GPU cluster settings: exclusive vs. colocated models on GPUs, and homogeneous vs. heterogeneous GPUs. Comet provides optimal solutions for three cases, and for the remaining NP-hard scenario, it offers a polynomial-time sub-optimal solution with only a 1.09× degradation from the optimal, as shown in the simulation results. Comet is the first approach to minimize MoE inference time via optimal model deployment and communication scheduling across various scenarios. Evaluations demonstrate that Comet significantly accelerates inference, achieving speedups of up to 2.63× in homogeneous clusters and 2.91× in heterogeneous environments. Moreover, Comet enhances GPU utilization by up to 2.38× compared to existing methods.
DOI: 10.1109/ton.2025.3645806