Occult: Optimizing Collaborative Communications across Experts for Accelerated Parallel MoE Training and Inference
TL;DR: Optimizing all-to-all communication in expert parallelism using algorithm-system co-design
Abstract: Mixture-of-experts (MoE) architectures can achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, such communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over 40% of the runtime in large-scale training). In this paper, we first define $\textit{collaborative communication}$ to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, given a pair of experts co-activated by one token, we call them $\textit{collaborated}$, which comprises two cases, $\textit{intra-}$ and $\textit{inter-collaboration}$, depending on whether the two experts reside on the same device. Our pilot investigations reveal that increasing the proportion of intra-collaboration can accelerate expert parallelism at scale. This motivates us to strategically $\underline{\texttt{o}}$ptimize $\underline{\texttt{c}}$ollaborative $\underline{\texttt{c}}$omm$\underline{\texttt{u}}$nication for acce$\underline{\texttt{l}}$era$\underline{\texttt{t}}$ed MoE training and inference, dubbed $\textbf{\texttt{Occult}}$. Our designs can $\underline{either}$ deliver exact results with reduced communication cost $\underline{or}$ controllably minimize the cost via collaboration pruning, realized through modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that $\texttt{Occult}$ can be faster than popular state-of-the-art inference or training frameworks (over 50% speed-up across multiple tasks and models) with comparable or superior quality to standard fine-tuning. Code will be released upon acceptance.
Lay Summary: $\textbf{Motivation}$: Training and inference in MoE-based large language models (LLMs) face a critical bottleneck: expert parallelism—a distributed computing strategy—incurs heavy synchronization costs due to frequent "all-to-all" communication across devices. This communication alone accounts for a substantial portion of runtime, making its optimization essential for reducing latency and improving system throughput.
$\textbf{Key Insight}$: When multiple experts activated by the same token reside on the same device, transmitting redundant copies of that token becomes unnecessary. For example, if two co-activated experts are colocated, only one replica of the token needs to be transmitted, halving the data volume relative to current frameworks that naively send duplicates. In the ideal scenario where all $\textit{k}$ experts per token are colocated, communication volume drops by $\textbf{(k-1)/k}$ compared to standard top-$\textit{k}$ routing, promising substantial savings.
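A minimal sketch of how these savings arise (illustrative only, not the paper's implementation): given each token's top-$\textit{k}$ expert set and an assumed expert-to-device placement, a token only needs to be sent once per distinct remote device rather than once per remote expert.

```python
from collections import defaultdict

def comm_volume(routing, expert_to_device, source_device):
    """Count token copies sent off-device, with and without deduplication.

    routing: list of top-k expert-id lists, one per token held on `source_device`.
    expert_to_device: dict mapping expert id -> device id (assumed placement).
    """
    naive, dedup = 0, 0
    for experts in routing:
        remote = [expert_to_device[e] for e in experts
                  if expert_to_device[e] != source_device]
        naive += len(remote)       # one copy per remote expert (standard all-to-all)
        dedup += len(set(remote))  # one copy per distinct remote device (deduplicated)
    return naive, dedup

# Toy example: 4 experts on 2 devices, top-2 routing for tokens held on device 0.
placement = {0: 0, 1: 0, 2: 1, 3: 1}
tokens = [[0, 1], [2, 3], [1, 2]]
print(comm_volume(tokens, placement, source_device=0))  # (3, 2)
```

In the toy example, the token routed to experts 0 and 1 needs no communication at all (both experts are local), and the token routed to experts 2 and 3 is sent once instead of twice.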
$\textbf{Our Solution}$: We propose an algorithm-system co-design framework dubbed $\texttt{Occult}$ to exploit this insight:
1. $\textbf{Expert Placement Optimization}$: A novel algorithm that dynamically reschedules expert-device assignments to maximize colocation of co-activated experts (see the sketch after this list).
2. $\textbf{Communication-Aware Execution}$: A redesigned all-to-all communication scheme for expert parallelism that avoids redundant data transfers while preserving computational accuracy.
3. $\textbf{Customized Sparse MatMul Kernels}$: Sparse matmul kernels tailored to the optimized communication strategy.
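To make item 1 concrete, here is a hypothetical greedy heuristic (the function name and the heuristic itself are illustrative assumptions, not the paper's actual placement algorithm): count how often expert pairs are co-activated in recent routing traces, then try to place strongly collaborating pairs on the same device under a fixed per-device capacity.

```python
from collections import Counter
from itertools import combinations

def greedy_placement(routings, num_experts, num_devices):
    """Greedy expert-to-device assignment that favors colocating
    frequently co-activated expert pairs. Illustrative heuristic only."""
    capacity = num_experts // num_devices
    # 1. Count co-activation frequency for every expert pair.
    pair_counts = Counter()
    for experts in routings:
        for a, b in combinations(sorted(set(experts)), 2):
            pair_counts[(a, b)] += 1

    placement, load = {}, [0] * num_devices
    # 2. Visit pairs from most to least co-activated.
    for (a, b), _ in pair_counts.most_common():
        for e in (a, b):
            if e in placement:
                continue
            # Prefer the device hosting the partner expert, if it has room.
            partner = b if e == a else a
            target = placement.get(partner)
            if target is None or load[target] >= capacity:
                target = min(range(num_devices), key=lambda dev: load[dev])
            placement[e] = target
            load[target] += 1
    # 3. Place any expert never seen in a pair on the least-loaded device.
    for e in range(num_experts):
        if e not in placement:
            target = min(range(num_devices), key=lambda dev: load[dev])
            placement[e] = target
            load[target] += 1
    return placement

# Toy usage: 4 experts, 2 devices, top-2 routing traces.
print(greedy_placement([[0, 1], [0, 1], [2, 3]], num_experts=4, num_devices=2))
# e.g. {0: 0, 1: 0, 2: 1, 3: 1} -- frequently co-activated experts end up colocated
```

Once such a placement is in place, the communication-aware execution in item 2 can exploit it by sending each token only once per destination device.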
$\textbf{Impact}$: Our proposed $\texttt{Occult}$ significantly reduces synchronization overhead in distributed data center clusters, particularly in bandwidth-constrained environments. It directly improves the efficiency of training and serving MoE-based LLMs, addressing a critical challenge in deploying large-scale MoE models.
Link To Code: https://github.com/UNITES-Lab/Occult
Primary Area: Deep Learning->Large Language Models
Keywords: Mixture-of-Experts, Communication Efficiency, Expert Parallelism
Submission Number: 2697