GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference

ICLR 2026 Conference Submission 25328 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mixture of Experts, Large Language Model, Efficient Inference
TL;DR: We propose a co-optimization framework that reduces communication overhead and balances computational load across devices for efficient distributed SMoE inference.
Abstract: Sparse Mixture of Experts (SMoE) performs conditional computation by selectively activating a subset of experts, thereby enabling scalable parameter growth in large language models (LLMs). However, the expanded parameter scale exceeds the memory capacity of a single device, necessitating distributed deployment for inference. This setup introduces two critical challenges: (1) *Communication Issue*: Transferring features to devices with activated experts leads to significant communication overhead. (2) *Computational Load Issue*: Skewed expert activation overloads certain GPUs, resulting in load imbalance across devices. Of the two, communication overhead is the main bottleneck in SMoE inference. Nevertheless, reducing communication between devices may exacerbate load imbalance, leading to device idleness and resource waste. Therefore, we present **GRACE-MoE**, short for **G**rouping and **R**eplic**a**tion with Lo**c**ality-Awar**e** Routing for S**MoE** inference. **GRACE-MoE** is a co-optimization framework that jointly reduces communication overhead and alleviates computational load imbalance. Specifically, the framework comprises two key phases: ① *Grouping & Replication*: This phase groups experts based on their affinity to reduce cross-device communication. Additionally, dynamic replication is applied to address load skew, improving computational load balance across GPUs. ② *Routing*: This phase employs a locality-aware routing strategy with load prediction. It prioritizes local replicas to minimize communication overhead and balances requests across remote replicas when necessary. Experiments on diverse models and multi-node, multi-GPU environments demonstrate that **GRACE-MoE** substantially reduces end-to-end inference latency, achieving up to **3.79×** speedup over state-of-the-art systems. Code for **GRACE-MoE** will be released upon acceptance.
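The routing phase described in the abstract can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the authors' released code: the function names, the replica/load bookkeeping, and the capacity threshold are all illustrative assumptions. It shows the stated idea of prioritizing a local replica and falling back to the least-loaded remote replica when necessary.

```python
# Hypothetical sketch of locality-aware routing with a simple load check.
# All names and the capacity threshold are illustrative assumptions.

def route_token(expert_id, local_device, replicas, loads, capacity=4):
    """Pick a device hosting a replica of `expert_id`.

    replicas: dict mapping expert_id -> list of device ids holding a replica
    loads:    dict mapping device id -> current queued-token count
    """
    candidates = replicas[expert_id]
    # Prioritize a local replica to avoid any cross-device transfer,
    # unless the local device is already at capacity.
    if local_device in candidates and loads[local_device] < capacity:
        choice = local_device
    else:
        # Otherwise balance requests across replicas by current load.
        choice = min(candidates, key=lambda d: loads[d])
    loads[choice] += 1
    return choice

replicas = {7: [0, 2]}          # expert 7 replicated on devices 0 and 2
loads = {0: 3, 1: 0, 2: 1}
print(route_token(7, local_device=0, replicas=replicas, loads=loads))  # -> 0 (local, under capacity)
print(route_token(7, local_device=1, replicas=replicas, loads=loads))  # -> 2 (least-loaded remote)
```

In a real system the `loads` estimate would come from the load-prediction component the abstract mentions, rather than a simple counter.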
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 25328