Cluster Topology-Driven Placement of Experts Reduces Network Traffic in MoE Inference

ICLR 2026 Conference Submission 20913 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mixture of Experts, integer linear programming, inference, cluster network topology
TL;DR: Integer linear programming informed by expert usage statistics consistently provides communication-efficient expert placement for cluster-scale MoE inference.
Abstract: Efficient deployment of a pre-trained LLM to a cluster with multiple nodes is a critical step for providing fast responses to users' queries. The recent success of Mixture-of-Experts (MoE) LLMs raises the question of how to deploy them efficiently, considering their underlying structure. During inference in MoE LLMs, only a small subset of the experts is selected to process a given token. Moreover, in practice, the experts' load is highly imbalanced. For efficient deployment, one has to distribute the model across a large number of servers using a model placement algorithm. Thus, to improve cluster utilization, the placement algorithm has to take the network topology into account. This work focuses on efficient topology-aware placement of pre-trained MoE LLMs at the inference stage. We propose an integer linear program (ILP) that determines the optimal placement of experts, minimizing the expected number of transmissions. Due to its internal structure, this optimization problem can be solved with a standard ILP solver. We demonstrate that the ILP-based placement strategy yields lower network traffic than competing approaches for small-scale (DeepSeekMoE 16B) and large-scale (DeepSeek-R1 671B) models.
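For illustration only, the sketch below shows one way a topology-aware expert-placement ILP of this general kind could be set up with the open-source PuLP solver; it is not the authors' formulation. Binary variables assign experts to nodes, a capacity constraint bounds experts per node, and a linearized objective charges inter-node link costs weighted by expert co-activation counts. All identifiers and the toy data (experts, nodes, traffic, dist, cap) are assumptions introduced for this example.

```python
# Illustrative only: a toy topology-aware expert-placement ILP (not the paper's exact model).
# Requires: pip install pulp
import pulp

# --- Hypothetical inputs (assumptions for the example) ---
experts = [0, 1, 2, 3]          # expert ids
nodes = [0, 1]                  # cluster nodes
cap = 2                         # max experts per node
# traffic[e][e2]: assumed co-activation count between experts e and e2
traffic = {
    0: {1: 90, 2: 5, 3: 5},
    1: {2: 10, 3: 10},
    2: {3: 80},
}
# dist[n][n2]: per-transmission cost of the link between nodes (0 for the same node)
dist = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}

prob = pulp.LpProblem("expert_placement", pulp.LpMinimize)

# x[e][n] = 1 iff expert e is placed on node n
x = pulp.LpVariable.dicts("x", (experts, nodes), cat="Binary")
# y[(e, e2, n, n2)]: relaxed indicator that e is on n AND e2 is on n2
y = {}
for e, row in traffic.items():
    for e2 in row:
        for n in nodes:
            for n2 in nodes:
                y[(e, e2, n, n2)] = pulp.LpVariable(f"y_{e}_{e2}_{n}_{n2}", 0, 1)

# Objective: expected cross-node transmissions, weighted by link cost
prob += pulp.lpSum(
    traffic[e][e2] * dist[n][n2] * y[(e, e2, n, n2)]
    for (e, e2, n, n2) in y
)

# Each expert is placed on exactly one node
for e in experts:
    prob += pulp.lpSum(x[e][n] for n in nodes) == 1
# Node memory capacity
for n in nodes:
    prob += pulp.lpSum(x[e][n] for e in experts) <= cap
# Linearization: y >= x[e][n] + x[e2][n2] - 1 (tight at the optimum since we minimize)
for (e, e2, n, n2), var in y.items():
    prob += var >= x[e][n] + x[e2][n2] - 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for e in experts:
    for n in nodes:
        if pulp.value(x[e][n]) > 0.5:
            print(f"expert {e} -> node {n}")
```

In this toy instance the solver co-locates the heavily co-activated pairs {0, 1} and {2, 3}, leaving only the light cross-pair traffic on the inter-node link; the paper's actual objective, constraints, and traffic model may differ.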
Supplementary Material: zip
Primary Area: generative models
Submission Number: 20913