CAMoE: Cost-Aware Communication Optimization for Mixture-of-Experts Inference

03 Sept 2025 (modified: 17 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM Inference, Mixture-of-Experts
TL;DR: We propose CAMoE, a cost-aware method for reducing all-to-all communication latency during inference of large-scale Mixture-of-Experts (MoE) models.
Abstract: Mixture-of-Experts (MoE) is currently the most promising method for scaling the parameters of large language models. In an MoE architecture, each layer contains a set of experts, and a fixed number of top experts is selected dynamically for every token, based on the token's representation, during inference. Ideally, if all experts could be placed on the same device, token routing would incur no communication overhead. However, as MoE models grow toward trillion-scale parameter counts, the experts can no longer fit on a single device or even a single node, which significantly increases the tail latency of all-to-all communication: the tokens with the highest communication cost slow down the entire inference process. In this paper, we thoroughly analyze the all-to-all communication patterns of MoE inference and develop a profiler to measure heterogeneity between devices. Using parameters obtained from profiling runs, we implement a SystemC-based simulator that models all-to-all communication times. Based on detailed information about the transmitted data, we propose a cost-aware method designed to reduce tail latency during model inference. Experimental results demonstrate that this method does not affect model accuracy on downstream tasks and effectively reduces all-to-all communication time during inference.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 1452
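The abstract describes the overall pipeline (profile link heterogeneity, simulate the all-to-all, then reroute cost-aware) but not the concrete policy. The sketch below is only a minimal illustration of that general idea, not the paper's algorithm: the device count, expert placement, alpha-beta link model, message sizes, and the near-tie rerouting heuristic are all assumptions introduced here for demonstration. It estimates the tail (slowest-link) latency of the dispatch all-to-all and, when a token's last top-k expert and the runner-up score are nearly tied, keeps whichever candidate lands on the less-loaded outgoing link.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not from the paper): 4 devices, 8 experts, top-2 routing.
NUM_DEVICES, NUM_EXPERTS, TOP_K = 4, 8, 2
NUM_TOKENS, TOKEN_BYTES = 4096, 2 * 4096                 # 4k tokens, 4k-dim fp16 activations (assumed)
expert_to_device = np.arange(NUM_EXPERTS) % NUM_DEVICES  # round-robin expert placement (assumed)
LINK_ALPHA = 5e-6                                        # per-message latency in seconds (assumed)
LINK_BETA = 1.0 / 50e9                                   # seconds per byte, ~50 GB/s links (assumed)

# Router scores: one normalized distribution over experts per token.
scores = rng.random((NUM_TOKENS, NUM_EXPERTS))
scores /= scores.sum(axis=1, keepdims=True)
order = np.argsort(-scores, axis=1)                      # experts sorted by decreasing score
topk = order[:, :TOP_K]
src_device = rng.integers(0, NUM_DEVICES, size=NUM_TOKENS)  # device holding each token

def all_to_all_tail_latency(assignments):
    """Alpha-beta estimate of the slowest (tail) link in the dispatch all-to-all."""
    link_bytes = np.zeros((NUM_DEVICES, NUM_DEVICES))
    for tok in range(NUM_TOKENS):
        for expert in assignments[tok]:
            src, dst = src_device[tok], expert_to_device[expert]
            if src != dst:
                link_bytes[src, dst] += TOKEN_BYTES
    link_time = np.where(link_bytes > 0, LINK_ALPHA + LINK_BETA * link_bytes, 0.0)
    return link_time.max()

# Cost-aware adjustment (illustrative heuristic only): when a token's last top-k
# expert and the runner-up score nearly the same, keep whichever one sits on the
# cheaper outgoing link for this token's source device.
adjusted = topk.copy()
load = np.zeros((NUM_DEVICES, NUM_DEVICES))              # bytes already scheduled per link

def link_cost(src, dst):
    return 0.0 if src == dst else load[src, dst]

for tok in range(NUM_TOKENS):
    src = src_device[tok]
    best, alt = order[tok, TOP_K - 1], order[tok, TOP_K]
    choice = best
    if scores[tok, best] - scores[tok, alt] < 0.01:      # near-tie threshold (assumed)
        if link_cost(src, expert_to_device[alt]) < link_cost(src, expert_to_device[best]):
            choice = alt
    adjusted[tok, TOP_K - 1] = choice
    for expert in adjusted[tok]:                         # account for this token's traffic
        dst = expert_to_device[expert]
        if dst != src:
            load[src, dst] += TOKEN_BYTES

print(f"tail latency, plain top-k routing:   {all_to_all_tail_latency(topk) * 1e3:.3f} ms")
print(f"tail latency, cost-aware rerouting:  {all_to_all_tail_latency(adjusted) * 1e3:.3f} ms")
```

Restricting reroutes to near-tied candidates is what lets a heuristic of this kind claim negligible accuracy impact: tokens with a clear expert preference are never redirected, and only ambiguous routing decisions are resolved in favor of the cheaper link.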