ParaSMoE: Enabling Parallelism Hot-Switch for Large Mixture-of-Experts Models

ICLR 2026 Conference Submission13921 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mixture-of-Experts, Large Language Model, Infrastructure, Machine Learning System
Abstract: Mixture-of-Experts (MoE) models have been demonstrated to be an effective paradigm for scaling Large Language Model (LLM) parameters to hundreds of billions. A key consideration in MoE inference is the parallelism strategy, which defines how parameters are distributed across multiple GPUs and consequently dictates the communication pattern across GPUs during model inference. We make a key observation that the optimal parallelism configuration is highly dependent on workload characteristics, which are dynamic in practice, shaped by different latency requirements in serving and by the decreasing number of active sequences in the rollout phase of reinforcement learning (RL). We introduce ParaSMoE, which adapts the parallelism strategy to the workload. Its core is an efficient "hot-switch" mechanism that seamlessly transitions between Expert Parallelism (EP) and Tensor Parallelism (TP), enabling it to dynamically select the optimal parallelism for any given workload. Through elaborate multi-level communication overlapping, our experiments show that ParaSMoE can convert the Qwen3-235B MoE model from EP to multiple TP instances in 0.7 seconds, with negligible memory overhead. We further project its potential to speed up batch generation in the RL rollout phase by 1.4–3.7$\times$.
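To make the EP-to-TP transition concrete, below is a minimal, single-process sketch of the re-sharding step that such a hot-switch must perform: under EP each rank owns whole experts, while under TP every rank in a TP instance holds a column slice of every expert's weight. This is an illustrative toy (the function name `ep_to_tp_reshard`, the data layout, and the local-copy stand-in for collective communication are all assumptions, not the authors' implementation).

```python
import numpy as np

def ep_to_tp_reshard(ep_shards, tp_size):
    """Toy re-shard: EP layout (each rank owns whole experts) ->
    TP layout (each rank holds a column slice of every expert's weight).

    ep_shards: list over ranks; ep_shards[r] maps expert_id -> weight
               ndarray of shape (d_in, d_out).
    Returns:   list over ranks; result[r] maps expert_id -> column slice
               of shape (d_in, d_out // tp_size).
    """
    world_size = len(ep_shards)
    assert world_size % tp_size == 0
    # Gather the global expert table (who owns what under EP).
    all_experts = {eid: w for shard in ep_shards for eid, w in shard.items()}

    tp_shards = []
    for rank in range(world_size):
        tp_rank = rank % tp_size  # position inside its TP instance
        shard = {}
        for eid, w in all_experts.items():
            cols = w.shape[1] // tp_size
            # In a real system this slice would arrive via collective
            # communication (e.g., all-to-all), overlapped with compute,
            # rather than a local copy as done here for illustration.
            shard[eid] = w[:, tp_rank * cols:(tp_rank + 1) * cols].copy()
        tp_shards.append(shard)
    return tp_shards

# Tiny example: 4 ranks, 8 experts under EP (2 experts per rank),
# re-sharded into 2 TP instances of 2 ranks each.
rng = np.random.default_rng(0)
d_in, d_out, world = 16, 32, 4
ep = [{eid: rng.standard_normal((d_in, d_out))
       for eid in range(r * 2, r * 2 + 2)} for r in range(world)]
tp = ep_to_tp_reshard(ep, tp_size=2)
print(len(tp[0]), tp[0][0].shape)  # 8 experts per rank, each (16, 16)
```

In an actual serving or RL rollout system, the interesting part is hiding the cost of this data movement behind ongoing computation (the multi-level communication overlapping the abstract refers to); the sketch only shows the target layout transformation.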
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 13921