FlashMoE: Fast Distributed MoE in a Single Kernel

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Distributed Machine Learning, Kernel Fusion, Mixture-of-Experts
TL;DR: FlashMoE is the first work to completely fuse the Distributed Mixture-of-Experts operator into a single persistent GPU kernel with elegant actor-style concurrency and high-throughput tile parallelism, achieving up to 9× higher GPU utilization.
Abstract: The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of the dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashMoE replaces bulk-synchronous collectives with one-sided, device-initiated, inter-GPU (R)DMA transfers, thereby improving payload efficiency by eliminating bloated or redundant network payloads in sparsely activated layers. When evaluated on an 8-H100 GPU node with MoE models having up to 128 experts and 16K-token sequences, FlashMoE achieves up to 9× higher GPU utilization, 6× lower latency, 5.7× higher throughput, and 4× better overlap efficiency than state-of-the-art baselines, despite using FP32 while the baselines use FP16. FlashMoE shows that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML. We provide code at https://github.com/osayamenja/FlashMoE.
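
Illustrative sketch (not from the paper): the persistent-kernel idea described in the abstract can be pictured as a single launch in which each thread block acts as an actor that repeatedly claims tile-sized expert tasks from a device-side queue, computes on them, and signals completion, with no host round trips between tiles. The names below (persistent_moe_kernel, TileTask, TILE, HIDDEN) are hypothetical, and the device-initiated (R)DMA that FlashMoE uses for the dispatch and combine phases is reduced here to a comment; this is a minimal single-GPU CUDA C++ sketch of the pattern under those assumptions, not the authors' implementation.

    // Hypothetical sketch: one persistent kernel; thread blocks pull tile tasks
    // from a global queue (actor-style), compute, and mark completion on device.
    #include <cuda_runtime.h>

    constexpr int TILE   = 128;  // tokens per tile (illustrative)
    constexpr int HIDDEN = 64;   // hidden dimension (illustrative)

    struct TileTask {            // one unit of expert work (hypothetical layout)
        const float* tokens;     // input tile  [TILE x HIDDEN]
        const float* expert_w;   // expert weights [HIDDEN x HIDDEN]
        float*       out;        // output tile [TILE x HIDDEN]
    };

    __global__ void persistent_moe_kernel(const TileTask* tasks, int num_tasks,
                                          int* next_task, int* done_count) {
        __shared__ int task_id;
        for (;;) {
            // One thread per block claims the next tile task from the queue.
            if (threadIdx.x == 0) task_id = atomicAdd(next_task, 1);
            __syncthreads();
            if (task_id >= num_tasks) return;   // queue drained: block retires
            TileTask t = tasks[task_id];

            // Compute phase: naive tile GEMM, out = tokens * expert_w.
            for (int i = threadIdx.x; i < TILE * HIDDEN; i += blockDim.x) {
                int row = i / HIDDEN, col = i % HIDDEN;
                float acc = 0.f;
                for (int k = 0; k < HIDDEN; ++k)
                    acc += t.tokens[row * HIDDEN + k] * t.expert_w[k * HIDDEN + col];
                t.out[i] = acc;
            }
            __syncthreads();  // all threads done before claiming the next task

            // Combine phase placeholder: here the real system would push the
            // result tile to its destination GPU with a one-sided, device-
            // initiated put instead of merely counting completions.
            if (threadIdx.x == 0) atomicAdd(done_count, 1);
        }
    }

Such a kernel would be launched once with a grid sized to the number of resident blocks the SMs can hold (e.g., via cudaOccupancyMaxActiveBlocksPerMultiprocessor), so all scheduling of dispatch, compute, and combine stays on the device for the lifetime of the layer.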
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 26277