Keywords: Mixture-of-Experts, Block-sparse kernels, MegaBlocks, Triton
Abstract: MegaBlocks is the state-of-the-art system for efficient training of MoE models based on block-sparse matrix multiplication kernels. The library is currently restricted to a specific block size for its sparse matrices, a specific data type, and a specific GPU architecture. This is due to the CUDA kernels used for the block-sparse matrix products in the MoE layers, which have been hand-tuned and manually optimized to obtain the highest performance for that specific choice of parameters.
In this work, we evaluate rewriting these kernels in Triton, a Python-embedded domain-specific language (DSL) for writing high-performance GPU kernels. We show that it is possible to achieve the same level of performance as the hand-tuned CUDA kernels, while maintaining portability across GPU architectures and easily supporting different block sizes and data types without any code changes. We identify the challenges and advantages of using Triton to implement these block-sparse matrix multiplication kernels.
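To illustrate why Triton makes block size and data type configurable without code changes, the sketch below shows a minimal tiled matmul kernel in which the tile size is a compile-time `tl.constexpr` parameter and the output data type is taken from the output pointer. This is only an illustrative sketch, not the MegaBlocks block-sparse kernels; the kernel and helper names are hypothetical.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def tiled_matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK: tl.constexpr,  # tile size: changing it needs no source edits
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Row/column offsets of the output tile this program instance computes.
    offs_m = pid_m * BLOCK + tl.arange(0, BLOCK)
    offs_n = pid_n * BLOCK + tl.arange(0, BLOCK)
    offs_k = tl.arange(0, BLOCK)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK, BLOCK), dtype=tl.float32)
    for k in range(0, K, BLOCK):
        a = tl.load(a_ptrs,
                    mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K),
                    other=0.0)
        b = tl.load(b_ptrs,
                    mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N),
                    other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK * stride_ak
        b_ptrs += BLOCK * stride_bk

    # Cast the accumulator to the output dtype, so fp16/bf16/fp32 all work.
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(c_ptr.dtype.element_ty),
             mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))


def matmul(a, b, block=64):
    # Hypothetical launcher: block size and dtype are runtime choices.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=a.dtype)
    grid = (triton.cdiv(M, block), triton.cdiv(N, block))
    tiled_matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK=block,
    )
    return c
```

In the same spirit, a block-sparse kernel would iterate only over the nonzero blocks of the sparse operand, with the block size again passed as a `tl.constexpr`, which is what allows a single Triton source to cover the configurations that previously required separately hand-tuned CUDA kernels.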
Submission Number: 49