Research Area: Engineering for large LMs
Keywords: triton, gpu, moe, mixture of experts, sparse mixture of experts
TL;DR: Triton-based implementation of Sparse Mixture of Experts without padded copies.
Abstract: ScatterMoE is an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs. ScatterMoE builds upon techniques in existing implementations and overcomes some of their current limitations to improve batched inference, training speed, and memory footprint. It achieves this by avoiding padding and excessive copying of the input. We also fuse the expert linear transforms and reordering operations with ParallelLinear, a module that can be used to extend the concept of SMoEs. We benchmark our implementation against Megablocks and show that it enables higher throughput and a lower memory footprint. We also show how ParallelLinear enables extensions of the Mixture-of-Experts concept, which we demonstrate with an implementation of Mixture-of-Attention.
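To illustrate the idea the abstract describes (grouping tokens by expert and applying per-expert linear transforms without padded copies), here is a minimal plain-PyTorch sketch. It is a conceptual reference only, not the paper's fused Triton kernels or ScatterMoE's API; the function name `scatter_group_linear` and its signature are hypothetical.

```python
# Conceptual sketch (plain PyTorch, NOT the ScatterMoE Triton kernels):
# sort tokens into contiguous per-expert groups, run one GEMM per expert
# on its slice, then scatter results back to the original token order.
# No padding and no duplicated copies of the input beyond the grouping permute.
import torch

def scatter_group_linear(x, expert_idx, expert_weights):
    """x: (tokens, d_in); expert_idx: (tokens,); expert_weights: (E, d_in, d_out)."""
    order = torch.argsort(expert_idx)                  # grouping permutation
    grouped = x[order]                                 # tokens contiguous per expert
    counts = torch.bincount(expert_idx, minlength=expert_weights.shape[0])
    out = torch.empty(x.shape[0], expert_weights.shape[2],
                      dtype=x.dtype, device=x.device)
    start = 0
    for e, n in enumerate(counts.tolist()):            # one GEMM per expert, no padding
        if n:
            out[start:start + n] = grouped[start:start + n] @ expert_weights[e]
        start += n
    inverse = torch.empty_like(order)
    inverse[order] = torch.arange(order.numel(), device=order.device)
    return out[inverse]                                # restore original token order

# toy usage
x = torch.randn(8, 4)
idx = torch.randint(0, 2, (8,))
W = torch.randn(2, 4, 16)
print(scatter_group_linear(x, idx, W).shape)  # torch.Size([8, 16])
```

In the actual implementation, the grouping/scattering and the expert linear transforms are fused into single kernels (the ParallelLinear module), so the intermediate grouped copy in this sketch is avoided as well.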
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 322