MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

ICLR 2024 Workshop ME-FoMo, Submission 40

Published: 04 Mar 2024 · Last Modified: 05 May 2024 · ME-FoMo 2024 Poster · CC BY 4.0
Keywords: Mamba, LLM, Mixture of Experts, MoE, conditional computation, SSM
TL;DR: Introducing MoE-Mamba, combining benefits of Mixture of Experts and SSMs.
Abstract: State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the scaling potential of SSMs, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms Mamba and matches the performance of Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35\times$ *fewer training steps* while preserving the inference performance gains of Mamba over the Transformer.
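The abstract describes interleaving Mamba blocks with Switch-style MoE feed-forward layers. Below is a minimal, illustrative PyTorch sketch of that interleaving, not the authors' implementation: the `MambaBlock` sequence-mixing step is a placeholder (the selective SSM kernel is not reproduced here), and names such as `MoEFeedForward` and parameters like `num_experts` are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Illustrative Switch-style top-1 token-choice MoE feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for per-token routing
        tokens = x.reshape(-1, x.shape[-1])
        gate = F.softmax(self.router(tokens), dim=-1)   # routing probabilities
        weight, expert_idx = gate.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale by the gate value so the router receives gradients.
                out[mask] = weight[mask, None] * expert(tokens[mask])
        return out.reshape_as(x)


class MoEMambaBlock(nn.Module):
    """One layer of the sketched architecture: a sequence-mixing (Mamba-style)
    step followed by an MoE feed-forward step, each with pre-norm and residual."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # Placeholder for the Mamba selective-SSM block; in practice one would
        # swap in a real Mamba layer (e.g. from the mamba_ssm package).
        self.mamba = nn.Linear(d_model, d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe = MoEFeedForward(d_model, d_ff, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mamba(self.norm1(x))
        x = x + self.moe(self.norm2(x))
        return x


# Usage: stack blocks to form the model trunk.
model = nn.Sequential(*[MoEMambaBlock(d_model=256, d_ff=1024) for _ in range(4)])
x = torch.randn(2, 128, 256)
print(model(x).shape)  # torch.Size([2, 128, 256])
```

The routing loop above trades speed for readability; production MoE layers typically use batched dispatch and an auxiliary load-balancing loss, which are omitted in this sketch.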
Submission Number: 40