On the Benefits of Learning to Route in Mixture-of-Experts Models

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Main
Submission Type: Regular Long Paper
Submission Track: Theme Track: Large Language Models and the Future of NLP
Submission Track 2: Machine Learning for NLP
Keywords: mixture-of-experts, transformer, router, efficiency, conditional compute, sparsely activated models, theory
TL;DR: We study the role of the router in Mixture-of-Experts models and show empirical and theoretical evidence that (i) a learnable router is better than a non-trainable one, and (ii) the router can learn to discover latent cluster structure.
Abstract: Mixture-of-Experts (MoE) Transformer models, such as the Switch Transformer, allow us to successfully scale up model sizes while keeping the amount of compute fixed. Prior work has established the computational efficiency benefits of using these models. A core component of these models is a router that routes input tokens to different experts in a layer. We show theoretical and empirical evidence that the router's ability to route tokens intelligently confers a significant advantage to MoE models. We study synthetic settings where the input data is distributed in clusters and show theoretically and empirically that the router learns to route the inputs according to these clusters. We then perform experiments on real data using the T5X library, where we observe that a trainable router confers a non-trivial benefit over a non-trainable router.
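To make the routing mechanism concrete, the sketch below contrasts a learnable top-1 router of the kind used in the Switch Transformer with a non-trainable baseline that assigns experts by hashing token ids. This is a minimal illustration, not the authors' implementation: the function names, the hash-based baseline, and all shapes and hyperparameters are assumptions made for exposition.

```python
# Minimal sketch (illustrative, not the paper's code) of a learnable top-1 router
# versus a non-trainable hash-based router. Written in JAX, matching the JAX-based
# T5X library mentioned in the abstract; all names here are hypothetical.
import jax
import jax.numpy as jnp


def learnable_top1_router(params, tokens):
    """Route each token to the expert with the highest router logit.

    params["w_router"]: (d_model, n_experts) trainable routing weights.
    tokens: (n_tokens, d_model) token representations.
    Returns the chosen expert index per token and the softmax gate value
    used to scale that expert's output.
    """
    logits = tokens @ params["w_router"]            # (n_tokens, n_experts)
    probs = jax.nn.softmax(logits, axis=-1)
    expert_idx = jnp.argmax(probs, axis=-1)         # top-1 expert per token
    gate = jnp.take_along_axis(probs, expert_idx[:, None], axis=-1)[:, 0]
    return expert_idx, gate


def fixed_hash_router(token_ids, n_experts):
    """Non-trainable baseline: assign each token to an expert by its id."""
    return token_ids % n_experts


# Illustrative usage with random data.
key = jax.random.PRNGKey(0)
d_model, n_experts, n_tokens = 16, 4, 8
params = {"w_router": 0.02 * jax.random.normal(key, (d_model, n_experts))}
tokens = jax.random.normal(key, (n_tokens, d_model))
token_ids = jnp.arange(n_tokens)

print(learnable_top1_router(params, tokens))
print(fixed_hash_router(token_ids, n_experts))
```

The key difference is that `w_router` receives gradients during training, so the learned assignment can align with latent cluster structure in the inputs, whereas the hash-based assignment is fixed regardless of the data.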
Submission Number: 1124