Towards an empirical understanding of Mixture of Experts Design Choices

ICLR 2024 Workshop ME-FoMo Submission 62 Authors

Published: 04 Mar 2024, Last Modified: 05 May 2024, ME-FoMo 2024 Poster, CC BY 4.0
Keywords: Mixture of Experts, Expert Specialization, Routing Mechanism
TL;DR: This study delves into the training and design choices of Mixture of Experts (MoEs), focusing on their impact on model performance and expert specialization.
Abstract: In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoEs) on validation performance, uncovering distinct influences when routing at the token level versus the sequence level. We also present empirical evidence that a learned router performs comparably to a frozen, randomly initialized router, suggesting that learned routing may not be essential. Our study further reveals that sequence-level routing can result in weak, topic-specific expert specialization, in contrast to the syntax specialization observed with token-level routing. This topic-level specialization is independent of the languages used.
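To make the two routing granularities and the frozen-router setting concrete, below is a minimal sketch (not the authors' implementation) of a top-1 MoE layer in PyTorch. The module names, shapes, and the mean-pooling used for sequence-level routing are illustrative assumptions only.

```python
# Minimal sketch of token-level vs. sequence-level top-1 routing with an
# optionally frozen, randomly initialized router. Illustrative assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=4, route_level="token", learn_router=True):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.route_level = route_level
        if not learn_router:
            # Frozen, randomly initialized router: weights never receive gradients.
            for p in self.router.parameters():
                p.requires_grad_(False)

    def forward(self, x):  # x: (batch, seq, d_model)
        if self.route_level == "token":
            logits = self.router(x)  # one routing decision per token
        else:
            # Sequence-level: one decision per sequence (here via mean-pooled tokens),
            # broadcast to every position.
            logits = self.router(x.mean(dim=1, keepdim=True)).expand(-1, x.size(1), -1)
        top1 = logits.argmax(dim=-1)                               # hard top-1 expert choice
        gate = F.softmax(logits, dim=-1).gather(-1, top1.unsqueeze(-1))
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top1 == i).unsqueeze(-1)
            out = out + mask * expert(x)
        return gate * out


x = torch.randn(2, 8, 64)
print(TinyMoE(route_level="sequence", learn_router=False)(x).shape)  # torch.Size([2, 8, 64])
```

In this sketch, switching `route_level` between "token" and "sequence" is the only difference between the two regimes, and setting `learn_router=False` keeps the router at its random initialization throughout training, mirroring the learned-vs-frozen comparison described in the abstract.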
Submission Number: 62