MOESART: An Effective Sampling-based Router for Sparse Mixture of Experts

15 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Sparse mixture of experts, Routing in neural networks, Conditional computation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a novel sampling-based approach for conditional computation in Sparse MoEs, leading to improved routing in standard image, recommendation and natural language processing tasks.
Abstract: The sparse Mixture-of-Experts (Sparse-MoE) is a promising framework for efficiently scaling up model capacity. This framework consists of a set of experts (subnetworks) and one or more routers. The routers activate only a small subset of the experts on a per-example basis, which can save on resources. Among the most widely used sparse routers are Top-k and its variants, which activate k experts for each example during training. While very effective at model scaling, these routers are prone to performance issues because of the discontinuous nature of the routing problem. Differentiable routers have been shown to mitigate the performance issues of Top-k, but these are not k-sparse during training, which limits their utility. To address this challenge, we propose MOESART: a novel k-sparse routing approach, which maintains k-sparsity during both training and inference. Unlike existing routers, MOESART aims to learn a good k-sparse approximation of the classical softmax router. We achieve this through carefully designed sampling and expert weighting strategies. We compare MOESART with state-of-the-art MoE routers through large-scale experiments on 14 datasets from various domains, including recommender systems, vision, and natural language processing. MOESART achieves up to 16% (relative) reduction in out-of-sample loss on standard image datasets, and up to 15% (relative) improvement in AUC on standard recommender systems, over popular k-sparse routers, e.g., Top-k, V-MoE, Expert Choice Router, and X-MoE. Moreover, for distilling natural language processing models, MOESART can improve predictive performance by 0.5% (absolute) on average over the Top-k router across 7 GLUE and 2 SQuAD benchmarks.
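To make the general idea concrete, below is a minimal, illustrative sketch of sampling-based k-sparse routing: experts are sampled (without replacement) in proportion to the dense softmax gate, and their weights are renormalized on the sampled support so the router stays k-sparse at both training and inference time. The function name and the specific sampling/weighting scheme shown here are assumptions for illustration only and do not reproduce MOESART's exact strategies.

```python
# Illustrative sketch of a sampling-based k-sparse router (not the paper's
# exact algorithm): sample k experts from the softmax gate and renormalize.
import numpy as np

def sample_k_sparse_router(logits, k, rng=None):
    """Return (expert_indices, sparse_weights) for a single example.

    logits : (num_experts,) router scores for one example.
    k      : number of experts to activate.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # dense softmax gate
    # Sample k distinct experts with probability proportional to the gate.
    idx = rng.choice(len(probs), size=k, replace=False, p=probs)
    weights = probs[idx] / probs[idx].sum()   # renormalize on the sampled support
    return idx, weights

# Example: route one token over 8 experts, activating k=2 of them.
logits = np.array([0.3, 1.2, -0.5, 0.8, 0.0, 2.1, -1.0, 0.4])
experts, weights = sample_k_sparse_router(logits, k=2)
print(experts, weights)   # e.g. [5 1] [0.71 0.29]
```

A dense softmax router would combine all experts, while a Top-k router deterministically picks the k largest gates; the sketch above instead samples the active set, which keeps k-sparsity while still letting lower-probability experts be explored during training.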
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 449