Sparse MoEs meet Efficient Ensembles

Published: 28 Jan 2022, Last Modified: 22 Oct 2023, ICLR 2022 Submitted, Readers: Everyone
Keywords: Ensembles, Sparse MoEs, Robustness, Uncertainty Calibration, OOD detection, Efficient Ensembles, Large scale, Computer vision
Abstract: Machine learning models based on the aggregated outputs of submodels, either at the activation or prediction level, achieve strong performance. We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs). First, we show that these two approaches have complementary features whose combination is beneficial. Then, we present partitioned batch ensembles, an efficient ensemble of sparse MoEs that takes the best of both classes of models. Extensive experiments on fine-tuned vision transformers demonstrate the accuracy, log-likelihood, few-shot learning, robustness, and uncertainty calibration improvements of our approach over several challenging baselines. Partitioned batch ensembles not only scale to models with up to 2.7B parameters, but also provide larger performance gains for larger models.
One-sentence Summary: We analyse and combine sparse MoE models with ensembles to better understand the interplay between these two kinds of models, resulting in a new algorithm that provides the best of both worlds for vision transformers with up to 2.7B parameters.
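
For intuition, the sketch below illustrates one way the "efficient ensemble of sparse MoEs" idea could look in code: the experts of a single MoE layer are split into disjoint groups, one group per ensemble member; each member routes tokens with top-1 gating over its own experts, and the member outputs are averaged. This is a minimal, hypothetical JAX sketch and not the paper's implementation; the function names, the round-robin expert partition, the top-1 routing, and the dense (non-dispatched) expert computation are all assumptions made for clarity.

```python
# Hypothetical sketch (not the authors' code): a sparse-MoE layer whose experts
# are partitioned into disjoint groups, one group per ensemble member.
import jax
import jax.numpy as jnp

def init_params(key, d_model, d_hidden, num_experts):
    k1, k2, k3 = jax.random.split(key, 3)
    return {
        "w_gate": jax.random.normal(k1, (num_experts, d_model)) * 0.02,
        "w_in":   jax.random.normal(k2, (num_experts, d_model, d_hidden)) * 0.02,
        "w_out":  jax.random.normal(k3, (num_experts, d_hidden, d_model)) * 0.02,
    }

def moe_member(params, x, expert_ids):
    # Top-1 routing restricted to this member's experts. All experts are evaluated
    # densely here for clarity; a real sparse MoE would dispatch tokens instead.
    logits = x @ params["w_gate"][expert_ids].T                    # [tokens, experts_per_member]
    probs = jax.nn.softmax(logits, axis=-1)
    top = jnp.argmax(probs, axis=-1)                               # [tokens]
    hidden = jnp.einsum("td,edh->teh", x, params["w_in"][expert_ids])
    outs = jnp.einsum("teh,ehd->ted", jax.nn.gelu(hidden), params["w_out"][expert_ids])
    gate = jnp.take_along_axis(probs, top[:, None], axis=-1)       # [tokens, 1]
    picked = jnp.take_along_axis(outs, top[:, None, None], axis=1)[:, 0]  # [tokens, d_model]
    return gate * picked

def partitioned_moe(params, x, num_members, num_experts):
    # Round-robin partition (an assumption): member m owns experts m, m+M, m+2M, ...
    groups = [jnp.arange(m, num_experts, num_members) for m in range(num_members)]
    member_outs = [moe_member(params, x, g) for g in groups]
    return jnp.mean(jnp.stack(member_outs), axis=0)                # average the members

key = jax.random.PRNGKey(0)
params = init_params(key, d_model=64, d_hidden=128, num_experts=8)
tokens = jax.random.normal(key, (16, 64))
y = partitioned_moe(params, tokens, num_members=2, num_experts=8)
print(y.shape)  # (16, 64)
```

Because the partition is disjoint, each ensemble member activates only its own small subset of experts; in a sparse implementation with token dispatching, the ensemble would therefore stay within roughly the compute budget of a single sparse MoE of the same total size.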
Community Implementations: [3 code implementations](https://www.catalyzex.com/paper/arxiv:2110.03360/code)